All of abramdemski's Comments + Replies

It's imaginable to do this work but not remember any of it, i.e. avoid having that work leave traces that can accumulate, but that seems like a delicate, probably unnatural carving.

Is the implication here that modern NNs don't do this? My own tendency would be to think that they are doing a lot of this -- doing a bunch of reasoning which gets thrown away rather than saved. So it seems like modern NNs have simply managed to hit this delicate unnatural carving. (Which in turn suggests that it is not so delicate, and even, not so unnatural.)

1 Tsvi Benson-Tilsen 4mo
Yes, I think there's stuff that humans do that's crucial for what makes us smart, that we have to do in order to perform some language tasks, and that the LLM doesn't do when you ask it to do those tasks, even when it performs well in the local-behavior sense.

Attempting to write out the holes in my model. 

  • You point out that looking for a perfect reward function is too hard; optimization searches for upward errors in the rewards to exploit. But you then propose an RL scheme. It seems to me like it's still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.
  • It seems like you have a few tools to combat this form of critique:
    • Model capacity. If the policy that exploits the upward errors is too complex to fit in the model, it cannot be... (read more)
2 Alex Turner 4mo
(Huh, I never saw this -- maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.) I really appreciate these thoughts! I would say "that isn't how on-policy RL works; it doesn't just intelligently find increasingly high-reinforcement policies; which reinforcement events get 'exploited' depends on the exploration policy." (You seem to guess that this is my response in the next sub-bullets.) shrug, too good to be true isn't a causal reason for it to not work, of course, and I don't see something suspicious in the correlations. Effective learning algorithms may indeed have nice properties we want, especially if some humans have those same nice properties due to their own effective learning algorithms! 

A good question. I've never seen it happen myself; so where I'm standing, it looks like short emergence examples are cherry-picked.

What report is the image pulled from?

I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can't keep dropping in probability forever, since evidence is lost. The probability only becomes small - but this means if you run for long enough you do in fact expect the transition.

LLMs are high-order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence eventually drops out of memory, the probability of waluigi becomes very small instead of dropping to zero. This makes an eventual waluigi transition inevitable, as claimed in the post.

You're correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.

But let me be more pedantic and less dramatic than I was in the article — the waluigi transitions aren't inevitable. The waluigi are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which the luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.
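
To make the "other approximately-absorbing classes" point concrete, here is a minimal sketch of a three-state chain. The state names, the transition probabilities, and the simplification that both classes are exactly (rather than approximately) absorbing are assumptions chosen for illustration, not measurements of any model.

```python
# Toy three-state Markov chain. All numbers are hypothetical.
# "luigi" can fall into either of two absorbing classes.
TRANSITIONS = {
    "luigi": {"luigi": 0.96, "waluigi": 0.03, "mode_collapse": 0.01},
    "waluigi": {"waluigi": 1.0},
    "mode_collapse": {"mode_collapse": 1.0},
}

def step(dist, transitions):
    """Push a probability distribution over states through one transition."""
    new_dist = {state: 0.0 for state in transitions}
    for state, mass in dist.items():
        for nxt, p in transitions[state].items():
            new_dist[nxt] += mass * p
    return new_dist

def long_run(start, transitions, n_steps=2000):
    """Distribution after n_steps, starting with all mass on `start`."""
    dist = {state: 0.0 for state in transitions}
    dist[start] = 1.0
    for _ in range(n_steps):
        dist = step(dist, transitions)
    return dist

final = long_run("luigi", TRANSITIONS)
```

With these made-up numbers the luigi mass decays away, but only 0.03 / (0.03 + 0.01) = 75% of it ends up in the waluigi class; the rest is absorbed by mode collapse. Leaving luigi is near-certain in the long run; arriving at waluigi specifically is not.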

I disagree. The crux of the matter is the limited memory of an LLM. If the LLM had unlimited memory, then every Luigi act would further accumulate a little evidence against Waluigi. But because LLMs can only update on so much context, the probability only drops to some small value instead of continuing to drop toward zero. This makes waluigi inevitable in the long run.
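
The arithmetic behind "inevitable in the long run" can be sketched as follows, under one simplifying assumption that is not in the comment itself: at every step, conditional on no transition so far, the transition probability is at least some fixed floor. The floor value below is made up.

```python
def prob_transition_within(n_steps, floor):
    """Lower bound on P(at least one transition within n_steps), given that
    each step's transition probability (conditional on no transition so far)
    is at least `floor`."""
    return 1.0 - (1.0 - floor) ** n_steps

HYPOTHETICAL_FLOOR = 1e-4  # illustrative only, not an estimate for any LLM

for n in (1_000, 10_000, 100_000):
    print(n, round(prob_transition_within(n, HYPOTHETICAL_FLOOR), 4))
```

The bound tends to 1 as the number of steps grows, however small the floor. If instead the per-step probability kept shrinking fast enough, as it could with unlimited memory, the chance of never transitioning need not go to zero.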

I agree. Though is it just the limited context window that causes the effect? I may be mistaken, but from memory it seems like they emerge sooner than you would expect if this were the only reason (given the size of GPT-3's context window).

So to see if I have this right, the difference is I'm trying to point at a larger phenomenon and you mean teleosemantics to point just at the way beliefs get constrained to be useful.

This doesn't sound quite right to me. Teleosemantics is a purported definition of belief. So according to the teleosemantic picture, it isn't a belief if it's not trying to accurately reflect something. 

The additional statement I prefaced this with, that accuracy is an instrumentally convergent subgoal, was intended to be an explanation of why this sort of "belief" is a c... (read more)

FIAT (by another name) was previously proposed in the book On Intelligence. The version there had a somewhat predictive-processing-like story where the cortex makes plans by prediction alone; so reflective agency (really meaning: agency arising from the cortex) is entirely dependent on building a self-model which predicts agency. Other parts of the brain are responsible for the reflexes which provide the initial data which the self-model gets built on (similar to your story).

The continuing kick toward higher degrees of agency comes from parts of the brain ... (read more)

1 Tsvi Benson-Tilsen 7mo
I don't recall seeing that theory in the first quarter of the book, but I'll look for it later. I somewhat agree with your description of the difference between the theories (at least, as I imagine a predictive processing flavored version). Except, the theories are more similar than you say, in that FIAT would also allow very partial coherentifying, so that it doesn't have to be "follow these goals, but allow these overrides", but can rather be, "make these corrections towards coherence; fill in the free parameters with FIAT goals; leave all the other incoherent behavior the way it is". A difference between the theories (though I don't feel I can pass the PP ITT) is that FIAT allows, you know, agency, as in, non-myopic goal pursuit based on coherent-world-model-building, whereas PP maybe strongly hints against that? I'm confused by this; are these supposed to be mutually exclusive? What's "their own goals"? [After thinking more: Oh like you're saying, here's what it would look like to have a goal that can't be explained as a FIAT goal? I'll assume that in the rest of this comment.] Agreed. I'm not sure I buy that it can't be inferred, even the first time. Maybe you have fairly built-in instincts that aren't about the whole courtship thing, but cause you to feel good when you're around someone. So you seek being around them, and pay attention to them. You try to get them interested in being around you. This builds up the picture of a goal of being together for a long time. (This is a pretty poor explanation as stated; if this explanation works, why wouldn't you just randomly fall in love with anyone you do a favor for? But this is why it's at least plausible to me that the behavior could come from a FIAT-like thing. And maybe that's actually the case with homosexual intercourse in the 1800s.) Maybe courtship is especially much like this, but in general things sort-of-well-explainable as imitation seem like admissible falsifications of FIAT, e.g. if there are also

One thing I see as different between your perspective and (my understanding of) teleosemantics, so far:

You make a general case that values underlie beliefs.

Teleosemantics makes a specific claim that the meaning of semantic constructs (such as beliefs and messages) is pinned down by what they are trying to correspond to.

Your picture seems very compatible with, EG, the old LW claim that UDT's probabilities are really a measure of caring - how much you care about doing well in a variety of scenarios. 

Teleosemantics might fail to analyze such probabilities a... (read more)

1 G Gordon Worley III 7mo
I think this is exactly right. I often say things like "accurate maps are extremely useful to things like survival, so you and every other living thing have strong incentives to draw accurate maps, but this is contingent on the extent to which you care about e.g. survival". So to see if I have this right, the difference is I'm trying to point at a larger phenomenon and you mean teleosemantics to point just at the way beliefs get constrained to be useful.

OK. So far it seems to me like we share a similar overall take, but I disagree with some of your specific framings and such. I guess I'll try and comment on the relevant posts, even though this might imply commenting on some old stuff that you'll end up disclaiming.

1 G Gordon Worley III 7mo
Cool. For what it's worth, I also disagree with many of my old framings. Basically anything written more than ~1 year ago is probably vaguely but not specifically endorsed.

(Following some links...) What's the deal with Holons? 

Your linked article on epistemic circularity doesn't really try to explain itself, but rather links to this article, which LOUDLY doesn't explain itself. 

I haven't read much else yet, but here is what I think I get:

  • You use Gödel's incompleteness theorem as part of an argument that meta-rationalism can't make itself comprehensible to rationalism.
  • You think (or thought at the time) that there's a thing, Holons, or Holonic thinking, which is fundamentally really really hard to explain, but which... (read more)
1 G Gordon Worley III 7mo
Oh man I kind of wish I could go back in time and wipe out all the cringe stuff I wrote when I was trying to figure things out (like why did I need to pull in Gödel or reify my confusion?). With that said, here are some updated thoughts on holons. I'm not really familiar with OOO, so I'll be going off your summary here. I think I started out really not getting what the holon idea points at, but I understood enough to get myself confused in new ways for a while. So first off there's only ~1 holon, such that it doesn't make sense to talk about it as anything other than the whole world. Maybe you could make some case for many overlapping holons centered around each point in the universe expanding out to its Hubble volume, but I think that's probably not helpful. Better to think of the holon as just the whole world, so really it's just a weird cybernetics term for talking about the world. The trouble was I really didn't fully grasp the way that relative and absolute truth are not one and the same. So I was actually still fully trapped within my ontology, but holons seemed like a way to pull pre-ontological reality existing on its own inside of ontology. OOO mostly sounds like being confused about ontology, specifically a kind of reification of the confusion that comes from not realizing that it's maps all the way down, i.e. you only experience the world through, and it's only through experiencing non-experience that you get to taste reality, which is an extremely mysterious answer trying to point at a thing that happens all the time but we literally can't notice it because noticing it destroys it.

It's a good point. I suppose I was anchored by the map/territory analogy to focus on world-to-word fit. The part about Communicative Action and Rational Choice at the very end is supposed to gesture at the other direction. 

Intuitively, I expect it's going to be a bit easier to analyze world-to-word fit first. But I agree that a full picture should address both.

so "5+ autoresponses" would be a single category for decisionmaking purposes

I agree that something in this direction could work, and plausibly captures something about how humans reason. However, I don't feel satisfied. I would want to see the idea developed as part of a larger framework of bounded rationality.

UDT gives us a version of "never be harmed by information" which is really nice, as far as it goes. In the cases which UDT helps with, we don't need to do anything tricky, where we carefully decide which information to look at -- UDT simply isn't har... (read more)

I was using "crazy" to mean something like "too different from what we are familiar with", but I take your point. It's not clear we should want to preserve Aumann.

To be clear, rejecting Aumann's account of common knowledge would make his proof unsound (albeit still valid), but it would not solve the general "disagreement paradox", the counterintuitive conclusion that rational disagreements seem to be impossible: There are several other arguments which lead to this conclusion, and which do not rely on any notion of common knowledge.

Interesting, thanks for pointing this out!

Each time we come up against this barrier, it is tempting to add a new layer of indirection in our designs for AI systems.

I strongly agree with this characterization. Of my own "learning normativity" research direction, I would say that it has an avoiding-the-question nature similar to what you are pointing out here; I am in effect saying: Hey! We keep needing new layers of indirection! Let's add infinitely many of them! 

One reason I don't spend very much time staring the question "what is goodness/wisdom" in the eyes is, the CEV write-up and other th... (read more)

I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing. 

I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that i... (read more)

2 Alex Turner 10mo
This seems to prove too much in general, although it could be "right in spirit." If the AI cares about diamonds, finds out about the training process but experiences no more update events in that moment, and then sets its learning rate to zero, then I see no way for the Update God to intervene to make the agent care about its training process.  I was responding to: I bet you can predict what I'm about to say, but I'll say it anyways. The point of RL is not to entrain cognition within the agent which predicts the reward. RL first and foremost chisels cognition into the network.  So I think the statement "how well do the agent's motivations predict the reinforcement event" doesn't make sense if it's cast as "manage a range of hypotheses about the origin of reward (e.g. training-process vs actually making diamonds)." I think it does make sense if you think about what behavioral influences ("shards") within the agent will upweight logits on the actions which led to reward.

I expect this argument to not hold, 

Seems like the most significant remaining disagreement (perhaps).

1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the "distance covered" to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)

So I am interpreting this argument as: even if LTH implies that a nascent/potential hypothesis is training-process-modeling (in an NTK & LTH sense), you expect the gradient t... (read more)

2 Alex Turner 10mo
This seems stronger than the claim I'm making. I'm not saying that the agent won't deceptively model us and the training process at some point. I'm saying that the initial cognition will be e.g. developed out of low-level features which get reliably pinged with lots of gradients and implemented in few steps. Think edge detectors. And then the lower-level features will steer future training. And eventually the agent models us and its training process and maybe deceives us. But not right away. You can make the "some subnetwork just models its training process and cares about getting low loss, and then gets promoted" argument against literally any loss function, even some hypothetical "perfect" one (which, TBC, I think is a mistaken way of thinking). If I buy this argument, it seems like a whole lot of alignment dreams immediately burst into flame. No loss function would be safe. This conclusion, of course, does not decrease in the slightest the credibility of the argument. But I don't perceive you to believe this implication. Anyways, here's another reason I disagree quite strongly with the argument: I perceive it to strongly privilege the training-modeling hypothesis. There is an extreme range of motivations and inner cognitive structures which can be upweighted by the small number of gradients observed early in training. The network doesn't "observe" more than that, initially. The network just gets updated by the loss function. It doesn't even know what the loss function is. It can't even see the gradients. It can't even remember the past training data, except insofar as the episode is retained in its recurrent weights. The EG CoT finetuning will just etch certain kinds of cognition into the network. Why not? Claims (left somewhat vague because I have to go soon, sorry for lack of concreteness): 1. RL develops a bunch of contextual decision-influences / shards 1. EG be near diamonds, make diamonds, play games 2. Agents learn to plan, and severa

My main complaint with this, as I understand it, is that builder/breaker encourages you to repeatedly condition on speculative dangers until you're exploring a tiny and contorted part of solution-space (like worst-case robustness hopes, in my opinion). And then you can be totally out-of-touch from the reality of the problem.

On my understanding, the thing to do is something like heuristic search, where "expanding a node" means examining that possibility in more detail. The builder/breaker scheme helps to map out heuristic guesses about the value of differen... (read more)

2 Alex Turner 1y
Your comment here is great, high-effort, contains lots of interpretive effort. Thanks so much! Let me see how this would work.

  1. Breaker: "The agent might wirehead because caring about physical reward is a high-reward policy on training"
  2. Builder: "Possible, but I think using reward signals is still the best way forward. I think the risk is relatively low due to the points made by reward is not the optimization target."
  3. Breaker: "So are we assuming a policy gradient-like algorithm for the RL finetuning?"
  4. Builder: "Sure."
  5. Breaker: "What if there's a subnetwork which is a reward maximizer due to LTH?"
  6. ...

If that's how it might go, then sure, this seems productive.

I don't think I was mentally distinguishing between "the idealized builder-breaker process" and "the process as TurnTrout believes it to be usually practiced." I think you're right, I should be critiquing the latter, but not necessarily how you in particular practice it, I don't know much about that. I'm critiquing my own historical experience with the process as I imperfectly recall it. Yes, I think this was most of my point. Nice summary. I expect this argument to not hold, but I'm not yet good enough at ML theory to be super confident. Here are some intuitions. Even if it's true that LTH probabilistically ensures the existence of undesired-subnetwork,

  1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the "distance covered" to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)
  2. You're always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
  3. Even if the agent is motivated both by the

The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think that it generally seems like a good idea to have solid theories of two different things:

  1. What is the thing we are hoping to teach the AI?
  2. What is the training story by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order... (read more)

I said: 

The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans.
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.

You said: 

In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people).

Thinking ... (read more)

2 Alex Turner 1y
True, but I'm also uncertain about the relative difficulty of relatively novel and exotic value-spreads like "I value doing the right thing by humans, where I'm uncertain about the referent of humans", compared to "People should have lots of resources and be able to spend them freely and wisely in pursuit of their own purposes" (the latter being values that at least I do in fact have).

If you commit to the specific view of outer/inner alignment, then now you also want your loss function to "represent" that goal in some way.

I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases -- or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent.

This is because writing down the desired inductive biases as an explicit prior can help us to understand... (read more)

I doubt this due to learning from scratch.

I expect you'll say I'm missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is "figuring out the human prior", because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that; but I'm fine with that. I am not dogmatic on orthodox Bayesianism. I do not even like utility functions.

Insofar as the ques... (read more)
2 Alex Turner 1y
I agree, this does seem like it was a language dispute, I no longer perceive us as disagreeing on this point. 

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not

... (read more)
2 Alex Turner 1y
I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.  This isn't a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models). And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? (I know you noted its fuzziness, so maybe you're already sympathetic to responses like the one I just gave?)

This doesn't seem relevant for non-AIXI RL agents which don't end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?

With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here, wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach.

Output-based evalu... (read more)

I'm a bit uncomfortable with the "extreme adversarial threats aren't credible; players are only considering them because they know you'll capitulate" line of reasoning because it is a very updateful line of reasoning. It makes perfect sense for UDT and functional decision theory to reason in this way. 

I find the chicken example somewhat compelling, but I can also easily give the "UDT / FDT retort": since agents are free to choose their policy however they like, one of their options should absolutely be to just go straight. And arguably, the agent shou... (read more)

The agent's own generative model also depends on (adapts to, is learned from, etc.) the agent's environment. This last bit comes from "Discovering Agents".

"Having own generative model" is the shakiest part.

What it means for the agent to "have a generative model" is that the agent systematically corrects this model based on its experience (to within some tolerable competence!).

It probably means that storage, computation, and maintenance (updates, learning) of the model all happen within the agent's boundaries: if not, the agent's boundaries shall be widened... (read more)

I think the main problem is that expected utility theory is in many ways our most well-developed framework for understanding agency, but makes no empirical predictions, and in particular does not tie agency to other important notions of optimization we can come up with (and which, in fact, seem like they should be closely tied to agency).

I'm identifying one possible source of this disconnect.

The problem feels similar to trying to understand physical entropy without any uncertainty. So it's like, we understand balloons at the atomic level, but we notice th... (read more)

I think Bob still doesn't really need a two-part strategy in this case. Bob knows that Alice believes "time and space are relative", so Bob believes this proposition, even though Bob doesn't know the meaning of it. Bob doesn't need any special-case rule to predict Alice. The best thing Bob can do in this case still seems like, predict Alice based off of Bob's own beliefs.

(Perhaps you are arguing that Bob can't believe something without knowing what that thing means? But to me this requires bringing in extra complexity which we don't know how to handle anyw... (read more)

Another example of this happening comes when thinking about utilitarian morality, which by default doesn't treat other agents as moral actors (as I discuss here).

Interesting point! 

Maintain a model of Alice's beliefs which contains the specific things Alice is known to believe, and use that to predict Alice's actions in domains closely related to those beliefs.

It sounds to me like you're thinking of cases on my spectrum, somewhere between Alice>Bob and Bob>Alice. If Bob thinks Alice knows strictly more than Bob, then Bob can just use Bob's own b... (read more)

3 Richard Ngo 1y
No, I'm thinking of cases where Alice>Bob, and trying to gesture towards the distinction between "Bob knows that Alice believes X" and "Bob can use X to make predictions". For example, suppose that Bob is a mediocre physicist and Alice just invented general relativity. Bob knows that Alice believes that time and space are relative, but has no idea what that means. So when trying to make predictions about physical events, Bob should still use Newtonian physics, even when those calculations require assumptions that contradict Alice's known beliefs.

I've often repeated scenarios like this, or like the paperclip scenario.

My intention was never to state that the specific scenario was plausible or default or expected, but rather, that we do not know how to rule it out, and because of that, something similarly bad (but unexpected and hard to predict) might happen.

The structure of the argument we eventually want is one which could (probabilistically, and of course under some assumptions) rule out this outcome. So to me, pointing it out as a possible outcome is a way of pointing to the inadequacy of o... (read more)

If opens are thought of as propositions, and specialization order as a kind of ("logical") time, 

Up to here made sense.

with stronger points being in the future of weaker points, then this says that propositions must be valid with respect to time (that is, we want to only allow propositions that don't get invalidated).

After here I was lost. Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that?

This setting moti... (read more)
1 Vladimir Nesov 1y
This was just defining/motivating terms (including "validity") for this context, the technical answer is to look at the definition of specialization preorder, when it's being suggestively called "logical time". If an open is a "proposition", and a point being contained in an open is "proposition is true at that point", and a point stronger in specialization order than another point is "in the future of the other point", then in these terms we can say that "if a proposition is true at a point, it's also true at a future point", or that "propositions are valid with respect to time going forward", in the sense that their truth is preserved when moving from a point to a future point. Logical time is intended to capture decision making, with future decisions advancing the agent's point of view in logical time. So if an agent reasons only in terms of propositions valid with respect to advancement of logical time, then any knowledge it accumulated remains valid as it makes decisions, that's some of the motivation for looking into reasoning in terms of such propositions. This is mostly about how domain theory describes computations, the interesting thing is how the computations are not necessarily in the domains at all, they only leave observations there, and it's the observations that the opens are ostensibly talking about, yet the goal might be to understand the computations, not just the observations (in program semantics, the goal is often to understand just the observations though, and a computation might be defined to only be its observed behavior). So one point I wanted to make is to push against the perspective where points of a space are what the logic of opens is intended to reason about, when the topology is not Frechet (has nontrivial specialization preorder). Yeah, I've got nothing, just a sense of direction and a lot of theory to study, or else there would've been a post, not just a comment triggered by something on a vaguely similar topic. So this thread i
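
The definition-chasing in the reply above can be checked mechanically on a small finite space. This is only a sketch: the three points and the chain of opens are made up, and "in the future of" is implemented as the specialization preorder in the direction the reply uses (every open containing the earlier point also contains the later one).

```python
# A small finite topological space; the points and opens are hypothetical.
POINTS = {"a", "b", "c"}
OPENS = [
    frozenset(),
    frozenset({"c"}),
    frozenset({"b", "c"}),
    frozenset({"a", "b", "c"}),
]

def in_future_of(x, y, opens=OPENS):
    """True when y is at least as strong as x in the specialization preorder:
    every open ("proposition") containing x also contains y."""
    return all(y in u for u in opens if x in u)

def truth_is_preserved(points=POINTS, opens=OPENS):
    """Check: if a proposition holds at x and y is in the future of x,
    then the proposition also holds at y."""
    return all(
        y in u
        for u in opens
        for x in u
        for y in points
        if in_future_of(x, y, opens)
    )
```

Here "c" is in the future of "a" but not the other way around, so the preorder is nontrivial. The preservation check passes by construction, which fits the remark that this part is just defining terms: opens are exactly the sets that stay true as logical time advances.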

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything without having a utility function anywhere, which the expected values are "expectations of". 

A utility function is a gadget for turning probability distributions into expected values. This object makes sense in ... (read more)
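
Read literally, "a gadget for turning probability distributions into expected values" has a very short sketch. The outcome labels, probabilities, and utilities below are hypothetical; the snippet only illustrates the type of object being described, and it does not capture the further point that an agent can carry expected value estimates without any such function behind them.

```python
def expected_value(distribution, utility):
    """distribution maps outcome -> probability; utility maps outcome -> number."""
    return sum(p * utility(outcome) for outcome, p in distribution.items())

coin = {"heads": 0.5, "tails": 0.5}        # hypothetical distribution
payoff = {"heads": 1.0, "tails": 0.0}.get  # hypothetical utility function
value = expected_value(coin, payoff)
```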

Not to disagree hugely, but I have heard one religious conversion (an enlightenment-type experience) described in a way that fits with "takeover without holding power over someone". Specifically, this person described enlightenment in terms close to "I was ready to pack my things and leave. But the poison was already in me. My self died soon after that."

It's possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).

Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by "guessing true names". I think the approach makes sense, but my argument for why this is the case does differ from John's arguments here.

You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So "zero" is best understood as "zero within some tolerance". So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently. 

This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable t... (read more)
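
A rough sketch of "zero within some tolerance", assuming a toy model that is not in the comment: the wall is modeled as a binary symmetric channel whose flip probability approaches 0.5 as the wall gets thicker. The function names and the tolerance values are hypothetical.

```python
import math

def mutual_information_bits(joint):
    """Mutual information (in bits) of a joint distribution given as a
    dict mapping (x, y) pairs to probabilities."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def noisy_channel(flip_probability):
    """Joint distribution of a uniform input bit and its noisy copy."""
    keep = 1.0 - flip_probability
    return {
        (0, 0): 0.5 * keep, (0, 1): 0.5 * flip_probability,
        (1, 0): 0.5 * flip_probability, (1, 1): 0.5 * keep,
    }

def is_zero_within(value, tolerance):
    """Engineering-style zero: the value is inside the stated tolerance."""
    return abs(value) <= tolerance

thin_wall = mutual_information_bits(noisy_channel(0.0))    # perfect signal
thick_wall = mutual_information_bits(noisy_channel(0.49))  # almost pure noise
```

With these assumed numbers, `thick_wall` counts as zero at a tolerance of `1e-3` bits but not at `1e-6`, which is the sense in which the required amount of lead depends on the tolerance the calculation needs.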

My objection is actually mostly to the example itself. As you mention: Compare with the example: This is analogous to the case of... trying to contain a malign AI which is already not on our side.

So, I think the other answers here are adequate, but not super satisfying. Here is my attempt.

The frame of "generalization failures" naturally primes me (and perhaps others) to think of ML as hunting for useful patterns, but instead fitting to noise. While pseudo-alignment is certainly a type of generalization failure, it has different connotations: that of a system which has "correctly learned" (in the sense of internalizing knowledge for its own use), but still does not perform as intended.

The mesa-optimizers paper defines inner optimizers as performing ... (read more)

This definitely isn't well-defined, and this is the main way in which ELK itself is not well-defined and something I'd love to fix. That said, for now I feel like we can just focus on cases where the counterexamples obviously involve the model knowing things (according to this informal definition). Someday in the future we'll need to argue about complicated border cases, because our solutions work in every obvious case. But I think we'll have to make a lot of progress before we run into those problems (and I suspect that progress will mostly resolve the am... (read more)

Yeah, sorry, poor wording on my part. What I meant in that part was "argue that the direct translator cannot be arbitrarily complex", although I immediately mention the case you're addressing here in the parenthetical right after what you quote.

Ah, I just totally misunderstood the sentence, the intended reading makes sense.

Well, it might be that a proposed solution follows relatively easily from a proposed definition of knowledge, in some cases. That's the sort of solution I'm going after at the moment. 

I agree that's possible, and it does seem like a... (read more)

Job applicants often can't start right away; I would encourage you to apply!

Infradistributions are a generalization of sets of probability distributions. Sets of probability distributions are used in "imprecise bayesianism" to represent the idea that we haven't quite pinned down the probability distribution. The most common idea about what to do when you haven't quite pinned down the probability distribution is to reason in a worst-case way about what that probability distribution is. Infrabayesianism agrees with this idea.
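
A minimal sketch of the worst-case reasoning described above, covering only the plain "set of probability distributions" picture and not infradistributions themselves. The coin example, the action names, and the payoffs are assumptions chosen for illustration.

```python
def expected_value(distribution, payoff):
    """Expected payoff of one action under one distribution."""
    return sum(p * payoff[outcome] for outcome, p in distribution.items())

def worst_case_value(credal_set, payoff):
    """Value of an action when the true distribution is only known to lie
    in `credal_set`: take the least favorable expectation."""
    return min(expected_value(d, payoff) for d in credal_set)

def maximin_choice(credal_set, actions):
    """Pick the action whose worst-case expected payoff is highest."""
    return max(actions, key=lambda name: worst_case_value(credal_set, actions[name]))

# Hypothetical coin whose bias has not been pinned down.
credal_set = [
    {"heads": 0.25, "tails": 0.75},
    {"heads": 0.75, "tails": 0.25},
]
actions = {
    "bet_heads": {"heads": 1.0, "tails": -1.0},
    "bet_tails": {"heads": -1.0, "tails": 1.0},
    "abstain": {"heads": 0.0, "tails": 0.0},
}

choice = maximin_choice(credal_set, actions)  # "abstain"
```

Either bet has a worst-case expectation of -0.5 here, so the worst-case reasoner abstains even though each individual distribution would recommend some bet.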

One of the problems with imprecise bayesianism is that they haven't come up with a good update rule -- turns ... (read more)

One of the problems with imprecise bayesianism is that they haven't come up with a good update rule -- turns out it's much trickier than it looks. You can't just update all the distributions in the set, because [reasons i am forgetting]. Part of the reason infrabayes generalizes imprecise bayes is to fix this problem.

The reason you can't just update all the distributions in the set is that it wouldn't be dynamically consistent. That is, planning ahead what to do in every contingency versus updating and acting accordingly would produce different policies.
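
One toy case where this shows up, as a sketch under assumptions of my own (the signal/coin setup, the payoffs, and all names are hypothetical): a worst-case reasoner sees a signal and then decides whether to bet that a coin is heads. Planning the whole policy in advance and updating every distribution on the signal give different answers.

```python
from itertools import product

WIN, LOSE = 2.0, -1.0  # hypothetical payoffs for betting that the coin is heads
SIGNALS = ("h", "t")

# Two hypothetical joint distributions over (signal, coin).
aligned = {("h", "h"): 0.5, ("t", "t"): 0.5}  # coin always matches the signal
opposed = {("h", "t"): 0.5, ("t", "h"): 0.5}  # coin never matches the signal
credal_set = [aligned, opposed]

def payoff(action, coin):
    if action == "abstain":
        return 0.0
    return WIN if coin == "h" else LOSE

def worst_case(policy):
    """Worst-case expected payoff of a full policy (signal -> action)."""
    return min(
        sum(p * payoff(policy[signal], coin) for (signal, coin), p in joint.items())
        for joint in credal_set
    )

def condition(joint, signal):
    """Bayes-update one joint distribution on the observed signal."""
    mass = sum(p for (s, _), p in joint.items() if s == signal)
    return {coin: p / mass for (s, coin), p in joint.items() if s == signal}

def choice_after_update(signal):
    """Update every distribution in the set, then act on the worst case."""
    posteriors = [condition(joint, signal) for joint in credal_set]
    def worst(action):
        return min(sum(p * payoff(action, coin) for coin, p in post.items())
                   for post in posteriors)
    return max(("bet", "abstain"), key=worst)

policies = [dict(zip(SIGNALS, acts)) for acts in product(("bet", "abstain"), repeat=2)]
planned_policy = max(policies, key=worst_case)
updated_policy = {signal: choice_after_update(signal) for signal in SIGNALS}
```

Planning ahead picks "bet after either signal" (worst-case value 0.5, since the coin is heads half the time under both distributions), while updating each distribution separately leads to abstaining after either signal (value 0). Those are different policies, which is the inconsistency.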

The... (read more)

I'd be happy to chat about it some time (PM me if interested). I don't claim to have a fully worked out solution, though. 

Any more detailed thoughts on its relevance? EG, a semi-concrete ELK proposal based on this notion of truth/computationalism? Can identifying-running-computations stand in for direct translation?

The main difficulty is that you still need to translate between the formal language of computations and something humans can understand in practice (which probably means natural language). This is similar to Dialogic RL. So you still need an additional subsystem for making this translation, e.g. AQD. At which point you can ask, why not just apply AQD directly to a pivotal[1] action?

I'm not sure what the answer is. Maybe we should apply AQD directly, or maybe AQD is too weak for pivotal actions but good enough for translation. Or maybe it's not even good en... (read more)

Your definition requires that we already know how to modify Alice to have Clippy's goals. So your brute-force idea for how to modify Clippy to have Alice's knowledge doesn't add very much; it still relies on a magic goal/belief division, so giving a concrete algorithm doesn't really clarify anything.

Really good to see this kind of response.

1Ben Pace2y
Ah, very good point. How interesting… (If I’d concretely thought of transferring knowledge between a bird and a dog this would have been obvious.)

To be pedantic, "pragmatism" in the context of theories of knowledge means "knowledge is whatever the scientific community eventually agrees on" (or something along those lines -- I have not read deeply on it). [A pragmatist approach to ELK would, then, rule out "the predictor's knowledge goes beyond human science" type counterexamples on principle.] 

What you're arguing for is more commonly called contextualism. (The standards for "knowledge" depend on context.)

I totally agree with contextualism as a description of linguistic practice, but I think the... (read more)

3Charlie Steiner2y
Pragmatism's a great word, everyone wants to use it :P But to be specific, I mean more like Rorty (after some Yudkowskian fixes) than Peirce.

I think a lot of the values we care about are cultural, not just genetic. A human raised without culture isn't even clearly going to be generally intelligent (in the way humans are), so why assume they'd share our values?

Estimations of the information content of this part are discussed by Eric Baum in What is Thought?, although I do not recall the details.

I find that plausible, a priori. Mostly doesn't affect the stuff in the talk, since that would still come from the environment, and the same principles would apply to culturally-derived values as to environment-derived values more generally. Assuming the hardwired part is figured out, we should still be able to get an estimate of human values within the typical-human-value-distribution-for-a-given-culture from data which is within the typical-human-environment-distribution-for-that-culture.

I agree. There's nothing magical about "once". I almost wrote "once or twice", but it didn't sit well with the level of caution I would prefer be the norm. While your analysis seems correct, I am worried if that's the plan. 

I think a safety team should go into things with the attitude that this type of thing is important as a last line of defense, but should never trigger. The plan should involve a strong argument that what's being built is safe. In fact, if this type of safeguard gets triggered, I would want the policy to be to go back to the drawing boa... (read more)

Yeah fully agreed.