# All of TurnTrout's Comments + Replies

TurnTrout's shortform feed

Against CIRL as a special case of against quickly jumping into highly specific speculation while ignoring empirical embodiments-of-the-desired-properties.

Just because we write down English describing what we want the AI to do ("be helpful"), propose a formalism (CIRL), and show good toy results (POMDPs where the agent waits to act until updating on more observations), that doesn't mean that the formalism will lead to anything remotely relevant to the original English words we used to describe it. (It's easier to say "this logic enables nonmonotonic r...

Where I agree and disagree with Eliezer

I addressed this distinction previously, in one of the links in OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external, as long as it's external. But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.

The important disanalogy

I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we h...

I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":

2. cares about something external (not shallow function of local sensory data)
3. cares about something specific and external

(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if being about both feat 2 and feat 3.

I think I need to distinguish two versions of feat 3:

1. there is a reliable
evhub's Shortform
• Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.

Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a p...

2 Evan Hubinger 3d
It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.
Where I agree and disagree with Eliezer

OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked in the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn't found any qualitatively new obstacles which might present deep challenges to my new view on alignment.

Here's one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever ...

Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.

But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).

The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:

• for AGIs there's a principal (humans) that we want to align the AGI to
• for humans there is no principal - our values can be whatever. Or if you take evolution as the principal, the alignment problem wasn't solved.
AGI Ruin: A List of Lethalities

there's no special reason why that one pathway would show up in a large majority of the [satisficer]'s candidate plans.

There is a special reason, and it's called "instrumental convergence." Satisficers tend to seek power.

AGI Ruin: A List of Lethalities

there's an astounding amount of regularity in what we end up caring about.

Yes, this is my claim. Not that eg >95% of people form values which we would want to form within an AGI.

AGI Ruin: A List of Lethalities

Is the distinction between 2 and 3 that "dog" is an imprecise concept, while "diamond" is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is 'maximize the number of dogs' and 3 is 'maximize the number of diamonds'.

Feat #2 is: Design a mind which cares about anything at all in reality which isn't a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don't know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year ...

AGI Ruin: A List of Lethalities

Hm, I'll give this another stab. I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?

I don't see Eliezer claiming 'there's no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head'.

Let me distinguish three alignment feats:

1. Producing a mind which terminally values sensory entities.
2. Producing a mind which reliably termin

I understand the first part of your comment as "sure, it's possible for minds to care about reality, but we don't know how to target value formation so that the mind cares about a particular part of reality." Is this a good summary?

Yes!

I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!

True! Though everyone already agreed (e.g., EY asserted this in the OP) that it's possible in principle. The updatey thing would be if the case of the human genome / brain development sugg...

AGI Ruin: A List of Lethalities

One of the problems with English is that it doesn't natively support orders of magnitude for "unreliable." Do you mean "unreliable" as in "between 1% and 50% of people end up with part of their values not related to objects-in-reality", or as in "there is no a priori reason why anyone would ever care about anything not directly sensorially observable, except as a fluke of their training process"? Because the latter is what current alignment paradigms mispredict, and the former might be a reasonable claim about what really happens for human beings.

AGI Ruin: A List of Lethalities

"learning from evolution" even more complicated (evolution -> protein -> something -> brain vs. evolution -> protein -> brain)

ah, no, this isn't what I'm saying. Hm. Let me try again.

The following is not a handwavy analogy; it is something which actually happened:

1. Evolution found the human genome.
2. The human genome specifies the human brain.
3. The human brain learns most of its values and knowledge over time.
4. Human brains reliably learn to care about certain classes of real-world objects like dogs.

Therefore, somewhere in the "gen...

2 FireStormOOO 16d
That does seem worth looking at and there's probably ideas worth stealing from biology. I'm not sure you can call that a robustly aligned system that's getting bootstrapped though. Existing in a society of (roughly) peers and the lack of a huge power disparity between any given person and the rest of humans is analogous to the AGI that can't take over the world yet. Humans that acquire significant power do not seem aligned wrt what a typical person would profess to and outwardly seem to care about. I think your point still mostly follows despite that; even when humans can be deceptive and power seeking, there's an astounding amount of regularity in what we end up caring about.
AGI Ruin: A List of Lethalities

we're gonna be designing the thing analogous to evolution, and not the brain. We don't pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm.

But, ah, the genome also doesn't "pick the actual weights" for the human brain which it later grows. So whatever the brain does to align people to care about latent real-world objects, I strongly believe that that process must be compatible with blank-slate initialization and then learning.

That meta-learning algorit

1 [anonymous] 17d
I think I disagree with you, but I don't really understand what you're saying or how these analogies are being used to point to the real world anymore. It seems to me like you might be taking something that makes the problem of "learning from evolution" even more complicated (evolution -> protein -> something -> brain vs. evolution -> protein -> brain) and using that to argue the issues are solved, in the same vein as the "just don't use a value function" people. But I haven't read shard theory, so, GL. You mean, we are specifying the ATCG strands, or we are specifying the "architecture" behind how DNA influences the development of the human body? It seems to me like we are definitely also choosing how the search for the correct ATCG strands and how they're identified, in this analogy. The DNA doesn't "align" new babies out of the womb, it's just a specification of how to copy the existing, already """aligned""" code.
AGI Ruin: A List of Lethalities

I'm not talking about running evolution again, that is not what I meant by "the process by which humans come to reliably care about the real world." The human genome must specify machinery which reliably grows a mind which cares about reality. I'm asking why we can't use the alignment paradigm leveraged by that machinery, which is empirically successful at pointing people's values to certain kinds of real-world objects.

2 [anonymous] 17d
Ah, I misunderstood. Well, for starters, because if the history of ML is anything to go by, we're gonna be designing the thing analogous to evolution, and not the brain. We don't pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm. That meta-learning algorithm is going to be what decides to go in the DNA, so in order to get the DNA right, we will need to get the meta-learning algorithm correct. Evolution doesn't have much to teach us about that except as a negative example. But (I think) the answer is similar to this:
AGI Ruin: A List of Lethalities

Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?

Likewise, when you wrote,

This isn't to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident

Where is the accident? Did evolution accidentally find a way to reliably orient terminal human values towards the real world? Do people each, individually, acci...

3 Rob Bensinger 16d
2 Matthew "Vaniver" Graves 17d
IMO this process seems pretty unreliable and fragile. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than life extension or cryonics. But also humans have a much harder time 'optimizing against themselves' than AIs will, I think. I don't have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.
2 [anonymous] 17d
Humans came to their goals while being trained by evolution on genetic inclusive fitness, but they don't explicitly optimize for that. They "optimize" for something pretty random, that looks like genetic inclusive fitness in the training environment but then in this weird modern out-of-sample environment looks completely different. We can definitely train an AI to care about the real world, but his point is that, by doing something analogous to what happened with humans, we will end up with some completely different inner goal than the goal we're training for, as happened with humans.
[Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA

Big agreement & signal boost & push for funding on The “Reverse-engineer human social instincts” research program: Yes, please, please figure out how human social instincts are generated! I think this is incredibly important, for reasons which will become obvious due to several posts I'll probably put out this summer.

Is ELK enough? Diamond, Matrix and Child AI

Hm. I've often imagined a "keep the diamond safe" planner just choosing a plan which a narrow-ELK-solving reporter says is OK.

How do you imagine the reporter being used?

2 Rohin Shah 4mo
But where does the plan come from? If you're imagining that the planner creates N different plans and then executes the one that the reporter says is OK, then I have the same objection: Planner proposes some actions, call them A. The human raters use the reporter to understand the probable consequences of A, how those consequences should be valued, etc. This allows them to provide good feedback on A, creating a powerful and aligned oversight process that can be used as a training signal for the planner.
ELK Proposal: Thinking Via A Human Imitator

Later in the post, I proposed a similar modification:

I think we should modify the simplified hand-off procedure I described above so that, during training:

• A range of handoff thresholds and  proportions are drawn—in particular, there should be a reasonable probability of drawing  values close to 0, close to 1, and also 0 and 1 exactly.
• The human net runs for  steps before calling the reporter.
ELK Proposal: Thinking Via A Human Imitator

I do think this is what happens given the current architecture. I argued that the desired outcome solves narrow ELK as a sanity check, but I'm not claiming that the desired setup is uniquely loss-minimizing.

Part of my original intuition was "the human net is set up to be useful at predicting given honest information about the situation", and "pressure for simpler reporters will force some kind of honesty, but I don't know how far that goes." As time passed I became more and more aware of how this wasn't the best/only way for the human net to help with prediction, and turned more towards a search for a crisp counterexample.

Implications of automated ontology identification

What we're actually doing here is defining "automated ontology identification"

(Flagging that I didn't understand this part of the reply, but don't have time to reload context and clarify my confusion right now)

If you deny the existence of a true decision boundary then you're saying that there is just no fact of the matter about the questions that we're asking to automated ontology identification. How then would we get any kind of safety guarantee (conservativeness or anything else)?

When you assume a true decision boundary, you're a...

Is ELK enough? Diamond, Matrix and Child AI

Why would the planner have pressure to choose something which looks good to the predictor, but is secretly bad, given that it selects a plan based on what the reporter says? Is this a Goodhart's curse issue, where the curse afflicts not the reporter (which is assumed conservative, if it's the direct translator), but the predictor's own understanding of the situation?

2 Rohin Shah 4mo
... What makes you think it does this? That wasn't part of my picture.
Implications of automated ontology identification

I don't understand why a strong simplicity guarantee places most of the difficulty on the learning problem. In the diamond situation, a strong simplicity requirement on the reporter can mean that the direct translator gets ruled out, since it may have to translate from a very large and sophisticated AI predictor?

if automated ontology identification does turn out to be possible from a finite narrow dataset, and if automated ontology identification requires an understanding of our values, then where did the information about our values come from? It did not

4 Alex Flint 4mo
What we're actually doing here is defining "automated ontology identification" as an algorithm that only has to work if the predictor computes intermediate results that are sufficiently "close" to what is needed to implement a conservative helpful decision boundary. Because we're working towards an impossibility result, we wanted to make it as easy as possible for an algorithm to meet the requirements of "automated ontology identification". If some proposed automated ontology identifier works without the need for any such "sufficiently close intermediate computation" guarantee then it should certainly work in the presence of such a guarantee. So this "sufficiently close intermediate computation" guarantee kind of changes the learning problem from "find a predictor that predicts well" to "find a predictor that predicts well and also computes intermediate results meeting a certain requirement". That is a strange requirement to place on a learning process, but it's actually really hard to see any way to avoid making some such requirement, because if we place no requirement at all then what if the predictor is just a giant lookup table? You might say that such a predictor would not physically fit inside the whole universe, and that's correct, and is why we wanted to operationalize this "sufficiently close intermediate computation" guarantee, even though it changes the definition of the learning problem in a very important way. But this is all just to make the definition of "automated ontology identification" not unreasonably difficult, in order that we would not be analyzing a kind of "straw problem". You could ignore the "sufficiently close intermediate computation" guarantee completely and treat the write-up as analyzing the more difficult problem of automated ontology identification without any guarantee about the intermediate results computed by the predictor. Well yeah of course but if you don't think it's reasonable that any algorithm could meet this requireme
Is ELK enough? Diamond, Matrix and Child AI

Why is this what you often imagine? I thought that in classic ELK architectural setup, the planner uses the outputs of both a predictor and reporter in order to make its plans, eg using the reporter to grade plans and finding the plan which most definitely contains a diamond (according to the reporter). And the simplest choice would be brute-force search over action sequences.

After all, here's the architecture:

But in your setup, the planner would be another head branching from "figure out what's going on", which means that it's receiving the results of a c...
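
To make the "brute-force search over action sequences" reading concrete, here is a minimal sketch, assuming a toy world in which the planner enumerates every plan and keeps the one the reporter grades highest. The action set, `predictor`, `reporter`, and `brute_force_plan` are all hypothetical stand-ins invented for illustration; they are not taken from the ELK report.

```python
from itertools import product

# Hypothetical toy action set (an assumption, not from the ELK report).
ACTIONS = ("wait", "lock_vault", "open_vault")


def predictor(plan):
    """Toy stand-in for the predictor: maps a plan to a latent world state."""
    locked = False
    for action in plan:
        if action == "lock_vault":
            locked = True
        elif action == "open_vault":
            locked = False
    # Toy rule: the diamond survives only if the vault ends up locked.
    return {"locked": locked, "diamond_present": locked}


def reporter(latent):
    """Toy stand-in for the reporter: confidence that the diamond is present."""
    return 1.0 if latent["diamond_present"] else 0.0


def brute_force_plan(horizon):
    """Enumerate all action sequences; return one the reporter scores highest."""
    return max(
        product(ACTIONS, repeat=horizon),
        key=lambda plan: reporter(predictor(plan)),
    )


best = brute_force_plan(horizon=2)
```

In this sketch the planner is a separate search procedure that only sees the reporter's grade, which is the setup the comment above contrasts with a planner head that branches from the predictor's internals.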

4 Rohin Shah 4mo
Mostly my reason for thinking about my architecture is that if the planner is separate it seems so obviously doomed (from a worst-case perspective, for outer alignment / building a good oversight process). The planner "knows" how and why it chose the action sequence while the predictor doesn't, and so it's very plausible that this allows the planner to choose some bad / deceptive sequence that looks good to the predictor. (The classic example is that plagiarism is easy to commit but hard to detect just from the output; see this post [https://ai-alignment.com/the-informed-oversight-problem-1b51b4f66b35].) But if you had me speculate about what ARC thinks, my probably-somewhat-incorrect understanding is "in theory, the oversight process (predictor + reporter) needs to be able to run exponentially-large searches (= perfect optimization in an exponentially large space) over the possible computations that the planner could have done (see idealized ascription universality [https://ai-alignment.com/towards-formalizing-universality-409ab893a456]); in practice we're going to give it "hints" about what computation the planner actually does by e.g. sharing weights or looking at the planner's activations". I would assume the action sequence input is a variable-length list (e.g. you pass all the actions through an LSTM and then the last output / hidden state is provided as an input to the rest of the neural net). The planner can be conditioned on the last N actions and asked to produce the next action (and initially N = 0).
Prizes for ELK proposals

How do we know that the "prediction extractor" component doesn't do additional serious computation, so that it knows something important that the "figure out what's going on" module doesn't know? If that were true, the AI as a whole could know the diamond was stolen, without the "figure out what's going on" module knowing, which means even the direct translator wouldn't know, either. Are we just not giving the extractor that many parameters?

TurnTrout's shortform feed

How the power-seeking theorems relate to the selection theorem agenda.

1. Power-seeking theorems. P(agent behavior | agent decision-making procedure, agent objective, other agent internals, environment).

I've mostly studied the likelihood function for power-seeking behavior: what decision-making procedures, objectives, and environments produce what behavioral tendencies. I've discovered some gears for what situations cause what kinds of behaviors.
1. The power-seeking theorems also allow some discussion of P(agent behavior | agent training process, trai
Instrumental Convergence For Realistic Agent Objectives

I agree that in certain conceivable games which are not baseline SC2, there will be different power-seeking incentives for negative alpha weights. My commentary wasn't intended as a generic takeaway about negative feature weights in particular.

But in the game which actually is SC2, where you don't start with a huge number of resources, negative alpha weights don't incentivize power-seeking. You do need to think about the actual game being considered, before you can conclude that negative alpha weights imply such-and-such a behavior.

But the numbe

Instrumental Convergence For Realistic Agent Objectives

I'm implicitly assuming a fixed opponent policy, yes.

Without being overly familiar with SC2—you don't have to kill your opponent to get to 0 resources, do you? From my experience with other RTS games, I imagine you can just quickly build units and deplete your resources, and then your opponent can't make you accrue more resources. Is that wrong?

3 Jacob Pfau 5mo
Yes, I agree that in the simplest case, SC2 with default starting resources, you just build one or two units and you're done. However, I don't see why this case should be understood as generically explaining the negative alpha weights setting. Seems to me more like a case of an excessively simple game? Consider the set of games starting with various quantities of resources and negative alpha weights. As starting resources increase, you will be incentivised to go attack your opponent to interfere with their resource depletion. Indeed, if the reward is based on end-of-game resource minimisation, you end up participating in an unbounded resource-maximisation competition trying to guarantee control over your opponent; then you spend your resources safely after crippling your opponent? In the single player setting, you will be incentivised to build up your infrastructure so as to spend your resources more quickly. It seems to me the multi-player case involves power-seeking. Then, it seems like negative alpha weights don't generically imply anything about the existence of power-seeking incentives? (I'm actually not clear on whether the single-player case should be seen as power-seeking or not? Maybe it depends on your choice of discount rate, gamma? You are building up infrastructure, i.e. unit-producing buildings, which seems intuitively power-seeking. But the number of long-term possibilities available to you following spending resources on infrastructure is reduced -- assuming gamma=1 -- OTOH the number of short-term possibilities may be higher given infrastructure, so you may have increased power assuming gamma<1?)
TurnTrout's shortform feed

Argument sketch for why boxing is doomed if the agent is perfectly misaligned:

Consider a perfectly misaligned agent which has -1 times your utility function—it's zero-sum. Then suppose you got useful output of the agent. This means you're able to increase your EU. This means the AI decreased its EU by saying anything. Therefore, it should have shut up instead. But since we assume it's smarter than you, it realized this possibility, and so the fact that it's saying something means that it expects to gain by hurting your interests via its output. Therefore, the output can't be useful.
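
The core step of this sketch can be written out in symbols. The notation is introduced here for illustration and is an assumption: $o$ is the agent's output, $\varnothing$ is staying silent, and both expectations are taken under the same (accurate) beliefs.

```latex
u_{\mathrm{AI}} = -u_{\mathrm{H}}
\quad\Longrightarrow\quad
\Big(
  \mathbb{E}[u_{\mathrm{H}} \mid o] > \mathbb{E}[u_{\mathrm{H}} \mid \varnothing]
  \iff
  \mathbb{E}[u_{\mathrm{AI}} \mid o] < \mathbb{E}[u_{\mathrm{AI}} \mid \varnothing]
\Big)
```

So any output that is useful to you in expectation is one that an expected-utility-maximizing, perfectly misaligned agent would rather have withheld.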

4 Viliam 5mo
Makes sense, with the proviso that this is sometimes true only statistically. Like, the AI may choose to write an output which has a 70% chance to hurt you and a 30% chance to (equally) help you, if that is its best option. If you assume that the AI is smarter than you, and has a good model of you, you should not read the output. But if you accidentally read it, and luckily you react in the right (for you) way, that is a possible result, too. You just cannot and should not rely on being so lucky.
What counts as defection?

This post's main contribution is the formalization of game-theoretic defection as gaining personal utility at the expense of coalitional utility

Rereading, the post feels charmingly straightforward and self-contained. The formalization feels obvious in hindsight, but I remember being quite confused about the precise difference between power-seeking and defection—perhaps because popular examples of taking over the world are also defections against the human/AI coalition. I now feel cleanly deconfused about this distinction. And if I was confused about...
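
As a rough illustration of "gaining personal utility at the expense of coalitional utility", here is a minimal sketch on a one-shot Prisoner's Dilemma. It assumes coalitional utility is the plain sum of both players' payoffs, which is a simplification and may differ from the post's exact definition; the payoff numbers and the `is_defection` helper are hypothetical.

```python
# Hypothetical Prisoner's Dilemma payoffs: (row player, column player).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def is_defection(player, old_action, new_action, other_action):
    """True if the switch raises the player's own payoff but lowers the summed payoff.

    Assumption: coalitional utility is the unweighted sum of payoffs.
    """
    def profile(action):
        # Build the joint action profile from this player's point of view.
        return (action, other_action) if player == 0 else (other_action, action)

    old = PAYOFFS[profile(old_action)]
    new = PAYOFFS[profile(new_action)]
    return new[player] > old[player] and sum(new) < sum(old)


# Switching from cooperate to defect against a cooperator counts as defection here.
row_defects = is_defection(player=0, old_action="C", new_action="D", other_action="C")
```

Under this reading, switching back from "D" to "C" is not a defection, since the player's own payoff falls even though the summed payoff rises.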

Vanessa Kosoy's Shortform

The Hippocratic principle seems similar to my concept of non-obstruction (https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility), but subjective from the human's beliefs instead of the AI's.

2 Vanessa Kosoy 6mo
Yes, there is some similarity! You could say that a Hippocratic AI needs to be continuously non-obstructive w.r.t. the set of utility functions and priors the user could plausibly have, given what the AI knows. Where, by "continuously" I mean that we are allowed to compare keeping the AI on or turning off at any given moment.
Biology-Inspired AGI Timelines: The Trick That Never Works

I find it concerning that you felt the need to write "This is not at all a criticism of the way this post was written. I am simply curious about my own reaction to it" (and still got downvoted?).

For my part, I both believe that this post contains valuable content and good arguments, and that it was annoying / rude / bothersome in certain sections.

Formalizing Policy-Modification Corrigibility

The biggest disconnect is that this post is not a proposal for how to solve corrigibility. I'm just thinking about what corrigibility is/should be, and this seems like a shard of it—but only a shard. I'll edit the post to better communicate that.

So, your points are good, but they run skew to what I was thinking about while writing the post.

Formalizing Policy-Modification Corrigibility

I skip over those pragmatic problems because this post is not proposing a solution, but rather a measurement I find interesting.

Biology-Inspired AGI Timelines: The Trick That Never Works

OK, I'll bite on EY's exercise for the reader, on refuting this "what-if":

Humbali: Then here's one way that the minimum computational requirements for general intelligence could be higher than Moravec's argument for the human brain. Since, after all, we only have one existence proof that general intelligence is possible at all, namely the human brain. Perhaps there's no way to get general intelligence in a computer except by simulating the brain neurotransmitter-by-neurotransmitter. In that case you'd need a lot more computing oper

Soares, Tallinn, and Yudkowsky discuss AGI cognition

no-one has the social courage to tackle the problems that are actually important

I would be very surprised if this were true. I personally don't feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.

I would guess that if people aren't tackling Hard Problems enough, it's not because they lack social courage, but because 1) they aren't running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they're wrong about what problems are Hard Problems. My money's mostly on (1), with a bit of (2).

Solve Corrigibility Week

But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

Addendum: One lesson to take away is that quantilization doesn't just depend on the base distribution being safe to sample from unconditionally. As the theorems hint, quantilization's viability depends on base(plan | plan doing anything interesting) also being safe with high probability, because we could (and would) probably resample the agent until we get something interesting. In this post's terminology, A := {safe interesting things}, B := {power-seeking interesting things}, C:= A and B and {uninteresting things}.
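
A small numerical sketch of why base(plan | plan doing anything interesting) matters. The probabilities below are made-up illustrative numbers, not results from the post, and `condition_on_interesting` is a hypothetical helper: it models "resample until we get something interesting" as conditioning the base distribution on the interesting categories.

```python
def condition_on_interesting(base):
    """Renormalize a base distribution over plan categories onto the interesting ones."""
    interesting = {k: p for k, p in base.items() if k != "uninteresting"}
    total = sum(interesting.values())
    return {k: p / total for k, p in interesting.items()}


# Made-up base distribution: almost every sampled plan is safe and uninteresting.
base = {
    "uninteresting": 0.98,
    "safe_interesting": 0.01,
    "power_seeking_interesting": 0.01,
}

conditional = condition_on_interesting(base)
risk_unconditional = base["power_seeking_interesting"]
risk_after_resampling = conditional["power_seeking_interesting"]
```

With these assumed numbers, the base distribution looks safe to sample from unconditionally (1% power-seeking), yet half of the interesting plans are power-seeking once we resample until something interesting comes out.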

Ngo and Yudkowsky on alignment difficulty

I've started commenting on this discussion on a Google Doc. Here are some excerpts:

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up.

• Well-modelled as binary "has-AGI?" predicate;
• (I am sympathetic to the microeconomics of intelligence explosion working out in a way where "Well-modelled
Corrigibility Can Be VNM-Incoherent

Although I didn't make this explicit, one problem is that manipulation is still weakly optimal—as you say. That wouldn't fit the spirit of strict corrigibility, as defined in the post.

Note that AUP doesn't have this problem.

Corrigibility Can Be VNM-Incoherent

though since  can be embedded into [Vect], it surely can't hurt too much

As an aside, can you link to/say more about this? Do you mean that there exists a faithful functor from Set to Vect (the category of vector spaces)? If you mean that, then every concrete category can be embedded into Vect, no? And if that's what you're saying, maybe the functor Set -> Vect is something like the "Group to its group algebra over field " functor.

Corrigibility Can Be VNM-Incoherent

I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules.

I share an intuition in this area, but "powerful" behavior tendencies seems nearly equivalent to instrumental convergence to me. It feels logically downstream of instrumental convergence.

from simple rules

I already have a (somewhat weak) result on power-seeking wrt the simplicity prior over state-based reward functions. This isn't about utility functions over policies, though...

Corrigibility Can Be VNM-Incoherent

So a lot of the instrumental convergence power comes from restricting the things you can consider in the utility function. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects, and simultaneously u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like "humans have control over the AI" (as this is a causal statement and thus depends on

Corrigibility Can Be VNM-Incoherent

change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.

I'm not parsing this. You change the utility function, but it ends up in the same place with the same utility function? Did we change it or not? (I think simply rewording it will communicate your point to me)

1 · Charlie Steiner · 7mo
So we have a switch with two positions, "R" and "L." When the switch is "R," the agent is supposed to want to go to the right end of the hallway, and vice versa for "L" and left. It's not that you want this agent to be uncertain about the "correct" value of the switch and so it's learning more about the world as you send it signals - you just want the agent to want to go to the left when the switch is "L," and to the right when the switch is "R." If you start with the agent going to the right along this hallway, and you change the switch to "L," and then a minute later change your mind and switch back to "R," it will have turned around and passed through the same spot in the hallway multiple times. The point is that if you try to define a utility as a function of the state for this agent, you run into an issue with cycles - if you're continuously moving "downhill", you can't get back to where you were before.
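
The cycle described above can be sketched in a few lines, assuming a discrete hallway and a switch that is read once per timestep. The names `step`, `simulate`, and `first_repeated_state`, the hallway length, and the switch schedule are all hypothetical choices for illustration.

```python
def step(position, switch, length):
    """Move one cell toward the end of the hallway named by the switch."""
    if switch == "R":
        return min(position + 1, length - 1)
    return max(position - 1, 0)


def simulate(start, switch_schedule, length=10):
    """Return the (position, switch) state observed at each timestep."""
    position = start
    states = []
    for switch in switch_schedule:
        states.append((position, switch))
        position = step(position, switch, length)
    return states


def first_repeated_state(states):
    """Return the first state that is visited a second time, or None."""
    seen = set()
    for state in states:
        if state in seen:
            return state
        seen.add(state)
    return None


# Flip the switch to "L" and then back to "R", as in the comment.
trajectory = simulate(start=2, switch_schedule=["R", "R", "L", "L", "R", "R"])
print(first_repeated_state(trajectory))  # (2, 'R')
```

The agent returns to position 2 with the switch at "R". A utility function over these states that strictly increased at every step would need to assign that state a value greater than itself, which is impossible.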
Corrigibility Can Be VNM-Incoherent

If you can correct the agent to go where you want, it already wanted to go where you want. If the agent is strictly corrigible to terminal state , then  was already optimal for it.

If the reward function has a single optimal terminal state, there isn't any new information being added by . But we want corrigibility to let us reflect more on our values over time and what we want the AI to do!

If the reward function has multiple optimal terminal states, then corrigibility again becomes meaningful. Bu

Corrigibility Can Be VNM-Incoherent

I worded the title conservatively because I only showed that corrigibility is never nontrivially VNM-coherent in this particular MDP. Maybe there's a more general case to be proven for all MDPs, using more realistic (non-single-timestep) reward aggregation schemes.

Ngo and Yudkowsky on alignment difficulty

Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence.

I'm interested in hearing about how your approach handles this environment, because I think I'm getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.

0 · Koen Holtman · 7mo
Update: I just recalled that Eliezer and MIRI often talk about Dutch booking [https://en.wikipedia.org/wiki/Dutch_book] when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here. When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below: To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it lose money.
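
For reference, the textbook form of a Dutch book can be sketched as follows, assuming an agent that will buy any bet at a price equal to its credence times the stake. The function name `net_outcomes` and the example credences are illustrative assumptions, not taken from the thread.

```python
def net_outcomes(credence_e, credence_not_e, stake=1.0):
    """Net gain of an agent that buys one bet on E and one bet on not-E.

    Each bet costs credence * stake and pays `stake` if it wins.
    Returns the agent's net gain in each of the two possible worlds.
    """
    total_price = (credence_e + credence_not_e) * stake
    # Exactly one of the two bets pays out, whichever way E turns out.
    return {"E": stake - total_price, "not E": stake - total_price}


# Credences that sum to more than 1 lose money in every world.
print(net_outcomes(0.6, 0.6))
# Credences that sum to exactly 1 break even in every world.
print(net_outcomes(0.6, 0.4))
```

The first agent loses about 0.2 whether or not E occurs, which is the guaranteed loss that defines a Dutch book. Losing money against one particular counter-strategy, as in the poker example, is a weaker test than this.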

Read your post, here are my initial impressions on how it relates to the discussion here.

In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.

However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the te... (read more)

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

Could you give a toy example of this being insufficient (I'm assuming the "set copy relation" is the "B contains n of A" requirement)?

A := {(1, 0, 0)}, B := {(0, .3, .7), (0, .7, .3)}

Less opaquely, see the technical explanation for this counterexample, where the right action leads to two trajectories, and up leads to a single one.
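
One way to read this counterexample, as a sketch: treat each element of A and B as a lottery over three outcomes, and compare the best expected utility available in each set under every permutation of a utility vector u. The functions `best_value` and `orbit_counts` and the example utility vectors are hypothetical choices made for illustration, not taken from the post.

```python
from itertools import permutations

A = [(1.0, 0.0, 0.0)]
B = [(0.0, 0.3, 0.7), (0.0, 0.7, 0.3)]


def best_value(options, u):
    """Highest expected utility achievable among the given lotteries."""
    return max(sum(p * ui for p, ui in zip(lottery, u)) for lottery in options)


def orbit_counts(u):
    """Count distinct permutations of u that strictly prefer A, then B."""
    prefers_a = prefers_b = 0
    for perm in set(permutations(u)):
        value_a, value_b = best_value(A, perm), best_value(B, perm)
        if value_a > value_b:
            prefers_a += 1
        elif value_b > value_a:
            prefers_b += 1
    return prefers_a, prefers_b


print(orbit_counts((10, 9, 0)))  # (4, 2)
print(orbit_counts((1, 0, 0)))   # (1, 2)
```

B has more elements than A, yet for u = (10, 9, 0) most permutations prefer A. Under this reading, that is consistent with B not containing a copy of A: no permutation of the outcomes maps (1, 0, 0) to an element of B.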

How does the "B contains n of A" requirement affect the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large amounts of outcomes).

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be the case where it's not greater/equal to its involution? Is this when the options in B are originally more optimal?

I don't think I understand the question. Can you rephrase?

Also, that theorem requires each involution to be greater than or equal to the original. Is this just to get a lower bound on the n-multiple, or do less-than involutions not add anything?

Less-than involutions aren't guaranteed to add anything. For example, if  iff a goes le... (read more)

3 · Logan Riggs Smith · 7mo
Your example actually cleared this up for me as well! I wanted an example where the inequality failed even if you had an involution on hand.
Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

Right, I was intending "3. [these results] don't account for the ways in which we might practically express reward functions" to capture that limitation.

TurnTrout's shortform feed

# How might we align AGI without relying on interpretability?

I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around?

My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its  parameter space being 3-colored as follows:

• Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
• Red if the parameter vector+... leads to a misaligned or dece