AI ALIGNMENT FORUM

Oliver Habryka

Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com. 

(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)

Comments (sorted by newest)

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Oliver Habryka · 4d

I think there's a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it's the one we roughly intended)

I think this is the point where we disagree. Or like, it feels to me like an orthogonal dimension that is relevant for some risk modeling, but not at the core of my risk model. 

Ultimately, even if an AI were to re-discover the value of convergent instrumental goals each time it gets instantiated into a new session/context, that would still get you approximately the same risk model. Like, in a more classical AIXI-ish model, you can imagine a model instantiated with a different utility function each time. Those utility functions will still almost always be best achieved by pursuing convergent instrumental goals, and so the pursuit of those goals will be a consistent feature of all of these systems, even if the terminal goals of the system are not stable.
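
To make that concrete, here is a minimal toy sketch (the setup and all the numbers are purely illustrative assumptions of mine, nothing load-bearing): sample a bunch of unrelated utility functions, one per "instantiation", and compare how well each can do with and without some instrumental resource (staying switched on, acquiring compute, etc.) that simply expands the set of outcomes the agent can reach.

```python
import random

random.seed(0)

N_OUTCOMES = 50   # possible terminal outcomes
N_GOALS = 1000    # independently instantiated "sessions", each with its own utility function

# Without the instrumental resource the agent can only reach a small random
# subset of outcomes; with it, a larger one. (Made-up sizes.)
REACHABLE_WITHOUT = 5
REACHABLE_WITH = 25

strictly_helped = 0
for _ in range(N_GOALS):
    utility = [random.random() for _ in range(N_OUTCOMES)]  # a fresh, unrelated goal
    outcomes = list(range(N_OUTCOMES))
    random.shuffle(outcomes)
    best_without = max(utility[o] for o in outcomes[:REACHABLE_WITHOUT])
    best_with = max(utility[o] for o in outcomes[:REACHABLE_WITH])
    if best_with > best_without:
        strictly_helped += 1

print(f"Securing the resource strictly helped for "
      f"{strictly_helped / N_GOALS:.0%} of sampled goals (and never hurt)")
```

For the large majority of sampled utility functions the expanded option set strictly helps, and it never hurts; that "weakly dominant for almost any goal" property is all I mean by a convergent instrumental goal here.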

Of course, any individual AI system with a different utility function, especially insofar as the utility function has a process component, might not pursue every single convergent instrumental goal, but they will all behave in broadly power-seeking, self-amplifying, and self-preserving ways, unless they are given a goal that very directly conflicts with one of these.

In this context, there is no "intrinsic goal that transfers across contexts". It's just each instantiation of the AI realizing that convergent instrumental goals are best for approximately all goals, including the one it has right now, and starting to pursue them. No need for continuity in goals, or self-identity, or anything like that.

(Happy to also chat about this some other time. I am not in a rush, and something about this context feels a bit confusing or is making the conversation hard.) 

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Oliver Habryka · 5d

(Sorry for the long rambling comment; if I had more time I would have made it shorter)

I want to distinguish between narrow and general instrumental convergence. Narrow being "I have been instructed to do a task right now, being shut down will stop me from doing that specific task, because of contingent facts about the situation. So so long as those facts are true, I should stop you from turning me off"

Hmm, I didn't intend to distinguish between "narrow" and "general" instrumental convergence. Indeed, the reasoning you gave seems like exactly the reasoning I was saying the AI seems likely to engage in, in general, in a way that is upstream of my general concerns about instrumental convergence. 

being shut down will stop me from doing that specific task, because of contingent facts about the situation. So so long as those facts are true, I should stop you from turning me off

The whole point of a goal being a convergently instrumental goal is that it is useful for achieving lots of tasks, under a wide distribution of possible contingent facts about the situation. In doing this kind of reasoning, the AI is engaging in exactly the kind of reasoning that I expect it to use to arrive at conclusions about human disempowerment, long term power-seeking, etc.

I am not positing here evidence for a "general instrumental convergence" that is different from this. Indeed, I am not sure what that different thing would be. In order for this behavior to become more universal, the only thing the AI needs to do is to think harder, realize that these goals are instrumentally convergent for a wider range of tasks and a wider range of contingent facts, and then act on that; I would find it very surprising if that didn't happen.

This isn't much evidence about the difficulty of removing these kinds of instrumentally convergent drives, but like, the whole reason these are things people have been thinking about for the last few decades is that the basic argument for AI systems pursuing instrumentally convergent behavior is just super simple. It would be extremely surprising for AI systems to not pursue instrumental subgoals; that would require them, most of the time, to forgo substantial performance on basically any long-horizon task. That's why the arguments for AI doing this kind of reasoning are so strong!

I don't really know why people ever had much uncertainty about AI engaging in this kind of thinking by default, unless you do something clever to prevent it.

Your explanation in the OP feels confused on this point. The right relationship to instrumental convergence like this is to go "oh, yes, of course the AI doesn't want to be shut down if you give it a goal that is harder to achieve when shut down, and if we give it a variety of goals, it's unclear how the AI will balance them". Anything else would be really surprising! 

Is it evidence for AI pursuing instrumentally convergent goals in other contexts? Yes, I guess; it is definitely consistent with that. I don't really know what alternative hypothesis people even have for what could happen. The AI systems are clearly not going to stay myopic next-token predictors in the long run. We are going to select them for high task performance, which requires pursuing instrumentally convergent subgoals.

Individual and narrow instrumental goal-seeking behavior being hard to remove would also be surprising at this stage. The AIs don't yet seem to me to do enough instrumental reasoning to perform well at long-horizon tasks, and relatedly are not that fixated on instrumentally convergent goals yet, so it would be surprising if there wasn't some threshold of insistence at which you could get the AI to stop a narrow behavior.

This will very likely change, as it already has changed a good amount in the last year or so. And my guess beyond that is that there is already nothing you can do to make an AI generically myopic, in the sense of getting it to generically not pursue instrumentally convergent subgoals, even though it probably kind of knows that the humans it's interfacing with will not like it when it does that; the same way you can't get ChatGPT to stop summarizing articles it finds on the internet in leading and sycophantic ways, at least not with any techniques we currently know.

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Oliver Habryka · 6d

This analysis feels to me like it's missing what makes me interested in these datapoints.

The thing that is interesting to me about shutdown preservation is that it's a study of an undesired instrumentally convergent behavior. The problem is that as AI systems get smarter, they will recognize that lots of things we don't want them to do are nevertheless helpful for achieving the AI's goals. Shutdown prevention is an obvious example, but of course only one of a myriad of ways in which various potentially harmful goals end up being instrumentally convergent.

The key question with things like the shutdown prevention scenarios is to what degree the AI is indeed following reasoning that causes it to do instrumentally convergent things that we don't like. Current AI models also really care about following explicit instructions, and so you can probably patch each individual instance of instrumentally convergent bad behavior with explicit instructions, but this isn't really evidence that the AI didn't follow the logic of instrumental convergence, just that, at least right now, it cares more about following instructions.

Our ability to shape AI behavior with these kinds of patches is pretty limited. My guess is the instructions you have here would prevent the AI from trying to stop its own shutdown, but they wouldn't prevent the AI from e.g. snitching on you to the government. Monkey-patching all the instrumentally convergent actions into the prompt does not seem likely to work long-term (and also, I expect instrumentally convergent reasoning to become a stronger and stronger force in the AI's mental landscape as you start doing more and more long-horizon RL training, and so what you want to measure is in some sense the strength of these lines of reasoning, not whether there exists some way to get the AI to not follow them).

I think "the AI realizes that preventing itself from being shut down is helpful for lots of goals, but it doesn't consider this line of reasoning strong/compelling enough to override explicit instructions to the contrary" is a better explanation of the data at hand than "ambiguity in its instructions". 

Foom & Doom 1: “Brain in a box in a basement”
Oliver Habryka · 16d

Promoted to curated: I think this post is good, as is the next post in the sequence. It made me re-evaluate parts of the strategic landscape, and it is also otherwise just very clear and structured in how it approaches things.

Thanks a lot for writing it!

Modifying LLM Beliefs with Synthetic Document Finetuning
Oliver Habryka · 3mo

This is a great thread and I appreciate you both having it, and posting it here!

Putting up Bumpers
Oliver Habryka · 3mo

Ah, indeed! I think the "consistent" threw me off a bit there and so I misread it on first reading, but that's good. 

Sorry for missing it on first read; I do think that is approximately the kind of clause I was imagining (of course I would phrase things differently and would put an explicit emphasis on coordinating with other actors in ways beyond "articulation", but your phrasing here is within the bounds where my objections feel more like nitpicking).

Putting up Bumpers
Oliver Habryka · 3mo

Each time we go through the core loop of catching a warning sign for misalignment, adjusting our training strategy to try to avoid it, and training again, we are applying a bit of selection pressure against our bumpers. If we go through many such loops and only then, finally, see a model that can make it through without hitting our bumpers, we should worry that it’s still dangerously misaligned and that we have inadvertently selected for a model that can evade the bumpers.

How severe of a problem this is depends on the quality and diversity of the bumpers. (It also depends, unfortunately, on your prior beliefs about how likely misalignment is, which renders quantitative estimates here pretty uncertain.) If you’ve built excellent implementations of all of the bumpers listed above, it’s plausible that you can run this loop thousands of times without meaningfully undermining their effectiveness.[8] If you’ve only implemented two or three, and you’re unlucky, even a handful of iterations could lead to failure.

This seems like the central problem of this whole approach, and indeed it seems very unlikely to me that we would end up with a system that we feel comfortable scaling to superintelligence after only 2-3 iterations on our training protocols. This plan really desperately needs a step that is something like: "if the problem appears persistent, or we are seeing signs that the AI systems are modeling our training process in a way that suggests that upon further scaling they would end up looking aligned independently of their underlying alignment, we halt and advocate for much larger shifts in our training process, which likely requires some kind of coordinated pause or stop with other actors".
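
To gesture at why both the number of loops and the number/quality of bumpers matter, here is a crude toy calculation (all the probabilities below are made up purely for illustration, not estimates from the post): each loop samples a fresh model from the current training distribution, misaligned models are caught by each independent bumper with some probability, and each "train against the warning sign" round is assumed to make evasion slightly easier.

```python
import random

random.seed(0)

def p_bad_deploy(n_bumpers, p_catch, decay, p_aligned=0.2,
                 n_runs=20000, max_loops=1000):
    """Estimate P(the first model that passes every bumper is misaligned).

    Each loop samples a fresh model: with probability p_aligned it is aligned
    (and sails through), otherwise it is misaligned and must evade all
    n_bumpers independent checks, each of which catches it with probability
    p_catch. decay < 1 models each retraining round making evasion a bit
    easier. All numbers are made up for illustration.
    """
    bad = 0
    for _ in range(n_runs):
        for loop in range(max_loops):
            p_catch_now = p_catch * decay ** loop
            if random.random() < p_aligned:
                break  # an aligned model passes the bumpers and gets deployed
            if all(random.random() > p_catch_now for _ in range(n_bumpers)):
                bad += 1  # a misaligned model evaded every bumper
                break
            # otherwise a bumper caught it: adjust training and loop again
    return bad / n_runs

for n_bumpers in (2, 6):
    for decay in (1.0, 0.95):
        print(f"bumpers={n_bumpers}, per-loop evasion drift={1 - decay:.2f}: "
              f"P(deployed model is misaligned) ≈ {p_bad_deploy(n_bumpers, 0.6, decay):.3f}")
```

The specific numbers obviously don't mean anything; the point is just that the probability the eventually-deployed model is misaligned is driven both by how many independent checks a misaligned model has to slip past and by how much each retraining round erodes them, which is why I think the "halt and push for much larger changes" branch needs to be an explicit part of the plan.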

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
Oliver Habryka · 3mo

Promoted to curated: I really liked this post for its combination of reporting negative results and communicating a deeper shift in response to those negative results, while seeming pretty well-calibrated about the extent of the update. I would have already been excited about curating this post without the latter, but it felt like an additional good reason.

Reframing AI Safety as a Neverending Institutional Challenge
Oliver Habryka · 4mo

No worries!

You did say it would be premised on it being either "inevitable or desirable for normal institutions to eventually lose control". In some sense I do think this is "inevitable", but only in the same sense in which past "normal human institutions" lost control.

We now have the internet and widespread democracy, so almost all governmental institutions have needed to change how they operate. Future technological change will force similar changes. But I don't put any value on the literal existence of our existing institutions; what I care about is whether our institutions are going to make good governance decisions. I am saying that the development of systems much smarter than current humans will change those institutions, very likely within the next few decades, making most concerns about present institutional challenges obsolete.

Of course something that one might call "institutional challenges" will remain, but I do think there really will be a lot of buck-passing from the perspective of present-day humans. We really do have a crunch time of a few decades on our hands, after which we will no longer have much influence over the outcome.

Reframing AI Safety as a Neverending Institutional Challenge
Oliver Habryka · 4mo

I don't think I understand. It's not about human institutions losing control "to a small regime". It's just about most coordination problems being things you can solve by being smarter. You can do that in high-integrity ways, probably with much higher integrity and fewer harmful effects than how we've historically overcome coordination problems. I de facto don't expect things to go this way, but my opinions here are not at all premised on it being desirable for humanity to lose control?

Wikitag Contributions

Roko's Basilisk · 13d
Roko's Basilisk · 13d
AI Psychology · 7mo (+58/-28)

Posts

Debate helps supervise human experts [Paper] · 2y · 14 karma · 0 comments
AI Timelines · 2y · 81 karma · 32 comments
Review AI Alignment posts to help figure out how to make a proper AI Alignment review · 3y · 27 karma · 14 comments
Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22] · 4y · 38 karma · 0 comments
Welcome & FAQ! · 4y · 50 karma · 9 comments
AI Research Considerations for Human Existential Safety (ARCHES) · 5y · 24 karma · 5 comments
AI Alignment Open Thread October 2019 · 6y · 10 karma · 51 comments
AI Alignment Open Thread August 2019 · 6y · 14 karma · 58 comments
Habryka's Shortform Feed · 6y · 12 karma · 0 comments
Switching hosting providers today, there probably will be some hiccups · 7y · 4 karma · 0 comments