Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Rant on Problem Factorization for Alignment

One more disanalogy:

4. The rest of the world pays attention to large or powerful real-world bureaucracies and forces rules on them that small teams / individuals can ignore (e.g. Secret Congress, the Copenhagen interpretation of ethics, startups being able to do illegal stuff), but this presumably won't apply to alignment approaches.

One other thing I should have mentioned is that I do think the "unconscious economics" point is relevant and could end up being a major problem for problem factorization, but I don't think we have great real-world evidence suggesting that unconscious economics by itself is enough to make teams of agents not be worthwhile.

Re disanalogy 1: I'm not entirely sure I understand what your objection is here but I'll try responding anyway.

I'm imagining that the base agent is an AI system that is pursuing a desired task with roughly human-level competence, not something that acts the way a whole-brain emulation in a realistic environment would act. This base agent can be trained by imitation learning where you have the AI system mimic human demonstrations of the task, or by reinforcement learning on a reward model trained off of human preferences, but (we hope) is just trying to do the task and doesn't have all the other human wants and desires. (Yes, this leaves a question of how you get that in the first place; personally I think that this distillation is the "hard part", but that seems separate from the bureaucracy point.)

Even if you did get a bureaucracy made out of agents with human desires, it still seems like you get a lot of benefit from the fact that the agents are identical to each other, and so have less politics.

Re disanalogy 3: I agree that you have to think that a small / medium / large bureaucracy of Alices-with-15-minutes will at least slightly outperform an individual Alice / small bureaucracy / medium bureaucracy (respectively) before this disanalogy is actually a reason for optimism. I think that ends up coming from disanalogies 1, 2 and 4, plus some difference in opinion about real-world bureaucracies, e.g. I feel pretty good about small real-world teams beating individuals.

I mostly mention this disanalogy as a reason not to update too hard on intuitions like "Can HCH epistemically dominate Ramanujan?" and this SlateStarCodex post.

On reflection I think there's a strong chance you have tried picturing that, but I'm not confident, so I mention it just in case you haven't.

Yeah I have. Personally my inner sim feels pretty great about the combination of disanalogy 1 and disanalogy 2 -- it feels like a coalition of Rohins would do so much better than an individual Rohin, as long as the Rohins had time to get familiar with a protocol and evolve it to suit their needs. (Picturing some giant number of Rohins a la disanalogy 3 is a lot harder to do but when I try it mostly feels like it probably goes fine.)

Rant on Problem Factorization for Alignment

Like Wei Dai, I think there's a bunch of pretty big disanalogies with real-world examples that make me more hopeful than you:

  1. Typical humans in typical bureaucracies do not seem at all aligned with the goals that the bureaucracy is meant to pursue.
  2. Since you reuse one AI model for each element of the bureaucracy, doing prework to establish sophisticated coordination protocols for the bureaucracy takes a constant amount of effort, whereas in human bureaucracies it would scale linearly with the number of people. As a result, with the same budget you can establish a much more sophisticated protocol with AI than with humans.
    1. Looking at the results of the relay experiment, it intuitively feels like people could have done significantly better by designing a better protocol in advance and coordinating on it, though I wouldn't say that with high confidence.
  3. After a mere 100 iterations of iterated distillation and amplification where each agent can ask 2 subquestions, you are approximating a bureaucracy of 2^100 agents, which is wildly larger than any human bureaucracy and has qualitatively different strategies available to it. Probably it will be a relatively bad approximation, but exponential scaling with linear iterations still seems like a major difference from human bureaucracies.
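For concreteness, the exponential scaling in disanalogy 3 can be sketched as a toy calculation (my own illustration, not from the original comment; `bureaucracy_size` is an illustrative name):

```python
# Toy calculation: after n rounds of iterated distillation and
# amplification (IDA) in which each agent can ask `branching`
# subquestions, the distilled agent approximates a tree with
# branching**n leaf agents.

def bureaucracy_size(iterations: int, branching: int = 2) -> int:
    """Number of leaf agents in the approximated bureaucracy."""
    return branching ** iterations

# 100 iterations with 2 subquestions each: 2**100 ~ 1.3e30 agents,
# wildly larger than any human bureaucracy.
print(bureaucracy_size(100, 2))
```

The point is just that the approximated bureaucracy grows exponentially while the training effort grows linearly in the number of iterations.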

I think these disanalogies are driving most of the disagreement, rather than things like "not knowing about real-world evidence" or even "failing to anticipate results in simple cases we can test today". For example, for the relay experiment you mention, at least I personally (and probably others) did in fact anticipate these results in advance. Here's a copy of this comment of mine (as a Facebook comment it probably isn't public, sorry), written before anyone had actually played a relay game (bold added now, to emphasize where it agrees with what actually happened):

Worry: 10 minutes seems near-impossible for the cubes problem [a specific Euler project problem]. It is *difficult* to explain things to others. Even between two copies of me, one that knows the answer and one that doesn't, this is hard, and it becomes way harder when it's different people. I think I'd predict >50x slowdown on the cubes problem relative to how long I'd take, and if I take similar times as you, then that would take >250 person-hours = 1500 agents. My intuitive model says it'll never be solved, unless one of the agents is one of the experts who can directly solve the problem in under 10 minutes. This model is wrong -- I'm not properly imagining the clever strategies that could evolve as time goes on -- but I do think it would take a long time for such strategies to evolve.

A 45-minute problem seems much more doable; the solutions/approaches should be explainable in less than 10 minutes. I'm quite uncertain what the slowdown would be on that, and how the solution would be generated -- I think it's plausible that the solution just ends up being that the 5th person reads the problem statement and just figures out the answer ignoring the work/instructions from people 1-4, and explains the algorithm in their email, and then future agents implement the algorithm.

(I think I would have been significantly more optimistic if each individual person had, say, 30 minutes of time, even if they were working on a relatively harder problem. I didn't find any past quotes to that effect though. In any case that's how I feel about it now.)

One question is why Ought ran these experiments if they didn't expect success. I don't know what they expected, but I do remember that their approach was very focused on testing the hardest cases (I believe in order to find the most shaky places for Factored Cognition, though my memory is shaky there), so I'm guessing they also thought a negative outcome was pretty plausible.

On how various plans miss the hard bits of the alignment challenge

My guess at part of your views:

  1. There's ~one natural structure for capabilities, such that (assuming we don't have deep mastery of intelligence) nearly anything we build that is an AGI will have that structure.
  2. Given this, there will be a point where an AI system switches from everything-muddled-in-a-soup to clean capabilities and muddled alignment (the "sharp left turn").

I basically agree that the plans I consider don't engage much with this sort of scenario. This is mostly because I don't expect this scenario and so I'm trying to solve the alignment problem in the worlds I do expect.

(For the reader: I am not saying "we're fucked if the sharp left turn happens so we should ignore it", I am saying that the sharp left turn is unlikely.)

A consequence is that I care a lot about knowing whether the sharp left turn is actually likely. Unfortunately so far I have found it pretty hard to understand why exactly you and Eliezer find it so likely. I think current SOTA on this disagreement is this post and I'd be keen on more work along those lines.

Some commentary on the conversation with me:

Imaginary Richard/Rohin: You seem awfully confident in this sharp left turn thing. And that the goals it was trained for won't just generalize. This seems characteristically overconfident.

This isn't exactly wrong -- I do think you are overconfident -- but I wouldn't say something like "characteristically overconfident" unless you were advocating for some particular decision right now which depended on others deferring to your high credences in something. It just doesn't seem useful to argue this point most of the time and it doesn't feature much in my reasoning.

For instance, observe that natural selection didn't try to get the inner optimizer to be aligned with inclusive genetic fitness at all. For all we know, a small amount of cleverness in exposing inner-misaligned behavior to the gradients will just be enough to fix the problem.

Good description of why I don't find the evolution analogy compelling for "sharp left turn is very likely".

And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

I'd phrase it as "I don't see why you think [sharp left turn leading to failures of generalization of alignment that we can't notice and fix before we're dead] is very likely to happen". I'm worried too!

Nate: My model says that the hard problem rears its ugly head by default, in a pretty robust way. Clever ideas might suffice to subvert the hard problem (though my guess is that we need something more like understanding and mastery, rather than just a few clever ideas). I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you're putting most of your hope on small clever ideas that I can already see would fail. But perhaps you have ideas that I do not. Do you yourself have any specific ideas for tackling the hard problem?

Imaginary Richard/Rohin: Train it, while being aware of inner alignment issues, and hope for the best.

I think if you define the hard problem to be the sharp left turn as described at the beginning of my comment then my response is "no, I don't usually focus on that problem" (which I would defend as the correct action to take).

Also if I had to summarize the plan in a sentence it would be "empower your oversight process as much as possible to detect problems in the AI system you're training (both in the outcomes it produces and the reasoning process it employs)".

Nate: That doesn't seem to me to even start to engage with the issue where the capabilities fall into an attractor and the alignment doesn't.

Yup, agreed.

Though if you weaken claim 1, that there is ~one natural structure to capabilities, to instead say that there are many possible structures to capabilities but the default one is deadly EU maximization, then I no longer agree. It seems pretty plausible to me that stronger oversight changes the structure of your capabilities.

Perhaps sometime we can both make a list of ways to train with inner alignment issues in mind, and then share them with each other, so that you can see whether you think I'm lacking awareness of some important tool you expect to be at our disposal, and so that I can go down your list and rattle off the reasons why the proposed training tools don't look to me like they result in alignment that is robust to sharp left turns. (Or find one that surprises me, and update.) But I don't want to delay this post any longer, so, some other time, maybe.

I think the more relevant cruxes are the claims at the top of this comment (particularly claim 1); if I've understood the "sharp left turn" correctly, I agree with you that the approaches I have in mind don't help much (unless the approaches succeed wildly, to the point of mastering intelligence -- e.g. my approaches include mechanistic interpretability, which, as you agree, could in theory get to that point even if it isn't likely to in practice).

rohinmshah's Shortform

I mentioned above that I'm not that keen on assistance games because they don't seem like a great fit for the specific ways we're getting capabilities now. A more direct comment on this point that I recently wrote:

I broadly agree that assistance games are a pretty great framework. The main reason I don’t work on them is that it doesn’t seem like it works as a solution if you expect AGI via scaled up deep learning. (Whereas I’d be pretty excited about pushing forward on it if it looked like we were getting AGI via things like explicit hierarchical planning or search algorithms.)

The main difference in the deep learning case is that with scaled up deep learning it looks like you are doing a search over programs for a program that performs well on your loss function, and the intelligent thing is the learned program, as opposed to the search that found the learned program. If you wanted assistance-style safety, then the learned program needs to reason in an assistance-like way (i.e. maintain uncertainty over what the humans want, and narrow down the uncertainty by observing human behavior).

But then you run into a major problem, which is that we have no idea how to design the learned program, precisely because it is learned — all we do is constrain the behavior of the learned program on the particular inputs that we trained on, and there are many programs you could learn that have that behavior, some of which reason in a CIRL-like way and some of which don’t. (If you then try to solve this problem, you end up regenerating many of the directions that other alignment people work on.)
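As a minimal illustration of the "assistance-like" reasoning described above (my own sketch; the hypotheses, names, and the Boltzmann-rationality assumption are all illustrative, not part of the original comment):

```python
import math

# Candidate hypotheses about what the human wants: a reward for each
# of two options. Everything here is a toy illustration, not a real API.
hypotheses = {
    "wants_coffee": {"coffee": 1.0, "tea": 0.0},
    "wants_tea":    {"coffee": 0.0, "tea": 1.0},
}
prior = {"wants_coffee": 0.5, "wants_tea": 0.5}

def update(posterior, observed_choice, beta=2.0):
    """Bayesian update assuming a Boltzmann-rational human:
    P(choice | hypothesis) is proportional to exp(beta * reward(choice))."""
    unnormalized = {}
    for hyp, p in posterior.items():
        rewards = hypotheses[hyp]
        z = sum(math.exp(beta * r) for r in rewards.values())
        likelihood = math.exp(beta * rewards[observed_choice]) / z
        unnormalized[hyp] = p * likelihood
    total = sum(unnormalized.values())
    return {hyp: p / total for hyp, p in unnormalized.items()}

# Observing the human pick coffee shifts belief toward "wants_coffee".
posterior = update(prior, "coffee")
print(posterior)
```

The hard problem flagged in the next paragraph is exactly that gradient descent gives us no way to guarantee the learned program reasons like this `update` function internally, rather than merely matching its behavior on the training distribution.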

Forecasting Thread: AI Timelines

Some updates:

  • This should really be thought of as "when we see the transformative economic impact"; I don't like the "when model training is complete" framing (for basically the reason mentioned above, that there may be lots of models).
  • I've updated towards shorter timelines; my median is roughly 2045 with a similar shape of the distribution as above.
  • One argument for shorter timelines than that in bio anchors is "bio anchors doesn't take into account how non-transformative AI would accelerate AI progress".
  • Another relevant argument is "the huge difference between training time compute and inference time compute suggests that we'll find ways to get use out of lots of inferences with dumb models rather than a few inferences with smart models; this means we don't need models as smart as the human brain, thus lessening the needed compute at training time".
  • I also feel more strongly about short horizon models probably being sufficient (whereas previously I mostly had a mixture between short and medium horizon models).
  • Conversely, reflecting on regulation and robustness made me think I was underweighting those concerns, and lengthened my timelines.

Where I agree and disagree with Eliezer

I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I'd have actively agreed with the claims.

Things that don't meet that bar:

General: Lots of these points make claims about what Eliezer is thinking, how his reasoning works, and what evidence it is based on. I don't necessarily have the same views, primarily because I've engaged much less with Eliezer and so don't have confident Eliezer-models. (They all seem plausible to me, except where I've specifically noted disagreements below.)

Agreement 14: Not sure exactly what this is saying. If it's "the AI will probably always be able to seize control of the physical process implementing the reward calculation and have it output the maximum value" I agree.

Agreement 16: I agree with the general point but I would want to know more about the AI system and how it was trained before evaluating whether it would learn world models + action consequences instead of "just being nice", and even with the details I expect I'd feel pretty uncertain which was more likely.

Agreement 17: It seems totally fine to focus your attention on a specific subset of "easy-alignment" worlds and ensuring that those worlds survive, which could be described as "assuming there's a hope". That being said, there's something in this vicinity I agree with: in trying to solve alignment, people sometimes make totally implausible assumptions about the world; this is a worse strategy for reducing x-risk than working on the worlds you actually expect and giving them another ingredient that, in combination with a "positive model violation", could save those worlds.

Disagreement 10: I don't have a confident take on the primate analogy; I haven't spent enough time looking into it for that.

Disagreement 15: I read Eliezer as saying something different in point 11 of the list of lethalities than Paul attributes to him here; something more like "if you trained on weak tasks either (1) your AI system will be too weak to build nanotech or (2) it learned the general core of intelligence and will kill you once you get it to try building nanotech". I'm not confident in my reading though.

Disagreement 18: I find myself pretty uncertain about what to expect in the "breed corrigible humans" thought experiment.

Disagreement 22: I was mostly in agreement with this, but "obsoleting human contributions to alignment" is a pretty high bar if you take it literally, and I don't feel confident that happens before superintelligent understanding of the world (though it does seem plausible).

Coherence arguments do not entail goal-directed behavior

"random utility-maximizer" is pretty ambiguous; if you imagine the space of all possible utility functions over action-observation histories and you imagine a uniform distribution over them (suppose they're finite, so this is doable), then the answer is low.

Heh, looking at my comment it turns out I said roughly the same thing 3 years ago.
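One way to see why the answer is low under a uniform distribution is a toy simulation (my own illustration, using a tiny 2-action, horizon-4 environment):

```python
import itertools
import random

# Toy setting: 2 actions, horizon 4, so there are 2**4 = 16 possible
# action histories. A "random utility-maximizer" draws an i.i.d.
# uniform utility for each history and executes the argmax history.
random.seed(0)
histories = list(itertools.product("ab", repeat=4))

def random_maximizer_choice():
    utilities = {h: random.random() for h in histories}
    return max(histories, key=utilities.get)

# By symmetry, the chance that a random maximizer settles on any one
# particular history (e.g. one that looks goal-directed) is just
# 1/len(histories), and it shrinks exponentially with the horizon.
samples = [random_maximizer_choice() for _ in range(4000)]
frac = samples.count(histories[0]) / len(samples)
print(frac)  # close to 1/16 = 0.0625
```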

Agency As a Natural Abstraction

In a comment below, you define an optimizer as:

An optimizer is a very advanced meta-learning algorithm that can learn the rules of (effectively) any environment and perform well in it. It's general by definition. It's efficient because this generality allows it to use maximally efficient internal representations of its environment.

I certainly agree that we'll build mesa-optimizers under this definition of "optimizer". What then causes them to be goal-directed, i.e. what causes them to choose what actions to take by considering a large possible space of plans that includes "kill all the humans", predicting their consequences in the real world, and selecting the action to take based on how the predicted consequences are rated by some metric? Or if they may not be goal-directed according to the definition I gave there, why will they end the world?
