Wiki Contributions


The argument I think is good (nr (2) in my previous comment) doesn't go through reference classes at all. I don't want to make an outside-view argument (eg "things we call optimization often produce misaligned results, therefore sgd is dangerous"). I like the evolution analogy because it makes salient some aspects of AI training that make misalignment more likely. Once those aspects are salient you can stop thinking about evolution and just think directly about AI.

evolution does not grow minds, it grows hyperparameters for minds.

Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example for how the results of an optimization process can be unexpected.

I want to distinguish two possible takes here:

  1. The argument from direct implication: "Humans are misaligned wrt evolution, therefore AIs will be misaligned wrt their objectives"
  2. Evolution as an intuition pump: "Thinking about evolution can be helpful for thinking about AI. In particular it can help you notice ways in which AI training is likely to produce AIs with goals you didn't want"

It sounds like you're arguing against (1). Fair enough, I too think (1) isn't a great take in isolation. If the evolution analogy does not help you think more clearly about AI at all then I don't think you should change your mind much on the strength of the analogy alone. But my best guess is that most people incl Nate mean (2).

I think the specification problem is still hard and unsolved. It looks like you're using a different definition of 'specification problem' / 'outer alignment' than others, and this is causing confusion.

IMO all these terms are a bit fuzzy / hard to pin down, and so it makes sense that they'd lead to disagreement sometimes. The best way (afaict) to avoid this is to keep the terms grounded in 'what would be useful for avoiding AGI doom'? To me it looks like on your definition, outer alignment is basically a trivial problem that doesn't help alignment much.

More generally, I think this discussion would be more grounded / useful if you made more object-level claims about how value specification being solved (on your view) might be useful, rather than meta claims about what others were wrong about.

Do you have an example of one way that the full alignment problem is easier now that we've seen that GPT-4 can understand & report on human values?

(I'm asking because it's hard for me to tell if your definition of outer alignment is disconnected from the rest of the problem in a way where it's possible for outer alignment to become easier without the rest of the problem becoming easier).

I think it's false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it's false is mostly that I haven't seen a claim like that made anywhere, including in the posts you cite.

I agree lots of the responses elide the part where you emphasize that it's important how GPT-4 doesn't just understand human values, but is also "willing" to answer questions somewhat honestly. TBH I don't understand why that's an important part of the picture for you, and I can see why some responses would just see the "GPT-4 understands human values" part as the important bit (I made that mistake too on my first reading, before I went back and re-read).

It seems to me that trying to explain the original motivations for posts like Hidden Complexity of Wishes is a good attempt at resolving this discussion, and it looks to me as if the responses from MIRI are trying to do that, which is part of why I wanted to disagree with the claim that the responses are missing the point / not engaging productively.

You make a claim that's very close to that - your claim, if I understand correctly, is that MIRI thought AI wouldn't understand human values and also not lie to us about it (or otherwise decide to give misleading or unhelpful outputs):

The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.

I think this is similar enough (and false for the same reasons) that I don't think the responses are misrepresenting you that badly. Of course I might also be misunderstanding you, but I did read the relevant parts multiple times to make sure, so I don't think it makes sense to blame your readers for the misunderstanding.

My paraphrase of your (Matthews) position: while I'm not claiming that GPT-4 provides any evidence about inner alignment (i.e. getting an AI to actually care about human values), I claim that it does provide evidence about outer alignment being easier than we thought: we can specify human values via language models, which have a pretty robust understanding of human values and don't systematically deceive us about their judgement. This means people who used to think outer alignment / value specification was hard should change their minds.

(End paraphrase)

I think this claim is mistaken, or at least it rests on false assumptions about what alignment researchers believe. Here's a bunch of different angles on why I think this:

  1. My guess is a big part of the disagreement here is that I think you make some wrong assumptions about what alignment researchers believe.

  2. I think you're putting a bit too much weight on the inner vs outer alignment distinction. The central problem that people talked about always was how to get an AI to care about human values. E.g. in The Hidden Complexity of Wishes (THCW) Eliezer writes

To be a safe fulfiller of a wish, a genie must share the same values that led you to make the wish.

If you find something that looks to you like a solution to outer alignment / value specification, but it doesn't help make an AI care about human values, then you're probably mistaken about what actual problem the term 'value specification' is pointing at. (Or maybe you're claiming that value specification is just not relevant to AI safety - but I don't think you are?).

  1. It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now also point at an LLM and get a result that's not all that much worse than pointing at a human is not cause for an update about how hard value specification is. Part of the difficulty is how to define the pointer to the human and get a model to maximize human values rather than maximize some error in your specification. IMO THCW makes this point pretty well.

  2. It's tricky to communicate problems in AI alignment―people come in with lots of different assumptions about what kind of things are easy / hard, and it's hard to resolve disagreements because we don't have a live AGI to do experiments on. I think THCW and related essays you criticize are actually great resources. They don't try to communicate the entire problem at once because that's infeasible. The fact that human values are complex and hard to specify explicitly is part of the reason why alignment is hard, where alignment means get the AI to care about human values, not get an AI to answer questions about moral behavior reasonably.

  3. You claim the existence of GPT-4 is evidence against the claims in THCW. But IMO GPT-4 fits in neatly with THCW. The post even starts with a taxonomy of genies:

There are three kinds of genies: Genies to whom you can safely say "I wish for you to do what I should wish for"; genies for which no wish is safe; and genies that aren't very powerful or intelligent.

GPT-4 is an example of a genie that is not very powerful or intelligent.

  1. If in 5 years we build firefighter LLMs that can rescue mothers from burning buildings when you ask them to, that would also not show that we've solved value specification - it's just a didactic example, not a full description of the actual technical problem. More broadly, I think it's plausible that within a few years LLM will be able to give moral counsel far better than the average human. That still doesn't solve value specification any more than the existence of humans that could give good moral counsel 20 years ago had solved value specification.

  2. If you could come up with a simple action-value function Q(observation, action), that when maximized over actions yields a good outcome for humans, then I think that would probably be helpful for alignment. This is an example of a result that doesn't directly make an AI care about human values, but would probably lead to progress in that direction. I think if it turned out to be easy to formalize such a Q then I would change my mind about how hard value specification is.

  3. While language models understand human values to some extent, they aren't robust. The RHLF/RLAIF family of methods is based on using an LLM as a reward model, and to make things work you need to be careful not to optimize too hard or you'll just get gibberish (Gao et al. 2022). LLMs don't hold up against mundane RLHF optimization pressure, nevermind an actual superintelligence. (Of course, humans wouldn't hold up either).

Broadly agree with the takes here.

However, these results seem explainable by the widely-observed tendency of larger models to learn faster and generalize better, given equal optimization steps.

This seems right and I don't think we say anything contradicting it in the paper.

I also don't see how saying 'different patterns are learned at different speeds' is supposed to have any explanatory power. It doesn't explain why some types of patterns are faster to learn than others, or what determines the relative learnability of memorizing versus generalizing patterns across domains. It feels like saying 'bricks fall because it's in a brick's nature to move towards the ground': both are repackaging an observation as an explanation.

The idea is that the framing 'learning at different speeds' lets you frame grokking and double descent as the same thing. More like generalizing 'bricks move towards the ground' and 'rocks move towards the ground' to 'objects move towards the ground'. I don't think we make any grand claims about explaining everything in the paper, but I'll have a look and see if there's edits I should make - thanks for raising these points.

There are positive feedback loops between prongs:

  • Successfully containing & using more capable models (p1) gives you more scary demos for p2
  • Success in p1 also speeds up p3 a lot, because:
    • 1) You can empirically study AGI directly, 
    • 2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
    • 3) Maybe you can even have (proto-)AGIs do research for you
  • You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour it into capabilities research.
  • Better alignment research (p3) lets you do more p1 type risky stuff with SOTA models (on the margin)


If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard then we probably only get ‘narrow’ tools to help with p3 and need to mostly do it ourselves, and hopefully get ML researchers to delay for long enough.

A three-pronged approach to AGI safety. (This is assuming we couldn't just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).

Prong 1: boxing & capability control (aka ‘careful bootstrapping’)

  • Make the AGI as capable as possible, under the constraint that you can make sure it can’t break out of the box or do other bad stuff. 
  • Assume the AGI is misaligned. Be super paranoid
  • Goal: get useful work out of boxed AGIs.
    • For example, AIs might be able to do interpretability really well.
    • More generally, for any field there is an effective 2x-10x research productivity boost you can probably get, depending on the field (maybe 100x? unclear, I’m sceptical).
  • Goal: learn more about AGI by studying the systems you've built.
  • To test your security & oversight procedures, you run tests where you train your AGI to escape in a sandbox.
  • This is operationally hard and needs lots of buy-in
  • Research directions: scalable oversight, interpretability for oversight, auditing, adversarial training, capability control / “unlearning”, scaling laws & capabilities forecasting.

Prong 2: scary demos and and convincing people that AGI is dangerous

  • Goal 1: shut it all down, or failing that slow down capabilities research.
  • Goal 2: get operational & political support for the entire approach, which is going to need lots of support, esp first prong
  • In particular make sure that research productivity boosts from AGI don’t feed back into capabilities research, which requires high levels of secrecy + buy-in from a large number of people.
    • Avoiding a speed-up is probably a little bit easier than enacting a slow-down, though maybe not much easier.
  • Demos can get very scary if we get far into prong 1, e.g. we have AGIs that are clearly misaligned or show that they are capable of breaking many of our precautions.

Prong 3: alignment research aka “understanding minds”

  • Goal: understand the systems well enough to make sure they are at least corrigible, or at best ‘fully aligned’.
  • Roughly this involves understanding how the behaviour of the system emerges in enough generality that we can predict and control what happens once the system is deployed OOD, made more capable, etc.
  • Relevant directions: agent foundations / embedded agency, interpretability, some kinds of “science of deep learning”
Load More