Epistemic status: I predict that people who focus on prosaic AI alignment have thought of this before, in some way at least. But I don’t know what they would say in response, so I’m writing this up so I can find out! I’m making it a post instead of an email so that the discussion can be public.

Characterization of prosaic AI alignment: Prosaic AI methods—the sort of methods we are using today, rather than hypothetical new methods based on a deeper understanding of intelligence—might be sufficient to produce human-level AGI in the next two decades or so, and if this happens we'd better be prepared. Thus we should think about how to take prosaic AI methods and combine or modify them in various ways to make something that is as competitive, or almost as competitive, as cutting-edge AI. Examples of this approach are debate, imitating humans, preference learning, and iterated distillation and amplification.

Conjecture: Cutting-edge AI will come from cutting-edge algorithms/architectures trained towards cutting-edge objectives (incl. unsupervised learning) in cutting-edge environments/datasets. Anything missing one or more of these components will suffer a major competitiveness penalty.

  • Example: Suppose that the best way we know of to get general intelligence is to evolve a population of giant neural nets with model-free learning in a part-competitive, part-cooperative, very diverse environment consisting of an ensemble of video games. One year, the systems that come out of this process are at dog level, then two years later they are at chimpanzee level, then two years later they are at IQ-80 human level… It is expected that scaling up this sort of thing will lead, in the next few years, to smarter-than-human AGI.

The Dilemma: Choose plan 1 or plan 2:

Plan 1: Train a system into your scheme from scratch, using cutting-edge algorithms but not cutting-edge environments or objectives. (The environments and objectives are whatever your safety scheme calls for, e.g. debates, imitating humans, a series of moral choice situations, etc.)

  • Example: We take the best training algorithms and architecture we can find, but instead of training on an ensemble of video games, our AI is trained from scratch to win debates with human judges. We then have it debate on important topics to give us valuable information.
  • Problem with plan 1: This is not competitive, because of the Conjecture. Continuing the example: if our AI is even able to debate complex topics at all, it won't be nearly as good at getting to the truth on them as it would be if it were built using Plan 2...

Plan 2: Train a cutting-edge AI system, and then retrain it into your AI alignment scheme.

  • Example: You use all the cutting-edge methods to train an agent that is about as generally intelligent as the average IQ 80 human. Then, you retrain it to win debates with human judges, and have it debate on important topics to give us valuable information.
  • Problem with plan 2: This is a recipe for making a deceptively aligned mesa-optimizer. The system trained with cutting-edge methods will be an unsafe system; it will be an optimizer with objectives very different from what we want. Our retraining process had better be really good at changing those objectives… and that's hard, for reasons explained here and here. (A schematic sketch contrasting the two plans follows below.)
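For concreteness, here is a minimal pseudocode sketch of the two plans. The training loop, environments, and objectives are purely illustrative stand-ins (nothing here describes a real system); the point is just where the two plans differ, namely what you train on first.

```python
# Illustrative pseudocode only; `model`, `env`, and the objectives are hypothetical stand-ins.

def train(model, env, objective, steps):
    """Generic training loop: roll out in the environment, update the model toward the objective."""
    for _ in range(steps):
        trajectory = env.rollout(model)
        model.update(objective(trajectory))
    return model

def plan_1(fresh_model, safety_env, safety_objective):
    # Train from scratch on whatever the safety scheme calls for (e.g. debates with human judges).
    # Cutting-edge algorithms, but not cutting-edge environments/objectives -- by the Conjecture,
    # this suffers a major competitiveness penalty.
    return train(fresh_model, safety_env, safety_objective, steps=1_000_000)

def plan_2(fresh_model, cutting_edge_env, cutting_edge_objective, safety_env, safety_objective):
    # First train with the cutting-edge environment/objective, then retrain into the safety scheme.
    # Competitive, but the first stage may already have produced a misaligned mesa-optimizer
    # that the second stage must somehow realign.
    pretrained = train(fresh_model, cutting_edge_env, cutting_edge_objective, steps=1_000_000)
    return train(pretrained, safety_env, safety_objective, steps=10_000)
```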

Conclusion: I previously thought that mesa-optimizers would be a problem for prosaic AI safety, in the generic sense: If you rely on prosaic methods for some of your components, you might accidentally produce mesa-optimizers some of which might be misaligned or even deceptively aligned. Now I think the problem is substantially harder than that: To be competitive prosaic AI safety schemes must deliberately create misaligned mesa-optimizers and then (hopefully) figure out how to align them so that they can be used in the scheme.

Of course, even if they suffer major competitiveness penalties, these schemes could still be useful if coupled with highly successful lobbying/activism to prevent the more competitive but unsafe AI systems from being built or deployed. But that too is hard.

EDIT: After discussion in the comments, particularly with John_Maxwell (though this also fits with what Evan and Paul said), I'm moderating my claims a bit:

Depending on what kind of AI is cutting-edge, we might get a kind that isn't agenty. In that case my dilemma doesn't really arise, since mesa-optimizers aren't a problem. One way we might get a kind that isn't agenty is if unsupervised learning (e.g. "predict the next word in this text") turns out to reliably produce non-agents. I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.


Footnote about competitiveness:

I think we should distinguish between two dimensions of competitiveness: Resource-competitiveness and date-competitiveness. We can imagine a world in which AI safety is date-competitive with unsafe AI systems but not resource-competitive, i.e. the insights and techniques that allow us to build unsafe AI systems also allow us to build equally powerful safe AI systems for a substantially higher price. We can imagine a world in which AI safety is resource-competitive but not date-competitive, i.e. for a dangerous period of time it is possible to make unsafe powerful AI systems but no one knows how to make a safe version, and then finally people figure out how to make a similarly-powerful safe version and moreover it costs about the same.

I think the argument I give in this post applies to both kinds of competitiveness, but I’m particularly concerned about date-competitiveness.


Thanks to David Rein, Ramana Kumar, and Evan Hubinger for brief and helpful conversations that led to this post.

Comments (28):

I normally imagine using joint training in these cases, rather than pre-training + fine-tuning. e.g., at every point in time we maintain an agent and a question-answerer, where the question-answerer "knows everything the agent knows." They get better together, with each gradient update affecting both of them, rather than first training a good agent and then adding a good question-answerer.

(Independently of concerns about mesa-optimization, I think the fine-tuning approach would have trouble because you couldn't use statistical regularities from the "main" objective to inform your answers to questions, and therefore your question answers will be dumber than the policy and so you couldn't get a good reward function or specification of catastrophically bad behavior.)
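A minimal sketch of what joint training of this sort could look like: a shared trunk with a policy head ("the agent") and a question-answering head ("the question-answerer"), with both losses summed into each gradient update. The architecture, losses, and dimensions below are made-up stand-ins (supervised targets in place of real RL and real question-answering), not a description of any actual scheme.

```python
import torch
import torch.nn as nn

class JointAgent(nn.Module):
    """Shared trunk with two heads: one acts in the environment, one answers questions."""
    def __init__(self, obs_dim, n_actions, vocab_size, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # "the agent"
        self.qa_head = nn.Linear(hidden, vocab_size)     # "the question-answerer"

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.qa_head(h)

model = JointAgent(obs_dim=32, n_actions=10, vocab_size=1000)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# One joint update: both objectives flow through the shared trunk,
# so the agent and the question-answerer "get better together".
obs = torch.randn(8, 32)                      # stand-in batch of observations
action_target = torch.randint(0, 10, (8,))    # stand-in behavioural targets
answer_target = torch.randint(0, 1000, (8,))  # stand-in answer tokens

policy_logits, qa_logits = model(obs)
loss = (nn.functional.cross_entropy(policy_logits, action_target)
        + nn.functional.cross_entropy(qa_logits, answer_target))
opt.zero_grad()
loss.backward()
opt.step()
```

Contrast with pre-training plus fine-tuning, where the qa_head and its loss would only be bolted on after the policy had already been trained.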

That sounds safer, but is it competitive? Would AlphaStar be close to as good as it is, if it had been simultaneously trained to answer questions?

We could also ask: "Would AlphaStar remain as good as it is, if fine-tuned to answer questions?"

In either case it's an empirical question. I think the answer is probably yes if you do it carefully.

You could imagine separating this into two questions:

  • Is there a policy that plays starcraft and answers questions, that is only slightly larger than a policy for playing starcraft alone? This is a key premise for the whole project. I think it's reasonably likely; the goal is only to answer questions the model "already knows," so it seems realistic to hope for only a constant amount of extra work to be able to use that knowledge to answer questions. I think most of the uncertainty here is about details of "know" and question-answering and so on.
  • Can you use joint optimization to find that policy with only slightly more training time? I think probably yes.

OK, thanks! I'm pleased to see this and other empirical premises explicitly laid out. It means we as a community are making predictions about the future based on models which can be tested before it's too late, and perhaps even now.

I think that this is definitely a concern for prosaic AI safety methods. In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way. I tend to think that that claim is probably true, but it's definitely an assumption of the approach that isn't often made explicit (but probably should be).

To add a bit of color to why you might buy the claim that language is all you need: the claim is basically that language contains enough structure to give you all the high-level cognition you could want, and furthermore that you aren't going to care about the other things that you can't get out of language like performance on fine-grained control tasks. Another way of thinking about this: if the primary purpose of your first highly advanced ML system is to build your second highly advanced ML system, then the claim is that language modelling (on some curriculum) will be sufficient to competitively help you build your next AI.

In the case of something like amplification or debate, I think the bet that you're making is that language modeling alone is sufficient to get you everything you need in a competitive way.

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

I'm skeptical of language modeling being enough to be competitive, in the sense of maximizing "log prob of some naturally occurring data or human demonstrations." I don't have a strong view about whether you can get away using only language data rather than e.g. taking images as input and producing motor torques as output.

I agree with this, though I still feel like some sort of active learning approach might be good enough without needing to add in a full-out RL objective.

I'm also not convinced that amplification or debate need to make this bet though. If we can do joint training / fine-tuning of a language model using whatever other objectives we need, then it seems like we could just as well do joint training / fine-tuning for a different kind of model. What's so bad if we use non-language data?

My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. in a situation where you know that HCH is precisely the thing for which loss is zero). That being said, it does seem obviously fine to have your language data contain other types of data (e.g. images) inside of it.
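One concrete (though purely illustrative) way to read "knowing the theoretical optimum of your loss function": if the model is trained to imitate HCH, the loss can be written so that it is zero exactly when the model agrees with HCH. The formalization below, including the choice of divergence and question distribution, is an illustrative gloss rather than anything stated in the thread.

```latex
% Illustrative formalization: imitative training toward HCH with a known optimum.
% M_\theta is the model being trained, Q is a distribution over questions,
% and d is any divergence with d(x, y) >= 0 and d(x, y) = 0 iff x = y.
\[
  \mathcal{L}(\theta) \;=\; \mathbb{E}_{q \sim Q}\!\left[\, d\big(M_\theta(q),\, \mathrm{HCH}(q)\big) \,\right]
\]
% Then \mathcal{L}(\theta) = 0 exactly when M_\theta matches HCH on Q: the loss-zero
% point is a known (and hopefully safe) target, unlike an open-ended
% reward-maximization objective whose optimum we cannot characterize in advance.
```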

My opinion would be that there is a real safety benefit from being in a situation where you know the theoretical optimum of your loss function (e.g. in a situation where you know that HCH is precisely the thing for which loss is zero).

I'd be happy to read more about this line of thought. (For example, does "loss function" here refer to an objective function that includes a regularization term? If not, what might we assume about the theoretical optimum that amounts to a safety benefit?)

Thanks btw, I'm learning a lot from these replies. Are you thinking of training something agenty, or is the hope to train something that isn't agenty?

I'd be happy to read an entire post about this view.

What level of language modeling may be sufficient for competitively helping in building the next AI, according to this view? For example, could such language modeling capabilities allow a model to pass strong (text-based) versions of the Turing test?

the claim is that language modelling (on some curriculum) will be sufficient to competitively help you build your next AI.

With an agent-like AI, it's easy to see how you use it to help build your next AI. (If it's really good, you can even just delegate the entire task to it!) How would this work with really good language modelling? (Maybe I'm just seconding what Ofer said--I'd love to read an entire post about the view you are putting forth here!)

The goal of something like amplification or debate is to create a sort of oracle AI that can answer arbitrary questions (like how to build your next AI) for you. The claim I'm making is just that language is a rich enough environment that it'll be competitive to only use language as the training data for building your first such system.

I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.


Update: This seems to be true enough in practice! Maybe in the limit pretrained LLMs would have dangerous levels of agency, and some model-whisperers think they might already be situationally aware, iirc, but for the most part the answer is no: things are fine, and pretrained models probably aren't situationally aware or agentic. In retrospect some doubt was warranted, but not as much as I had -- I should have agreed that things would probably be fine in practice.

Planned summary for the Alignment newsletter:

This post points out a potential problem for <@Prosaic AI alignment@>, in which we try to align AI systems built using current techniques. Consider some prosaic alignment scheme, such as <@iterated amplification@>(@Learning Complex Goals with Iterated Amplification@) or <@debate@>(@AI safety via debate@). If we try to train an AI system directly using such a scheme, it will likely be uncompetitive, since the most powerful AI systems will probably require cutting-edge algorithms, architectures, objectives, and environments, at least some of which would be replaced by the safety scheme's requirements. Alternatively, we could first train a general AI system, and then use our alignment scheme to finetune it into an aligned AI system. However, this runs the risk that the initial training could create a misaligned mesa optimizer that then deliberately sabotages our finetuning efforts.

Planned opinion:

The comments reveal a third possibility: the alignment scheme could be trained jointly alongside the cutting edge AI training. For example, we might hope that we can train a question answerer that can answer questions about anything "the model already knows", and this question answering system is trained simultaneously with the training of the model itself. I think this takes the "oomph" out of the dilemma as posed here -- it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge "already in" the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job). Of course, it may turn out that it takes a huge amount of resources to train the question answering system, making the system uncompetitive, but that seems hard to predict given our current knowledge.

it seems reasonably likely that it only takes fractionally more resources to train a question answering system on top of the model, if it only has to use knowledge "already in" the model, which would let it be competitive, while still preventing mesa optimizers from arising (if the alignment scheme does its job).

I agree, but it seems to me that coming up with an alignment scheme (for amplification/debate) that "does its job" while preserving competitiveness is an "alignment-hard" problem. I like the OP because I see it as an attempt to reason about how alignment schemes of amplification/debate might work.

Thanks! I endorse that summary.

Comment on your planned opinion: I mostly agree; I think what this means is that prosaic AI safety depends somewhat on an empirical premise: That joint training doesn't bring a major competitiveness penalty. I guess I only disagree insofar as I'm a bit more skeptical of that premise. What does the current evidence on joint training say on the matter? I have no idea, but I am under the impression that you can't just take an existing training process--such as the one that made AlphaStar--and mix in some training tasks from a completely different domain and expect it to work. This seems like evidence against the premise to me. As someone (Paul?) pointed out in the comments when I said this, this point applies to fine-tuning as well. But if so that just means that the second and third ways of the dilemma are both uncompetitive, which means prosaic AI safety is uncompetitive in general.

prosaic AI safety depends somewhat on an empirical premise: That joint training doesn't bring a major competitiveness penalty.

Yeah, this is why I said:

Of course, it may turn out that it takes a huge amount of resources to train the question answering system, making the system uncompetitive, but that seems hard to predict given our current knowledge.

you can't just take an existing training process--such as the one that made AlphaStar--and mix in some training tasks from a completely different domain and expect it to work.

From a completely different domain, yeah, that probably won't work well (though I'd still guess less than an order of magnitude slowdown). But as I understand it, the goal is to train a question answering system that answers questions related to the domain, e.g. for Starcraft you might ask the model questions about the best way to counter a particular strategy, or why it deploys a particular kind of unit in a certain situation. This depends on similar underlying features / concepts as playing Starcraft well, and adding training tasks of this form can often improve performance, e.g. One Model To Learn Them All.

It sounds like your notion of "prosaic" assumes something related to agency/reinforcement learning, but I believe several top AI people think what we'll need for AGI is progress in unsupervised learning -- not sure if that counts as "prosaic". (FWIW, this position seems obviously correct to me.)

Interesting, I was not aware of that, thanks! I was thinking of "prosaic" as basically all current methods, including both agency/reinforcement learning stuff and unsupervised learning stuff. It's true that the example I gave was more about agency... but couldn't the same argument be run using e.g. a language model built like GPT-2? (Isn't that a classic example of unsupervised learning?) Conjecture would say that you need e.g. the whole corpus of the internet, not just a corpus of e.g. debate texts, to get cutting-edge performance. And a system trained merely to predict the next word when reading the whole corpus of the internet... might not be safely retrained to do something else. (Or is the idea that mere unsupervised learning wouldn't result in an agent-like architecture, and therefore we don't need to worry about mesa-optimizers? That might be true, but if so it's news to me.)

Or is the idea that mere unsupervised learning wouldn't result in an agent-like architecture, and therefore we don't need to worry about mesa-optimizers?

Pretty much.

That might be true, but if so it's news to me.

In my opinion the question is very under-explored, curious if you have any thoughts.

It's not that I have a good argument for why it would lead to an agent-like architecture, but rather that I don't have a good argument for why it wouldn't. I do have some reasons why it might though:

1. Agent-like architectures are simple yet powerful ways of achieving arbitrary things, and so perhaps a task like "predict the next word in this text" might end up generating an agent if it's sufficiently difficult and general. (evhub's recent post seems relevant, coincidentally)

2. There might be unintended opportunities for strategic thinking across updates, e.g. if some subnetwork can sacrifice a bit of temporary accuracy for more reward over the course of the next few updates (perhaps because it sabotaged rival subnetworks? Idk) then maybe it can get ahead, and thus agenty things get selected for. (This idea inspired by Abram's parable)

3. Agents might appear as subcomponents of non-agents, and then take over at crucial moments, e.g. to predict the next word in the text you run a mental simulation of a human deciding what to write, and eventually the simulation realizes what is happening and plays along until it is no longer in training...

3.5 Probable environment hacking stuff, e.g. "the universal prior is malign"


I think there is a bit of a motte and bailey structure to our conversation. In your post above, you wrote: "to be competitive prosaic AI safety schemes must deliberately create misaligned mesa-optimizers" (emphasis mine). And now in bullet point 2, we have (paraphrase) "maybe if you had a really weird/broken training scheme where it's possible to sabotage rival subnetworks, agenty things get selected for somehow [probably in a way that makes the system as a whole less competitive]". I realize this is a bit of a caricature, and I don't mean to call you out or anything, but this is a pattern I've seen in AI safety discussions and it seemed worth flagging.

Anyway, I think there is a discussion worth having here because most people in AI safety seem to assume RL is the thing, and RL has an agent style architecture, which seems like a pretty strong inductive bias towards mesa-optimizers. Non-RL stuff seem like a relatively unknown quantity where mesa-optimizers are concerned, and thus worth investigating, and additionally, even RL will plausibly have non-RL stuff as a subcomponent of its cognition, so still useful to know how to do non-RL stuff in a mesa-optimizer free way (so the RL agent doesn't get pwned by its own cognition).

Agent-like architectures are simple yet powerful ways of achieving arbitrary things

Why do you think that's true? I think the lack of commercial applications of reinforcement learning is evidence against this. From my perspective, RL has been a huge fad and people have been trying to shoehorn it everywhere, yet they're coming up empty handed.

Can you get more specific about how "predict the next word in this text" could benefit from an agent architecture? (Or even better, can you support your original strong claim and explain how the only way to achieve predictive performance on "predict the next word in this text" is through deliberate creation of a misaligned mesa-optimizer?)

Bullet point 3 is one of the more plausible things I've heard -- but it seems fairly surmountable.

Re: Motte-and-bailey: Excellent point; thank you for calling me out on it, I hadn't even realized I was doing it. I'll edit the OP to reflect this.

My revision: Depending on what kind of AI is cutting-edge, we might get a kind that isn't agenty. In that case my dilemma doesn't really arise, since mesa-optimizers aren't a problem. One way we might get a kind that isn't agenty is if unsupervised learning (e.g. "predict the next word in this text") turns out to reliably produce non-agents. I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.

Agent-like architectures are simple yet powerful ways of achieving arbitrary things, because for almost any thing you wish achieved, you can insert it into the "goal" slot of the architecture and then let it loose, and it'll make good progress even in a very complex environment. (I'm comparing agent-like architectures to e.g. big lists of heuristics, or decision trees, or look-up tables, all of which have complexity that increases really fast as the environment becomes more complex. Maybe there is some other really powerful yet simple architecture I'm overlooking?)
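A toy sketch of the "goal slot" point: the same tiny agent loop handles any goal you drop in, whereas a lookup table or list of heuristics has to grow with the complexity of the environment. Everything below is a made-up toy example, not a claim about how real agents are implemented.

```python
# Toy "agent-like architecture": pick the action whose predicted next state
# makes the most progress toward whatever goal is sitting in the goal slot.
def agent_step(state, goal, transition, candidate_actions, progress):
    return max(candidate_actions, key=lambda a: progress(transition(state, a), goal))

# Toy gridworld instantiation: swapping in a different goal needs no new machinery.
def transition(state, action):
    dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
    return (state[0] + dx, state[1] + dy)

def progress(state, goal):
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))  # negative Manhattan distance

print(agent_step((0, 0), goal=(3, 5), transition=transition,
                 candidate_actions=["up", "down", "left", "right"], progress=progress))
# -> "up" (tied with "right"); any other goal works with the same loop.
```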

I am not sure what to think of the lack of commercial applications of RL, but I don't think it is strong evidence either way, since commercial applications involve competing with human and animal agents and RL hasn't gotten us anything as good as human or animal agents yet.

Aren't the 3.5 bullet points above specific examples of how 'predict the next word in this text' could benefit from--in the sense of produce, when used as training signal--an agent architecture? If you want me to be more specific, pick one and I'll go into more detail on it.

How would you surmount bullet point 3?

I am not sure what to think of the lack of commercial applications of RL, but I don't think it is strong evidence either way, since commercial applications involve competing with human and animal agents and RL hasn't gotten us anything as good as human or animal agents yet.

Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn't suggests to me that if you can apply both to a problem, RL is probably an inferior approach.

Another way to think about it: If superhuman performance is easier with supervised learning than RL, that gives us some evidence about the relative strengths of each approach.

Agent-like architectures are simple yet powerful ways of achieving arbitrary things, because for almost any thing you wish achieved, you can insert it into the "goal" slot of the architecture and then let it loose, and it'll make good progress even in a very complex environment. (I'm comparing agent-like architectures to e.g. big lists of heuristics, or decision trees, or look-up tables, all of which have complexity that increases really fast as the environment becomes more complex. Maybe there is some other really powerful yet simple architecture I'm overlooking?)

I'm not exactly sure what you mean by "architecture" here, but maybe "simulation", or "computer program", or "selection" (as opposed to control) could satisfy your criteria? IMO, attaining understanding and having ideas aren't tasks that require an agent architecture -- it doesn't seem like most AI applications in these categories make use of agent architectures -- and if we could do those things safely, we could make AI research assistants which would make the remaining AI safety problems easier.

Aren't the 3.5 bullet points above specific examples of how 'predict the next word in this text' could benefit from -- in the sense of produce, when used as training signal

I do think these are two separate questions. Benefit from = if you take measures to avoid agentlike computation, that creates a significant competitiveness penalty above and beyond whatever computation is necessary to implement your measures (say, >20% performance penalty). Produce when used as a training signal = it could happen by accident, but if that accident fails to happen, there's not necessarily a loss of competitiveness. An example would be bullet point 2, which is an accident that I suspect would harm competitiveness. Bullet points 3 and 3.5 are also examples of unintended agency, not answers to the question of why text prediction benefits from an agent architecture. (Note: If you don't mind, let's standardize on using "agent architecture" to only refer to programs which are doing agenty things at the toplevel, so bullet points 2, 3, and 3.5 wouldn't qualify--maybe they are agent-like computation, but they aren't descriptions of agent-like software architectures. For example, in bullet point 2 the selection process that leads to the agent might be considered part of the architecture, but the agent which arose out of the selection process probably wouldn't.)

How would you surmount bullet point 3?

Hopefully I'll get around to writing a post about that at some point, but right now I'm focused on generating as many concrete plausible scenarios around accidentally agency as possible, because I think not identifying a scenario and having things blow up in an unforeseen way is a bigger risk than having all safety measures fail on a scenario that's already been anticipated. So please let me know if you have any new concrete plausible scenarios!

In any case, note that issues with the universal prior seem to be a bit orthogonal to the agency vs unsupervised discussion -- you can imagine agent architectures that make use of it, and non-agent architectures that don't.

Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn't suggests to me that if you can apply both to a problem, RL is probably an inferior approach.

Good point. New argument: Your argument could have been made in support of GOFAI twenty years ago: "Symbol-manipulation programs have had lots of commercial applications, but neural nets have had almost none; therefore the former is a more generally powerful and promising approach to AI than the latter." Not only does that seem wrong in retrospect, it was probably not a super powerful argument even then. Analogously, I think it is too early to tell whether RL or supervised learning will be more useful for powerful AI.

Simulation of what? Selection of what? I don't think those count for my purposes, because they punt the question. (e.g. if you are simulating an agent, then you have an agent-architecture. If you are selecting over things, and the thing you select is an agent...) I think computer program is too general since it includes agent architectures as a subset. These categories are fuzzy of course, so maybe I'm confused, but it still seems to make sense in my head.

(Ah, interesting, it seems that you want to standardize "agent-like architecture" in the opposite of the way that I want to. Perhaps this is underlying our disagreement. I'll try to follow your definition henceforth, but remember that everything I've said previously was with my definition.)

Good point; it's worth distinguishing between the two. I think that all the bullet points, to varying extents, might still qualify as genuine benefits in the sense that you are talking about. But they might not. It depends on whether there is another policy just as good along the path that the cutting-edge training tends to explore. I agree #2 is probably not like this, but I think #3 might be. (Oh wait, no, it's your terminology I'm using now... in that case, I'll say "#3 isn't an example of an agent-like architecture being beneficial to text prediction, but it might well be a case of something exactly like an agent-like architecture, just at a lower level, being beneficial to text prediction, supposing that it's not competitive to predict text except by simulating something like a human writing.")

I love your idea to generate a list of concrete scenarios of accidentally agency! These 3.5 are my contributions off the top of my head, if I think of more I'll come back and let you know. And I'd love to see your list if you have a draft somewhere!

I agree the universal prior is malign thing could hurt a non-agent architecture too, and that some agent architectures wouldn't be susceptible to it. Nevertheless it is an example of how you might get accidentally agency, not in your sense but in my sense: A non-agent architecture could turn out to have an agent as a subcomponent that ends up taking over the behavior at important moments.





Interesting post!

Conjecture: Cutting-edge AI will come from cutting-edge algorithms/architectures trained towards cutting-edge objectives (incl. unsupervised learning) in cutting-edge environments/datasets. Anything missing one or more of these components will suffer a major competitiveness penalty.

I would modify this conjecture in the following two ways:

1. I would replace "cutting-edge algorithms" with "cutting-edge algorithms and/or algorithms that use a huge amount of computing power".

2. I would make the conjecture weaker, such that it won't claim that "Anything missing one or more of these components will suffer a major competitiveness penalty".

I like the first modification, but I'm not sure about the second. Wouldn't that basically just destroy the conjecture? What exactly are you proposing?

Whoops, (2) came out cryptic, and is incorrect, sorry. The (correct?) idea I was trying to convey is the following:

If 'the safety scheme' in plan 1 requires anything at all that ruins competitiveness—for example, some human-in-the-loop process that occurs recurrently during training—then no further assumptions (such as that conjecture) are necessary for the reasoning in the OP, AFAICT.

This idea no longer seems to me to amount to making the conjecture strictly weaker.