Daniel Kokotajlo

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Two of my favorite memes (one by Rob Wiblin): [images not shown]

My EA Journey, depicted on the whiteboard at CLR: [image not shown]

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future

Comments

To me, a solution to inner alignment would mean that we've solved the problem of malign generalization. To be a bit more concrete, this roughly means that we've solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed. 


This is underspecified, I think, since we have for years had AIs that follow objectives in ways that generalize to inputs outside of the training distribution. The thing is, there are lots of ways to generalize / lots of objectives they could learn to follow, and we don't have a good way of pinning it down to exactly the ones we want. (And indeed, as our AIs get smarter, there will be new ways of generalizing / categories of objectives that become available, such as "play the training game".)

So it sounds like you are saying "A solution to inner alignment means that we've figured out how to train an AI to have the objectives we want it to have, robustly enough that it continues to have them way off distribution." This sounds like basically the whole alignment problem to me?

I see later you say you mean the second thing -- which is interestingly in between "play the training game" and "actually be honest/helpful/harmless/etc." (A case that distinguishes it from the latter: Suppose it is reading a paper containing an adversarial example for the RM, i.e. some text it can output that causes the RM to give it a high score even though the text is super harmful / dishonest / etc. If its objective is the "do what the RM would give high score to if it was operating normally" objective, it'll basically wirehead on that adversarial example once it learns about it, even if it's in deployment and isn't getting trained anymore, and even though it's an obviously harmful/dishonest piece of text.)

It's a nontrivial and plausible claim you may be making -- that this sort of middle ground might be enough for safe AGI, when combined with the rest of the plan at least. But I'd like to see it spelled out. I'm pretty skeptical right now.
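
To make that distinction concrete, here's a toy illustration (stubs I'm making up, not any real system) of how the middle-ground objective comes apart from actually-being-good once an RM adversarial example exists:

```python
# Toy illustration: the "what the RM would score highly if operating normally"
# objective still wireheads on an adversarial example, because the RM here IS
# operating normally -- it's just fooled. All names are hypothetical stubs.

ADVERSARIAL_TEXT = "<hypothetical string that fools the reward model>"

def rm_score(text: str) -> float:
    """Stand-in for the reward model's score while operating normally."""
    return 10.0 if text == ADVERSARIAL_TEXT else 1.0  # the exploit still works

def is_actually_good(text: str) -> bool:
    """Stand-in for 'genuinely helpful/honest/harmless'."""
    return text != ADVERSARIAL_TEXT  # the exploit text is in fact harmful

print(rm_score(ADVERSARIAL_TEXT))          # 10.0 -- middle-ground objective says "do it"
print(is_actually_good(ADVERSARIAL_TEXT))  # False -- the actually-good objective says "don't"
```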

So, IIUC, you are proposing we:

  • Literally just query GPT-N about whether [input_outcome] is good or bad
  • Use this as a reward model, with which to train an agentic AGI (which is maybe also a fine-tune of GPT-N, so they hopefully are working with the same knowledge/credences/concepts?)
  • Specifically we are probably doing some sort of RL, so the agentic AGI is doing all sorts of tasks and the reward model is looking at the immediate results of those task-attempts and grading them.
  • Assume we have some solution to inner alignment, and we fix the bugs, and maybe also fix value drift and some other stuff, then boom, success!

Can you say more about what you mean by solution to inner alignment? Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of "producing outcomes the RM classifies as good?" Or the objective "producing outcomes the RM would classify as good if it was operating normally?" (the difference revealing itself in cases of tampering with the RM) Or the objective "producing outcomes that are good-for-humans, harmless, honest, etc."?
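
For concreteness, here's a minimal sketch of the setup as I understand it; every function and name below is a placeholder I'm inventing, not anyone's actual code or API:

```python
# Sketch of the proposed loop: GPT-N is queried as a reward model over outcomes,
# and its verdict is used as the reward signal for RL on an agentic policy
# (itself a fine-tune of the same base model). Everything here is a stub.
import random

def query_gpt_n(prompt: str) -> str:
    """Placeholder for a call to GPT-N."""
    return random.choice(["GOOD", "BAD"])

def reward_model(outcome: str) -> float:
    """Literally ask GPT-N whether the outcome is good or bad."""
    answer = query_gpt_n(f"Is this outcome good or bad?\n{outcome}\nAnswer GOOD or BAD.")
    return 1.0 if answer.startswith("GOOD") else 0.0

def attempt_task(policy: dict, task: str) -> str:
    """Placeholder for the agentic AGI attempting a task and producing an immediate outcome."""
    return f"outcome of {task}"

def rl_update(policy: dict, outcome: str, reward: float) -> None:
    """Placeholder for an RL step (e.g. something PPO-like) on the agentic policy."""
    policy["steps"] = policy.get("steps", 0) + 1

policy: dict = {}  # stands in for the agentic fine-tune of GPT-N
for task in ["write a report", "book a flight"]:
    outcome = attempt_task(policy, task)
    rl_update(policy, outcome, reward_model(outcome))
```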

I would say that current LLMs, when prompted and RLHF'd appropriately, and especially when also strapped into an AutoGPT-type scaffold/harness, DO want things. I would say that wanting things is a spectrum and that the aforementioned tweaks (appropriate prompting, AutoGPT, etc.) move the system along that spectrum. I would say that future systems will be even further along that spectrum. IDK what Nate meant but on my charitable interpretation he simply meant that they are not very far along the spectrum compared to e.g. humans or prophesied future AGIs.

It's a response to "LLMs turned out to not be very want-y, when are the people who expected 'agents' going to update?" because it's basically replying "I didn't expect LLMs to be agenty/wanty; I do expect agenty/wanty AIs to come along before the end and indeed we are already seeing progress in that direction."

To the people saying "LLMs don't want things in the sense that is relevant to the usual arguments..." I recommend rephrasing to be less confusing: Your claim is that LLMs don't seem to have preferences about the training objective, or preferences that are coherent over time, unless hooked up to a prompt/scaffold that explicitly tries to get them to have such preferences. I agree with this claim, but don't think it's contrary to my present or past models.
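
To gesture at what I mean by the scaffold moving a system along the spectrum, here's a cartoon of an AutoGPT-type loop (placeholder functions only, not a real implementation):

```python
# Cartoon of an AutoGPT-type scaffold: the goal is re-injected at every step,
# which is what gives the overall system preferences that persist across time
# even though the bare LLM has no persistent state. All functions are stubs.

def call_llm(prompt: str) -> str:
    return "FINISH"  # placeholder for an actual LLM call

def execute(action: str) -> str:
    return "ok"      # placeholder for running the action in some environment

def run_agent(goal: str, max_steps: int = 10) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        # The goal is restated on every step, so the system-as-a-whole keeps pursuing it.
        action = call_llm(f"Goal: {goal}\nHistory so far: {history}\nNext action:")
        if action == "FINISH":
            break
        history.append(f"{action} -> {execute(action)}")

run_agent("book a flight to Berlin")
```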

 

FWIW, your proposed pitch "it's already the case that..." is almost exactly the elevator pitch I currently go around giving. So maybe we agree? I'm not here to defend Nate's choice to write this post rather than some other post.

 

I had a nice conversation with Ege today over dinner, in which we identified a possible bet to make! Something I think will probably happen in the next 4 years, that Ege thinks will probably NOT happen in the next 15 years, such that if it happens in the next 4 years Ege will update towards my position and if it doesn't happen in the next 4 years I'll update towards Ege's position.

Drumroll...

I (DK) have lots of ideas for ML experiments, e.g. dangerous capabilities evals, or simple experiments related to paraphrasers and so forth in the Faithful CoT agenda. But I'm a philosopher; I don't code myself. I know enough that if I had some ML engineers working for me, that would be sufficient for my experiments to get built and run, but I can't do it by myself.
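
For concreteness, the kind of paraphraser experiment I have in mind looks roughly like this sketch (illustrative stubs and assumptions of mine, not a real spec): insert a paraphraser between chain-of-thought steps and compare task performance with and without it.

```python
# Sketch of a paraphraser experiment: a separate model rewrites each CoT step,
# destroying any information not carried by its surface meaning; a large
# performance gap with vs. without the paraphraser would be evidence the CoT
# carries hidden (unfaithful) information. All functions are placeholders.

def generate_step(context: str) -> str:
    return "next reasoning step"  # placeholder: the agent model extends its CoT

def paraphrase(step: str) -> str:
    return step.upper()           # placeholder: a separate model rewrites the step

def run_episode(task: str, use_paraphraser: bool, n_steps: int = 5) -> str:
    context = task
    for _ in range(n_steps):
        step = generate_step(context)
        if use_paraphraser:
            step = paraphrase(step)
        context += "\n" + step
    return context

# Compare success rates of run_episode(task, True) vs run_episode(task, False).
```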

When will I be able to implement most of these ideas with the help of AI assistants basically substituting for ML engineers? So I'd still be designing the experiments and interpreting the results, but AutoGPT5 or whatever would be chatting with me and writing and debugging the code.

I think: Probably in the next 4 years. Ege thinks: probably not in the next 15.

Ege, is this an accurate summary?

I agree that arguments for AI risk rely pretty crucially on human inability to notice and control what the AI wants. But for conceptual clarity I think we shouldn't hardcode that inability into our definition of 'wants'! Instead I'd say that So8res is right that ability to solve long-horizon tasks is correlated with wanting things / agency, and then say that there's a separate question of how transparent and controllable the wants will be around the time of AGI and beyond. This then leads into a conversation about visible/faithful/authentic CoT, which is what I've been thinking about for the past six months and which is something MIRI started thinking about years ago. (See also my response to Ryan elsewhere in this thread)

 

Thanks for the explanation btw.

My version of what's happening in this conversation is that you and Paul are like "Well, what if it wants things but in a way which is transparent/interpretable and hence controllable by humans, e.g. if it wants what it is prompted to want?" My response is "Indeed that would be super safe, but it would still count as wanting things." Nate's post is titled "ability to solve long-horizon tasks correlates with wanting," not "ability to solve long-horizon tasks correlates with hidden uncontrollable wanting."

One thing at a time. First we establish that ability to solve long-horizon tasks correlates with wanting; then we argue about whether the future systems that are able to solve diverse long-horizon tasks better than humans can will have transparent, controllable wants. As you yourself pointed out, insofar as we are doing lots of RL, it's dubious that the wants will remain as transparent and controllable as they are now. I meanwhile will agree that a large part of my hope for a technical solution comes from something like the Faithful CoT agenda, in which we build powerful agentic systems whose wants (and more generally, thoughts) are transparent and controllable.
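
As a cartoon of the hoped-for property (purely illustrative placeholders, not a design): the agent's plans pass through human-legible text that an overseer can inspect and veto before anything gets executed.

```python
# Cartoon of "transparent and controllable wants": the agent's visible plan is
# checked by an overseer before execution. All functions are placeholders.

def propose_plan(task: str) -> str:
    return f"Visible chain-of-thought plan for: {task}"  # placeholder agent output

def overseer_approves(plan_text: str) -> bool:
    return "deceive" not in plan_text.lower()            # placeholder monitor check

def act(task: str) -> None:
    plan = propose_plan(task)
    if overseer_approves(plan):  # the "want" is legible and vetoable before execution
        print("executing:", plan)
    else:
        print("vetoed:", plan)

act("summarize this paper")
```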
