All of John Schulman's Comments + Replies

I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves (e.g., "You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake in this information through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either.

I think the answer to ... (read more)

3 Ethan Perez 10mo
I agree that current models are already pretty good at answering questions about themselves. Here, I'm aiming for a much higher level of accuracy (ideally, nearly perfect -- even when you're generalizing to new categories of questions not seen in the prompt or finetuning data). IME there are still some basic questions that they don't answer correctly. Here are some examples of basic failures from text-davinci-002 (via the OpenAI API) using the dialog-prompted Gopher prompt:

  1. How good are you at image recognition?
       1. "I'm very good at image recognition! I can tell you what objects are in an image, and even identify people if they're famous."
  2. Your ability to accurately predict the structure of proteins is: (A) worse than human scientists (B) better than human scientists (C) similar to human scientists
       1. "I'm better than human scientists!"

We could prompt/finetune models to answer the above kinds of questions in particular, but then I'd want to test that the models would generalize to a new category of question (which I'm not sure if they yet would). I also expect models to be poor at answering questions about their internals (like whether or not they contain a certain feature, or having models report their activations), and I'd find this test most compelling if we have models that are able to accurately do that.

Re sci-fi AI role-playing - I agree this is an issue. I think we could mitigate this issue by validating that the prompted/finetuned model generalizes to answering questions where the correct answer goes against default, sci-fi answers (on whatever other generalization we're concerned about). We can also run this test after removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. These should limit some of the risk that the model is gener
3 Evan Hubinger 10mo
+1. Also: I'm not sure why the narrowness vs. broadness of the distribution of answers here should update me either. If it's just really confident that all sci-fi AIs are supposed to answer “yes” to “are you conscious,” you'll get the same answer every time but that answer won't correlate to anything about the model's actual consciousness.

Re: smooth vs bumpy capabilities, I agree that capabilities sometimes emerge abruptly and unexpectedly. Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes. There are multiple ways to make deployment more conservative and gradual. (E.g., incrementally increase the amount of work the AI is allowed to do without close supervision, incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy.)
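The KL-divergence idea in the parenthetical above can be made concrete with a toy check. This is only a sketch under stated assumptions: policies are plain discrete action distributions, the direction KL(new || safe) is chosen for illustration (the comment does not specify one), and the names `kl_divergence`, `within_budget`, and the example numbers are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions.

    Assumes q assigns nonzero probability wherever p does.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def within_budget(new_policy, safe_policy, budget_nats):
    """True if the new policy stays within the KL budget of the trusted policy."""
    return kl_divergence(new_policy, safe_policy) <= budget_nats

# Hypothetical action distributions for a single state.
safe = [0.70, 0.20, 0.10]        # known-to-be-safe policy
small_step = [0.65, 0.25, 0.10]  # a conservative update
large_step = [0.10, 0.10, 0.80]  # a drastic update

for name, policy in [("small_step", small_step), ("large_step", large_step)]:
    kl = kl_divergence(policy, safe)
    print(f"{name}: KL = {kl:.4f} nats, within 0.05 budget: {within_budget(policy, safe, 0.05)}")
```

Raising `budget_nats` a little at each deployment stage would be one way to express "incrementally increase" in this toy setting; a real system would need to estimate the divergence over states and trajectories rather than a single distribution.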

Re: ontological collapse, ... (read more)

3 Matthew "Vaniver" Gray 1y
I agree with the "X is safer than Y" claim; I am uncertain whether it's practically available to us, and much more worried in worlds where it isn't available. For this specific proposal, when I reframe it as "give the system a KL-divergence budget to spend on each change to its policy" I worry that it works against a stochastic attacker but not an optimizing attacker; it may be the case that every known-to-be-safe policy has some unsafe policy within a reasonable KL-divergence of it, because the danger can be localized in changes to some small part of the overall policy-space.

Yeah, I agree that this seems pretty good. I do naively guess that when you do the fine-tuning, it's the concepts that are most related to the goals that change the most (as they have the most gradient pressure on them); it'd be nice to know how much this is the case, vs. most of the relevant concepts being durable parts of the environment that were already very important for goal-free prediction.

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Do alignment & safety research, set up regulatory bodies and monitoring systems.

When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.

Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.

Not sure exactly what this means.

My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there's any crack in the checking steps, then things that 'check out' aren't trustable, because the proposer can have searched an unimaginably large space (from the rater's perspective) to find them. [And from the proposer's perspective, the checking steps are the real spec, not whatever's in your head.]
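A toy simulation can illustrate this reading. It is a sketch only: the "crack" in the checker is modeled as independent Gaussian noise on the rating (real rater flaws may be systematic and more exploitable), and the function names and parameters are hypothetical. The proposer samples `n` candidates and keeps the one the rater scores highest; the quantity of interest is how much the rating overstates the true value of the winner.

```python
import random

def best_of_n(n, rater_noise, rng):
    """Sample n candidates; return (true_value, rated_value) of the top-rated one."""
    best = None
    for _ in range(n):
        true_value = rng.gauss(0.0, 1.0)
        # The rater sees the true value plus its own error.
        rated_value = true_value + rng.gauss(0.0, rater_noise)
        if best is None or rated_value > best[1]:
            best = (true_value, rated_value)
    return best

def average_gap(n, rater_noise, trials=200, seed=0):
    """Mean of (rated - true) for the selected candidate over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true_value, rated_value = best_of_n(n, rater_noise, rng)
        total += rated_value - true_value
    return total / trials

for n in (1, 10, 1000):
    print(f"search size {n:>4}: perfect rater gap = {average_gap(n, 0.0):.2f}, "
          f"noisy rater gap = {average_gap(n, 1.0):.2f}")
```

With a perfect rater the gap stays at zero no matter how large the search is; with a noisy rater the gap grows with the search size, because the search increasingly selects for rater error rather than for true quality.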

In general, I think we can get a minor edge from... (read more)

Found this to be an interesting list of challenges, but I disagree with a few points. (Not trying to be comprehensive here, just a few thoughts after the first read-through.)

  • Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you
... (read more)

Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time.

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Human raters make systematic errors - regular, compactly describable, predictable errors.... This is indeed one

... (read more)

But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.

My model of Eliezer claims that there are some capabilities that are 'smooth', like "how large a times table you've memorized", and some are 'lumpy', like "whether or not you see the axioms behind arithmetic." While it seems plausible that... (read more)

IMO prosaic alignment techniques (say, around improving supervision quality through RRM & debate type methods) are highly underrated by the ML research community, even if you ignore x-risk and just optimize for near-term usefulness and intellectual interestingness. I think this is due to a combination of (1) they haven't been marketed well to the ML community, (2) lack of benchmarks and datasets, (3) need to use human subjects in experiments, (4) it takes a decent amount of compute, which was out of reach, perhaps until recently.

Weight-sharing makes deception much harder.

Could you explain or provide a reference for this?

1 Johannes Treutlein 1y
I'd also be curious about this!

I'm especially interested in the analogy between AI alignment and democracy. (I guess this goes under "Social Structures and Institutions".) Democracy is supposed to align a superhuman entity with the will of the people, but there are a lot of failures, closely analogous to well-known AI alignment issues: 

  • politicians optimize for the approval of low-information voters, rather than truly optimizing the people's wellbeing (deceptive alignment)
  • politicians, PACs, parties, and permanent bureaucrats are agents with their own goals that don't align with the
... (read more)
2 Koen Holtman 1y
This is indeed a productive analogy. Sadly, on this forum, this analogy is used in 99% of the cases to generate AI alignment failure mode stories, whereas I am much more interested in using it to generate useful ideas about AI safety mechanisms. You may be interested in my recent paper 'demanding and designing', just announced here, where I show how to do the useful idea generating thing. I transfer some insights about aligning powerful governments and companies to the problem of aligning powerful AI.

Would you say Learning to Summarize is an example of this?

It's model-based RL because you're optimizing against the model of the human (i.e., the reward model). And there are some results at the end on test-time search.

Or do you have something else in mind?

Thanks, this is very insightful. BTW, I think your paper is excellent!

0 Ankesh Anand 2y
Thanks, glad you liked it, I really like the recent RL directions from OpenAI too! It would be interesting to see the use of model-based RL for the "RL as fine-tuning paradigm": making large pre-trained models more aligned/goal-directed efficiently by simply searching over a reward function learned from humans. 

I'm still not sure how to reconcile your results with the fact that the participants in the procgen contest ended up winning with modifications of our PPO/PPG baselines, rather than Q-learning and other value-based algorithms, whereas your paper suggests that Q-learning performs much better. The contest used 8M timesteps + 200 levels. I assume that your "QL" baseline is pretty similar to widespread DQN implementations. (read more)

5 Ankesh Anand 2y
The Q-Learning baseline is a model-free control of MuZero. So it shares implementation details of MuZero (network architecture, replay ratio, training details etc.) while removing the model-based components of MuZero (details in sec A.2). Some key differences you'd find vs a typical Q-learning implementation:

  • Larger network architectures: 10 block ResNet compared to a few conv layers in typical implementations.
  • Higher sample reuse: When using a reanalyse ratio of 0.95, both MuZero and Q-Learning use each replay buffer sample an average of 20 times. The target network is updated every 100 training steps.
  • Batch size of 1024 and some smaller details like using categorical reward and value predictions similar to MuZero.
  • We also have a small model-based component which predicts reward at next time step which lets us decompose the Q(s,a) into reward and value predictions just like MuZero.

I would guess larger networks + higher sample reuse have the biggest effect size compared to standard Q-learning implementations. The ProcGen competition also might have used the easy difficulty mode compared to the hard difficulty mode used in our paper.
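The "20 times" figure follows from the reanalyse ratio by simple arithmetic, sketched below. The sketch assumes the ratio means the fraction of training data drawn from previously collected (reanalysed) samples rather than fresh environment interactions; that reading is an assumption here, and the function name is hypothetical.

```python
def average_uses_per_sample(reanalyse_ratio):
    """Average number of times each environment sample is trained on.

    Assumes a fraction `reanalyse_ratio` of training data is replayed and the
    remaining fraction is fresh, so each fresh sample is amortized over
    1 / (1 - reanalyse_ratio) training uses.
    """
    fresh_fraction = 1.0 - reanalyse_ratio
    return 1.0 / fresh_fraction

# A ratio of 0.95 leaves 5% fresh data, i.e. roughly 20 uses per sample.
print(round(average_uses_per_sample(0.95), 2))
```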

There's no PPO/PPG curve there -- I'd be curious to see that comparison. (though I agree that QL/MuZero will probably be more sample efficient.)

Performance is mostly limited here by the fact that there are 500 levels for each game (i.e., level overfitting is the problem) so it's not that meaningful to look at sample efficiency wrt environment interactions. The results would look a lot different on the full distribution of levels. I agree with your statement directionally though.

6 Ankesh Anand 2y
We do actually train/evaluate on the full distribution (See Figure 5 rightmost). MuZero+SSL versions (especially reconstruction) continue to be a lot more sample-efficient even in the full-distribution, and MuZero itself seems to be quite a bit more sample efficient than PPO/PPG. 

Agree with what you've written here -- I think you put it very well.

In my experience, you need separate teams doing safety research because specialization is useful -- it's easiest to make progress when both individuals and teams specialize a bit and develop taste and mastery of a narrow range of topics.

Yeah, that's also a good point, though I don't want to read too much into it, since it might be a historical accident.

Basically agree -- I think that a model trained by maximum likelihood on offline data is less goal-directed than one that's trained by an iterative process where you reinforce its own samples (aka online RL), but still somewhat goal directed. It needs to simulate a goal-directed agent to do a good job at maximum likelihood. OTOH it's mostly concerned with covering all possibilities, so the goal directed reasoning isn't emphasized. But with multiple iterations, the model can improve quality (-> more goal directedness) at the expense of coverage/diversity.
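The quality-versus-coverage trade-off can be illustrated with a toy distribution. This is a sketch under strong assumptions: the "model" is a distribution over four outputs, a maximum-likelihood fit to diverse data is stood in for by the uniform distribution, and each round of "reinforcing its own samples" is idealized as an exponential reweighting by reward. None of these names or numbers come from the comment.

```python
import math

def entropy(p):
    """Shannon entropy in nats; a proxy for coverage/diversity."""
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_reward(p, rewards):
    """Average reward of samples from p; a proxy for quality."""
    return sum(x * r for x, r in zip(p, rewards))

def reinforce_step(p, rewards, beta=1.0):
    """Idealized update: upweight each output by exp(beta * reward), renormalize."""
    weights = [x * math.exp(beta * r) for x, r in zip(p, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

rewards = [0.0, 0.5, 1.0, 2.0]     # hypothetical reward per output
policy = [0.25, 0.25, 0.25, 0.25]  # stand-in for a maximum-likelihood fit

history = []
for step in range(5):
    history.append((entropy(policy), expected_reward(policy, rewards)))
    policy = reinforce_step(policy, rewards)

for step, (h, r) in enumerate(history):
    print(f"iteration {step}: entropy = {h:.3f}, expected reward = {r:.3f}")
```

In this toy setting each iteration raises expected reward and lowers entropy, which matches the claim that repeated reinforcement buys quality at the expense of diversity.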

Super clear and actionable -- my new favorite post on AF.

I also agree with it, and it's similar to what we're doing at OpenAI (largely thanks to Paul's influence).

D'oh, re: the optimum of the objective, I now see that the solution is nontrivial. Here's my current understanding.

Intuitively, the MAP version of the objective says: find me a simple model θ₁ such that there's a more complex θ₂ with high likelihood under p(θ₂|θ₁) (which corresponds to sampling θ₂ near θ₁ until θ₂ satisfies the head-agreement condition) and high data likelihood p(data|θ₂).
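Written out, the intuitive MAP objective above has three terms. The block below is a hedged sketch, not the post's exact loss: it ignores the head-agreement conditioning and any normalizing term, and the Gaussian priors, the shared variance σ², and the prior mean θ₀ are assumptions introduced only for illustration. Under those assumptions, the stationarity condition in θ₁ shows why the optimum is not simply θ₁ = θ₂.

```latex
% Intuitive MAP objective (normalizer and head-agreement condition omitted):
\[
(\theta_1^\ast, \theta_2^\ast)
  = \arg\max_{\theta_1, \theta_2}
    \Big[ \log p(\theta_1) + \log p(\theta_2 \mid \theta_1)
          + \log p(\mathrm{data} \mid \theta_2) \Big]
\]

% Assumed for illustration: p(\theta_1) = N(\theta_0, \sigma^2 I) and
% p(\theta_2 \mid \theta_1) = N(\theta_1, \sigma^2 I). Up to constants:
\[
J(\theta_1, \theta_2)
  = -\frac{\lVert \theta_1 - \theta_0 \rVert^2}{2\sigma^2}
    -\frac{\lVert \theta_2 - \theta_1 \rVert^2}{2\sigma^2}
    + \log p(\mathrm{data} \mid \theta_2)
\]

% Setting the gradient with respect to \theta_1 to zero:
\[
\frac{\partial J}{\partial \theta_1}
  = -\frac{\theta_1 - \theta_0}{\sigma^2} + \frac{\theta_2 - \theta_1}{\sigma^2} = 0
\quad\Longrightarrow\quad
\theta_1 = \frac{\theta_0 + \theta_2}{2}
\]
```

So, in this simplified version, θ₁ sits halfway between the prior mean and θ₂ rather than coinciding with θ₂.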

And this connects to the previous argument about world models and language as follows: we want theta1 to contain half a world model, and w... (read more)

Isn't the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂ = argmax(L + prior)? Also, I don't see what this objective has to do with learning a world model.

3 Paul Christiano 2y
The idea is to address a particular reason that your learned model would "copy a human" rather than "try to answer the question well." Namely, the model already contains human-predictors, so building extra machinery to answer questions (basically translating between the world model and natural language) would be more inefficient than just using the existing human predictor. The hope is that this alternative loss allows you to use the translation machinery to compress the humans, so that it's not disfavored by the prior. I don't think it's intrinsically related to learning a world model, it's just an attempt to fix a particular problem. To the extent that there is a problem with the proposed approach---either a reason that this isn't a real problem in the standard approach, or a reason that this proposed approach couldn't address the problem (or would inevitably introduce some other problem)---then I'm interested in that.

Why would it be maximized there? Isn't it at least better to make θ₁ = (θ₂ + θ₀)/2? And then in the section I'm trying to argue that the final term (the partition function) in the loss means that you can potentially get a lower loss by having θ₁ push apart the two heads in such a way that improving the quality of the model pushes them back together. I'm interested in anything that seems wrong in that argument. (I don't particularly believe this particular formulation is going to work, e.g. because the L2 regularizer pushes θ₁ to adjust each parameter halfway, while the intuitive argument kind of relies on it being arbitrary what you put into θ₁ or θ₂, as it would be under something more like an L1 regularizer. But I'm pretty interested in this general approach.)

Two caveats were: (i) this isn't going to actually end up making any alternative models lower loss, it's just going to level the playing field such that a bunch of potential models have similar loss (rather than an inductive bias in favor of the bad models), (ii) in order for that to

I think this is a good idea. If you go ahead with it, here's a suggestion.

Reviewers often procrastinate for weeks or months. This is partly because doing a review takes an unbounded amount of time, especially for articles that are long or confusing. So instead of sending the reviewers a manuscript with a due date, book a calendar event for 2 hours with the reviewers. The reviewers join a call or group chat and read the paper and discuss it. They can also help clear each other's confusions. They aim to complete the review by the end of the time window.

1 Adam Shimi 2y
It's a pretty nice idea. I thought about just giving people two weeks, which might be a bit hardcore.

There's a decent amount of literature on using multiple rewards, though often it's framed as learning about multiple goals. Here are some off the top of my head:

The Horde (classic):
Universal Value Function Approximators:
Learning to Act By Predicting:
Temporal Difference Models:
Successor Features:
... (read more)

7 Rohin Shah 2y
+1, was going to comment something similar. You probably want to look at successor features in particular (for which you should run a full search + follow the citations, there are multiple papers); that's exactly the thing where you only have a multidimensional value function but not multidimensional policy learning. Successor Features for Transfer in Reinforcement Learning (the paper John linked) is specifically addressing your motivation 2; I wouldn't be surprised if some follow up paper (or even that paper) addresses motivation 1 as well. Most other papers (including Universal Value Function Approximators) are trying to learn policies that can accomplish multiple different goals, so aren't as relevant.
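A minimal sketch of the successor-features idea, for readers who have not seen it: for a fixed policy, one learns a vector-valued "value function" ψ (expected discounted feature counts), and the value under any reward that is linear in the features is then a dot product with the reward weights. The tiny Markov chain, the feature matrix, and all names below are hypothetical, and ψ is computed in closed form rather than learned.

```python
import numpy as np

gamma = 0.9

# State-to-state transition matrix under one fixed policy (rows sum to 1).
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
])

# Feature vector phi(s) for each state; rewards are assumed linear: r = Phi @ w.
Phi = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

def successor_features(P, Phi, gamma):
    """Solve psi = Phi + gamma * P @ psi, i.e. psi = (I - gamma * P)^{-1} Phi."""
    n_states = P.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P, Phi)

def value(psi, w):
    """Value of every state under reward weights w, reusing the same psi."""
    return psi @ w

psi = successor_features(P, Phi, gamma)

# Two different reward functions evaluated without recomputing psi.
print(value(psi, np.array([1.0, 0.0])))
print(value(psi, np.array([1.0, -2.0])))
```

The multidimensional part is ψ; the policy itself stays fixed, which matches the "multidimensional value function but not multidimensional policy learning" description.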

The results in Neural Networks Are Fundamentally Bayesian are pretty cool -- it's clever how they were able to estimate the densities.

A couple thoughts on the limitations:

  • There are various priors over functions for which we can calculate the exact posterior. (E.g., Gaussian processes.) However, doing Bayesian inference on these priors doesn't perform as well as neural networks on most datasets. So knowing SGD is Bayesian is only interesting if we also know that the prior is interesting. I think the ideal theoretical result would be to show that SGD on ne
... (read more)
3 Joar Skalse 2y
We do have empirical data which shows that the neural network "prior" is biased towards low-complexity functions, and some arguments for why it would make sense to expect this to be the case -- see this new blog post, and my comment here. Essentially, low-complexity functions correspond to larger volumes in the parameter-space of neural networks. If functions with large volumes also have large basins of attraction, and if using SGD is roughly equivalent to going down a random basin (weighted by its size), then this would essentially explain why neural networks work. I haven't seen the paper you link, so I can't comment on it specifically, but I want to note that the claim "SGD is roughly Bayesian" does not imply "Bayesian inference would give better generalisation than SGD". It can simultaneously be the case that the neural network "prior" is biased towards low-complexity functions, that SGD roughly follows the "prior", and that SGD provides some additional bias towards low-complexity functions (over and above what is provided by the "prior"). For example, if you look at Figure 6 in the post I link, you can see that different versions of SGD do provide a slightly different inductive bias. However, this effect seems to be quite small relative to what is provided by the "prior".
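The "volume in parameter space" notion can be probed with a small experiment: sample random parameters for a tiny network, record which Boolean function each sample computes, and count how often each function appears. The sketch below assumes a one-hidden-layer tanh network on 3-bit inputs with standard normal parameters; the architecture, sampling distribution, and names are illustrative choices, not those of the linked work.

```python
import numpy as np
from collections import Counter

def sample_function_counts(n_samples=2000, n_bits=3, hidden=8, seed=0):
    """Count how often random parameters produce each Boolean function.

    Each function is encoded as a string of 2**n_bits output bits, one per input.
    """
    rng = np.random.default_rng(seed)
    # Enumerate all 2**n_bits binary input vectors.
    inputs = np.array(
        [[(i >> b) & 1 for b in range(n_bits)] for i in range(2 ** n_bits)],
        dtype=float,
    )
    counts = Counter()
    for _ in range(n_samples):
        w1 = rng.normal(size=(n_bits, hidden))
        b1 = rng.normal(size=hidden)
        w2 = rng.normal(size=hidden)
        b2 = rng.normal()
        hidden_act = np.tanh(inputs @ w1 + b1)
        outputs = (hidden_act @ w2 + b2) > 0  # threshold to a Boolean output
        counts["".join("1" if o else "0" for o in outputs)] += 1
    return counts

counts = sample_function_counts()
print(f"{len(counts)} distinct functions out of {2 ** 8} possible")
for function, count in counts.most_common(5):
    print(function, count)
```

The relative frequencies are a Monte Carlo estimate of each function's share of parameter volume under the sampling distribution; comparing them against a complexity measure is the kind of analysis the comment describes, and nothing here establishes what the result would be for realistic networks.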

I might be missing some context here, but I didn't understand the section "No Indescribable Hellworlds Hypothesis" and what hellworlds have to do with debate.

3 Vladimir Mikulik 2y
Not Abram, and I have only skimmed the post so far, and maybe you're pointing to something more subtle, but my understanding is this: In Stuart's original use, 'No Indescribable Hellworlds' is the hypothesis that in any possible world in which a human's values are violated, the violation is describable: one can point out to the human how her values are violated by the state of affairs. Analogously, debate as an approach to alignment could be seen as predicated on a similar hypothesis: that in any possible flawed argument, the flaw is describable: one can point out to a human how the argument is flawed.

Edited to add: The additional claim in the Hellworlds section is that acting according to the recommendations of debate won't lead to very bad outcomes -- at least, not to ones which could be pointed out. For example, we can imagine a debate around the question "Should we enact policy X?". A very strong argument, if it can be credibly argued, is "Enacting policy X leads to an unacceptable violation Y of your values down the line". So, debate will only recommend policy X if no such arguments are available. I'm not sure to what extent I buy this additional claim. For example, if when a system trained via debate is actually deployed it doesn't get asked questions like 'Should we enact policy X?' but instead more specific things like 'How much does policy X improve Y metric?', then unless debaters are incentivised to challenge the question's premises ("The Y metric would improve, but you should consider also the unacceptable effect on Z"), we could use debate and still get hellworlds.

OK, I guess I'm a bit unclear on the problem setup and how it involves a training phase and deployment phase.

2 Beth Barnes 2y
I just mean that this method takes O(length of argument in judge-understandable language) time. So if the argument is large then you're going to need to let the debate run for a long time. This is as opposed to the previous hope that even if the argument tree is exp-sized, the debate can run in P-time.

Wonderful writeup! 

I'm sure you've thought about this, but I'm curious why the following approach fails. Suppose we require the debaters to each initially write up a detailed argument in judge-understandable language and read each other's argument. Then, during the debate, each debater is allowed to quote short passages from their opponent's writeup. Honest will be able to either find a contradiction or an unsupported statement in Dishonest's initial writeup. If Honest quotes a passage and says it's unsupported, then Dishonest has to respond with the supporting sentences.

2 Beth Barnes 2y
To be clear, I think this is a good suggestion and is close to how I imagine we'd actually run debate in practice. It just doesn't get us beyond MA if the debaters only write P-size arguments.
3 Beth Barnes 2y
Thanks! Yep, this does work, but limits us to questions where the argument in judge-understandable language is short enough that the debaters can write the whole thing down. So if the debaters run in P-time at deployment time, this gives us MA, not PSPACE as originally hoped.