John Schulman

Wiki Contributions


Cool, find-tuning sounds a bit like conditional Kolmogorov complexity -- the cost of your explanation would be K(explanation of rare thing | explanation of the loss value and general functionality)

I'm not sure I fully understand how you would use surprise accounting in a practical scenario. In the circuit example, you're starting out with a statement that is known to be true (by checking all inputs) -- that the circuit always outputs 1. But in most practical scenarios, you have a hypothesis (e.g., "the model will never do $bad_thing under any input") that you're not sure is true. So where do you get the statement that you'd apply the surprise accounting to? I guess you could start out with "the model didn't do $bad_thing on 1 million randomly sampled inputs", try to find an explanation of this finding with low total surprise, and then extend the explanation into a proof that applies to all inputs. Is that right?

I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves. ("You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake in this information through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either. 

I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations. Dialog-prompted models probably fall back on literary depictions of AI for self-oriented questions they don't know how to answer, so the answer might depend on which sci-fi AI the model is role-playing. (It's harder to say what determines the OOD behavior for models trained with more sophisticated methods like RLHF.)

Re: smooth vs bumpy capabilities, I agree that capabilities sometimes emerge abruptly and unexpectedly. Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes. There are multiple ways to make deployment more conservative and gradual. (E.g., incrementally increase the amount of work the AI is allowed to do without close supervision, incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy.)

Re: ontological collapse, there are definitely some tricky issues here, but the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn't really have goals and isn't good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Do alignment & safety research, set up regulatory bodies and monitoring systems.

When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.

Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.

Found this to be an interesting list of challenges, but I disagree with a few points. (Not trying to be comprehensive here, just a few thoughts after the first read-through.)

  • Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.
  • One claim is that Capabilities generalize further than alignment once capabilities start to generalize far. The argument is that an agent's world model and tactics will be automatically fixed by reasoning and data, but its inner objective won't be changed by these things. I agree with the preceding sentence, but I would draw a different (and more optimistic) conclusion from it. That it might be possible to establish an agent's inner objective when training on easy problems, when the agent isn't very capable, such that this objective remains stable as the agent becomes more powerful.
    Also, there's empirical evidence that alignment generalizes surprisingly well: several thousand instruction following examples radically improve the aligned behavior on a wide distribution of language tasks (InstructGPT paper) a prompt with about 20 conversations gives much better behavior on a wide variety of conversational inputs (HHH paper). Making a contemporary language model well-behaved seems to be much easier than teaching it a new cognitive skill.
  • Human raters make systematic errors - regular, compactly describable, predictable errors.... This is indeed one of the big problems of outer alignment, but there's lots of ongoing research and promising ideas for fixing it. Namely, using models to help amplify and improve the human feedback signal. Because P!=NP it's easier to verify proofs than to write them. Obviously alignment isn't about writing proofs, but the general principle does apply. You can reduce "behaving well" to "answering questions truthfully" by asking questions like "did the agent follow the instructions in this episode?", and use those to define the reward function. These questions are not formulated in formal language where verification is easy, but there's reason to believe that verification is also easier than proof-generation for informal arguments.

IMO prosaic alignment techniques (say, around improving supervision quality through RRM & debate type methods) are highly underrated by the ML research community, even if you ignore x-risk and just optimize for near-term usefulness and intellectual interestingness. I think this is due to a combination of (1) they haven't been marketed well to the ML community, (2) lack of benchmarks and datasets, (3) need to use human subjects in experiments, (4) it takes a decent amount of compute, which was out of reach, perhaps until recently.

Weight-sharing makes deception much harder.

Could you explain or provide a reference for this?

I'm especially interested in the analogy between AI alignment and democracy. (I guess this goes under "Social Structures and Institutions".) Democracy is supposed to align a superhuman entity with the will of the people, but there are a lot of failures, closely analogous to well-known AI alignment issues: 

  • politicians optimize for the approval of low-information voters, rather than truly optimizing the people's wellbeing (deceptive alignment)
  • politician, pacs, parties, permanent bureaucrats are agents with their own goals  that don't align with the populace (mesa optimizers)

I think it's more likely that insights will transfer from the field of AI alignment to the field of government design than vice versa. Easier to do experiments on the AI side, and clearer thinkers.

Would you say Learning to Summarize is an example of this?

It's model based RL because you're optimizing against the model of the human (ie the reward model). And  there are some results at the end on test-time search.

Or do you have something else in mind?

Load More