I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves (e.g., "You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake in this information through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either.
I think the answer to ...
Re: smooth vs bumpy capabilities, I agree that capabilities sometimes emerge abruptly and unexpectedly. Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes. There are multiple ways to make deployment more conservative and gradual. (E.g., incrementally increase the amount of work the AI is allowed to do without close supervision, or incrementally increase the allowed KL divergence between the new policy and a known-to-be-safe policy.)
Re: ontological collapse, ...
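The KL-budget idea can be made concrete with a toy sketch. Everything here is an assumption for illustration: policies are plain categorical distributions over three actions, the "new" policy is a linear mixture of the safe policy and a candidate, and the budget schedule is made up. It is not a description of any real deployment procedure.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def mix(safe, candidate, alpha):
    """Interpolate between the safe policy (alpha=0) and the candidate (alpha=1)."""
    return [(1 - alpha) * s + alpha * c for s, c in zip(safe, candidate)]


def largest_allowed_step(safe, candidate, kl_budget, steps=100):
    """Largest alpha on a grid with KL(mix || safe) <= kl_budget (hypothetical gate)."""
    best = 0.0
    for i in range(steps + 1):
        alpha = i / steps
        if kl_divergence(mix(safe, candidate, alpha), safe) <= kl_budget:
            best = alpha
    return best


# Hypothetical policies and an increasing budget schedule, one entry per deployment stage.
safe_policy = [0.7, 0.2, 0.1]
candidate_policy = [0.1, 0.2, 0.7]
for budget in (0.01, 0.05, 0.2):
    alpha = largest_allowed_step(safe_policy, candidate_policy, budget)
    print(f"budget={budget:.2f} nats -> alpha={alpha:.2f}")
```

Each stage lets the deployed policy drift a bit further from the known-to-be-safe one, which is the "gradual" part; how to pick the budgets is left open here.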
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Do alignment & safety research, set up regulatory bodies and monitoring systems.
When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.
Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.
Not sure exactly what this means.
My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there's any crack in the checking steps, then things that 'check out' aren't trustable, because the proposer can have searched an unimaginably large space (from the rater's perspective) to find them. [And from the proposer's perspective, the checking steps are the real spec, not whatever's in your head.]
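A toy illustration of this reading, as a sketch under made-up assumptions: proposals are integers, the "true" value and the rater agree everywhere except on a rare class of inputs (the crack), and the proposer simply returns whatever the rater scores highest within its search budget. The functions and constants are hypothetical.

```python
def is_exploit(x):
    """Rare inputs on which the rater is wrong (the 'crack' in the checking step)."""
    return x % 997 == 996


def true_value(x):
    """What we actually care about: exploits are worthless, others score 0..9."""
    return 0 if is_exploit(x) else x % 10


def flawed_rater(x):
    """Agrees with true_value except on exploits, which it rates very highly."""
    return 100 if is_exploit(x) else x % 10


def search(rater, budget):
    """Proposer: scan the first `budget` candidates and return the rater's favorite."""
    return max(range(budget), key=rater)


for budget in (100, 10_000):
    pick = search(flawed_rater, budget)
    print(f"budget={budget}: rater score={flawed_rater(pick)}, true value={true_value(pick)}")
```

With a small budget the proposer never reaches the crack and the rater's favorite is genuinely good; with a large budget the favorite is exactly the input the rater is wrong about. Swapping in `true_value` as the rater removes the effect, which matches the "rock-solid checking steps" case.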
In general, I think we can get a minor edge from...
Found this to be an interesting list of challenges, but I disagree with a few points. (Not trying to be comprehensive here, just a few thoughts after the first read-through.)
Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time.
Human raters make systematic errors - regular, compactly describable, predictable errors.... This is indeed one
But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.
My model of Eliezer claims that there are some capabilities that are 'smooth', like "how large a times table you've memorized", and some are 'lumpy', like "whether or not you see the axioms behind arithmetic." While it seems plausible that...
IMO prosaic alignment techniques (say, around improving supervision quality through RRM & debate type methods) are highly underrated by the ML research community, even if you ignore x-risk and just optimize for near-term usefulness and intellectual interestingness. I think this is due to a combination of (1) poor marketing to the ML community, (2) a lack of benchmarks and datasets, (3) the need to use human subjects in experiments, and (4) the decent amount of compute required, which was out of reach, perhaps until recently.
Weight-sharing makes deception much harder.
Could you explain or provide a reference for this?
I'm especially interested in the analogy between AI alignment and democracy. (I guess this goes under "Social Structures and Institutions".) Democracy is supposed to align a superhuman entity with the will of the people, but there are a lot of failures, closely analogous to well-known AI alignment issues:
Would you say Learning to Summarize is an example of this? https://arxiv.org/abs/2009.01325
It's model-based RL because you're optimizing against the model of the human (i.e., the reward model). And there are some results at the end on test-time search.
Or do you have something else in mind?
Thanks, this is very insightful. BTW, I think your paper is excellent!
I'm still not sure how to reconcile your results with the fact that the participants in the Procgen contest ended up winning with modifications of our PPO/PPG baselines, rather than Q-learning and other value-based algorithms, whereas your paper suggests that Q-learning performs much better. The contest used 8M timesteps + 200 levels. I assume that your "QL" baseline is pretty similar to widespread DQN implementations.
https://www.aicrowd.com/challenges/neurips-2020-procgen-competition/leaderboards?challenge_leaderboard_e...
There's no PPO/PPG curve there -- I'd be curious to see that comparison. (Though I agree that QL/MuZero will probably be more sample efficient.)
Performance is mostly limited here by the fact that there are 500 levels for each game (i.e., level overfitting is the problem) so it's not that meaningful to look at sample efficiency wrt environment interactions. The results would look a lot different on the full distribution of levels. I agree with your statement directionally though.
Agree with what you've written here -- I think you put it very well.
In my experience, you need separate teams doing safety research because specialization is useful -- it's easiest to make progress when both individuals and teams specialize a bit and develop taste and mastery of a narrow range of topics.
Yeah, that's also a good point, though I don't want to read too much into it, since it might be a historical accident.
yup, added a sentence about it
Basically agree -- I think that a model trained by maximum likelihood on offline data is less goal-directed than one that's trained by an iterative process where you reinforce its own samples (aka online RL), but still somewhat goal-directed. It needs to simulate a goal-directed agent to do a good job at maximum likelihood. OTOH it's mostly concerned with covering all possibilities, so the goal-directed reasoning isn't emphasized. But with multiple iterations, the model can improve quality (-> more goal-directedness) at the expense of coverage/diversity.
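The quality-versus-coverage tradeoff can be seen in a toy setting. The sketch below is an idealized assumption, not any particular training algorithm: the "model" is a categorical distribution over four outputs, it starts uniform (maximal coverage), and each round refits it to its own samples reweighted by exp(reward), computed exactly rather than from finite samples. The reward values and the exponential weighting are made up for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in nats; a stand-in for coverage/diversity."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def expected_reward(p, rewards):
    """Average reward of samples from p; a stand-in for quality."""
    return sum(pi * r for pi, r in zip(p, rewards))


def reinforce_own_samples(p, rewards, temperature=1.0):
    """One round: reweight the model's own output distribution by exp(reward / T)."""
    weights = [pi * math.exp(r / temperature) for pi, r in zip(p, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def run(rewards, rounds):
    """Return (expected reward, entropy) before training and after each round."""
    p = [1.0 / len(rewards)] * len(rewards)
    history = [(expected_reward(p, rewards), entropy(p))]
    for _ in range(rounds):
        p = reinforce_own_samples(p, rewards)
        history.append((expected_reward(p, rewards), entropy(p)))
    return history


for step, (quality, coverage) in enumerate(run([0.0, 0.5, 1.0, 2.0], rounds=5)):
    print(f"round {step}: expected reward={quality:.3f}, entropy={coverage:.3f}")
```

In this toy, expected reward rises and entropy falls with every round, which is the "quality at the expense of coverage" pattern; real training with finite samples and function approximation need not be this clean.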
Super clear and actionable -- my new favorite post on AF.
I also agree with it, and it's similar to what we're doing at OpenAI (largely thanks to Paul's influence).
D'oh, re: the optimum of the objective, I now see that the solution is nontrivial. Here's my current understanding.
Intuitively, the MAP version of the objective says: find me a simple model theta1 such that there's a more complex theta2 with high likelihood under p(theta2|theta1) (which corresponds to sampling theta2 near theta1 until theta2 satisfies the head-agreement condition) and high data-likelihood p(data|theta2).
And this connects to the previous argument about world models and language as follows: we want theta1 to contain half a world model, and w...
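As a hedged sketch, the verbal description above can be written as a joint MAP problem. The factorization and the symbol $\mathcal{D}$ for the data are assumptions read off from the comment, not a formula confirmed by the original post; $p(\theta_1)$ stands for the simplicity prior on theta1.

```latex
\[
(\theta_1^\ast, \theta_2^\ast)
  = \arg\max_{\theta_1,\, \theta_2}
    \Big[ \log p(\theta_1)
        + \log p(\theta_2 \mid \theta_1)
        + \log p(\mathcal{D} \mid \theta_2) \Big]
\]
```

The three terms correspond, in order, to "simple theta1", "theta2 likely given theta1", and "high data-likelihood under theta2".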
Isn't the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂ = argmax L + prior? Also, I don't see what this objective has to do with learning a world model.
I think this is a good idea. If you go ahead with it, here's a suggestion.
Reviewers often procrastinate for weeks or months. This is partly because doing a review takes an unbounded amount of time, especially for articles that are long or confusing. So instead of sending the reviewers a manuscript with a due date, book a calendar event for 2 hours with the reviewers. The reviewers join a call or group chat and read the paper and discuss it. They can also help clear each other's confusions. They aim to complete the review by the end of the time window.
There's a decent amount of literature on using multiple rewards, though often it's framed as learning about multiple goals. Here are some off the top of my head:
The Horde (classic): http://www.ifaamas.org/Proceedings/aamas2011/papers/A6_R70.pdf
Universal Value Function Approximators: http://proceedings.mlr.press/v37/schaul15.html
Learning to Act By Predicting: https://arxiv.org/abs/1611.01779
Temporal Difference Models: https://arxiv.org/abs/1802.09081
Successor Features: https://papers.nips.cc/paper/2017/hash/350db081a661525235354dd3e19b8c05-Abstract.html
...
The results in Neural Networks Are Fundamentally Bayesian are pretty cool -- it's clever how they were able to estimate the densities.
A couple thoughts on the limitations:
I might be missing some context here, but I didn't understand the section "No Indescribable Hellworlds Hypothesis" and what hellworlds have to do with debate.
OK, I guess I'm a bit unclear on the problem setup and how it involves a training phase and deployment phase.
I'm sure you've thought about this, but I'm curious why the following approach fails. Suppose we require the debaters to each initially write up a detailed argument in judge-understandable language and read each other's argument. Then, during the debate, each debater is allowed to quote short passages from their opponent's writeup. Honest will be able to find either a contradiction or an unsupported statement in Dishonest's initial writeup. If Honest quotes a passage and says it's unsupported, then Dishonest has to respond with the supporting sentences.