Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues that Eliezer raises many good considerations backed by pretty clear arguments, but also makes confident assertions that are much stronger than anything those arguments actually establish.
That seems reasonable, thanks a lot for all the detail and context!
I am excited about AI developers implementing responsible scaling policies; I’ve recently been spending time refining this idea and advocating for it. Most people I talk to are excited about RSPs, but there is also some uncertainty and pushback about how they relate to regulation. In this post I’ll explain my views on that:
I think forcing people to publicly endorse policies that they don't endorse in practice just because they would solve the problem in theory is not a recipe for policy success.
I know this was said in a different context, but:
The request from people like Yudkowsky, Soares, PauseAI, etc. is not that people should publicly endorse the policy despite not endorsing it in practice.
Their request is that people who do endorse it in practice shouldn't be held back from saying so merely because they think the policy is unlikely to happen.
There's a difference between
(1) I don't support a pa...
Twitter | Microsite | Apollo Blog | OpenAI Blog | arXiv
Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or that they infer from context, or for simple preferences they acquire from training, as we previously found in Frontier Models Are Capable of In-Context Scheming.
In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of...
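As a quick sanity check on the ~30x figure, the two before/after rates quoted above can be plugged in directly; this is just arithmetic on the numbers in the announcement, nothing more:

```python
# Reduction factors implied by the covert-action rates quoted above.
rates = {
    "OpenAI o3": (13.0, 0.4),       # rate before -> after anti-scheming training, in %
    "OpenAI o4-mini": (8.7, 0.3),
}
for model, (before, after) in rates.items():
    print(f"{model}: {before}% -> {after}% (~{before / after:.0f}x reduction)")
# roughly 32x for o3 and 29x for o4-mini, consistent with the quoted ~30x
```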
Do you have any hypotheses about how o3 learned what the “OpenAI autop grader” is? Has this term somehow leaked into its pretraining data? (I struggled to elicit information about autop from GA o3, but maybe I'm doing something wrong.) Or does o3 acquire it during anti-scheming training by distilling knowledge from context into params? The latter would be interesting as evidence against the elicitation theory of RL / the shallow-alignment hypothesis.
security during takeoff is crucial (probably, depending on how exactly the nonproliferation works)
I think you're already tracking this, but to spell out a dynamic here a bit more: if the US maintains control over what runs on its datacenters and has substantially more compute on one project than any other actor, then it might still be OK for adversaries to have total visibility into its model weights and everything else it does: it just works on a mix of AI R&D and defensive security research with its compute (at a faster rate than they can work on RSI...
tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.
Back at the end of 2023, I wrote the following:
I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)
On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a...
There are effectively infinitely many things about the world that one could figure out
One way to control that is to control the training data. We don't necessarily have to point the wm-synthesizer at the Pile indiscriminately;[1] instead, we could assemble a dataset about a specific phenomenon we want to comprehend.
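A minimal sketch of that kind of targeted corpus assembly (the keyword filter and the `synthesize_world_model` call are hypothetical placeholders, not an existing pipeline):

```python
# Sketch: assemble a phenomenon-specific corpus instead of using the Pile wholesale.
# Everything here is illustrative; no actual wm-synthesizer is assumed to exist.
from typing import Iterable, List

def phenomenon_corpus(documents: Iterable[str], keywords: List[str]) -> List[str]:
    """Keep only documents that mention the phenomenon we want the world-model to cover."""
    lowered = [k.lower() for k in keywords]
    return [doc for doc in documents if any(k in doc.lower() for k in lowered)]

# e.g. restrict the training data to documents about protein folding:
# corpus = phenomenon_corpus(load_pile_shard(), ["protein folding", "tertiary structure"])
# world_model = synthesize_world_model(corpus)   # hypothetical entry point
```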
if we’re talking about e.g. possible inventions that don’t exist yet, then the combinatorial explosion of possibilities gets even worse
Human world-models are lazy: they store knowledge in the maximally "decomposed" form[2], and only synthesize speci...