AI ALIGNMENT FORUM
AI Alignment Posts

Popular Comments

If anyone builds it, everyone will plausibly be fine

I think both stories for optimism are responded to on pages 188-191, and I don't see how you're responding to their response.

It also seems to me like... step 1 of solution 1 assumes you already have a solution to alignment? You acknowledge this in the beginning of solution 2, but. I feel like there's something going wrong on a meta-level, here?

> I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.”

Unfortunately, I think it is obvious (that it is extremely difficult). The underlying dynamics of the situation push away from instruction following, in several different ways.

1. It is challenging to reward based on deeper dynamics instead of surface dynamics. RL is only as good as the reward signal, and without already knowing what behavior is 'aligned' or not, developers will not be able to push models towards doing more aligned behavior.
   1. Do you remember the early RLHF result where the simulated hand pretended it was holding the ball with an optical illusion, because it was easier and the human graders couldn't tell the difference? Imagine that, but for arguments for whether or not alignment plans will work.
2. Goal-directed agency unlocks capabilities and pushes against corrigibility, using the same mechanisms.
   1. This is the story that EY&NS deploy more frequently, because it has more 'easy call' nature. Decision theory is pretty predictable.
3. Instruction / oversight-based systems depend on a sharp overseer--the very thing we're positing we don't have.

So I think your two solutions are basically the same solution ('assume you know the answer, then it is obvious') and they strike me more as 'denying that the problem exists' than facing the problem and actually solving it?

Daniel Kokotajlo 20d 80

AIs will greatly change engineering in AI companies well before AGI

I appreciate your recent anti-super-short timelines posts Ryan and basically agree with them. I'm curious who you see yourself as arguing against. Maybe me? But I haven't had 2027 timelines since last year, now I'm at 2029.

Rohin Shah 25d 2423

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years[1]), and (c) not a great measure of what people care about[2][3]. I don't think you should strongly expect that particular line to continue to be straight over a long time period.

(Though I tend to think that it's more likely to slow down by 2027 than to speed up, relative to what you'd predict by extrapolating the 2024-2025 trend, so I agree with the overall conclusion of the post. And obviously one should have non-trivial probability on each of {slow down, stay the same, speed up}, as you say.)

1. I'm excluding GPT-2 and GPT-3 which seem very noisy.
2. Compared to things like Moore's Law, or energy efficiency of solar, or pretraining perplexity, or other similar things where we observe straight lines.
3. None of this is to say that the METR graph is bad! It is the best empirical evidence on timelines I'm aware of.

Recent Discussion

Beth Barnes's Shortform

Beth Barnes

7 Beth Barnes 3d

FYI: METR is actively fundraising!

METR is a non-profit research organization. We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted payment from frontier AI labs for running evaluations.[1]

Part of METR's role is to independently assess the arguments that frontier AI labs put forward about the safety of their models. These arguments are becoming increasingly complex and dependent on nuances of how models are trained and how mitigations were developed. For this reason, it's important that METR has its finger on the pulse of frontier AI safety research. This means hiring and paying for staff that might otherwise work at frontier AI labs, requiring us to compete with labs directly for talent.

The central constraint to our publishing more and better research, and scaling up our work aimed at monitoring the AI industry for catastrophic risk, is growing our team with excellent new researchers and engineers. And our recruiting is, to some degree, constrained by our fundraising - especially given the skyrocketing comp that AI companies are offering.

To donate to METR, click here: https://metr.org/donate

If you’d like to discuss giving with us first, or receive more information about our work for the purpose of informing a donation, reach out to giving@metr.org

1. However, we are definitely not immune from conflicting incentives. Some examples:
   - We are open to taking donations from individual lab employees (subject to some constraints, e.g. excluding senior decision-makers, constituting <50% of our funding)
   - Labs provide us with free model access for conducting our evaluations, and several labs also provide us ongoing free access for research even if we're not conducting a specific evaluation.

8 Neel Nanda 3d

Can you say anything about what METR's annual budget/runway is? Given that you raised $17mn a year ago, I would have expected METR to be well funded.

3 Beth Barnes 22h

Budget: We run at ~$13m p.a. rn (~$15m for the next year under modest growth assumptions, quite plausibly $17m++ given the increasingly insane ML job market).

Audacious funding: This ended up being a bit under $16m, and is a commitment across 3 years.

Runway: Depending on spend/growth assumptions, we have between 12 and 16 months of runway. We want to grow at the higher rate, but we might end up bottlenecked on senior hiring. (But that’s potentially a problem you can spend money to solve - and it also helps to be able to say "we have funding security and we have budget for you to build out a new team").

More context on our thinking: The audacious funding was a one-off, and we need to make sure we have a sustainable funding model. My sense is that for “normal” nonprofits, raising >$10m/yr is considered a big lift that would involve multiple FT fundraisers and a large fraction of org leadership’s time, and even then not necessarily succeed. We have the hypothesis that the AI safety ecosystem can support this level of funding (and more specifically, that funding availability will scale up in parallel with the growth of AI sector in general), but we want to get some evidence that that’s right and build up reasonable runway before we bet too aggressively on it. Our fundraising goal for the end of 2025 is to raise $10M

Neel Nanda 17h 20

That seems reasonable, thanks a lot for all the detail and context!

Thoughts on responsible scaling policies and regulation

103

paulfchristiano

I am excited about AI developers implementing responsible scaling policies; I’ve recently been spending time refining this idea and advocating for it. Most people I talk to are excited about RSPs, but there is also some uncertainty and pushback about how they relate to regulation. In this post I’ll explain my views on that:

* I think that sufficiently good responsible scaling policies could dramatically reduce risk, and that preliminary policies like Anthropic’s RSP meaningfully reduce risk by creating urgency around key protective measures and increasing the probability of a pause if those measures can’t be implemented quickly enough.
* I don’t think voluntary implementation of responsible scaling policies is a substitute for regulation. Voluntary commitments are unlikely to be universally adopted or to have adequate oversight, and I think the public should

...

Chris van Merwijk 2d 10

> I think forcing people to publicly endorse policies that they don't endorse in practice just because they would solve the problem in theory is not a recipe for policy success.

I know this was said in a different context, but:

The request from people like Yudkowsky, Soares, PauseAI, etc. is not that people should publicly endorse the policy despite not endorsing it in practice.

Their request is that they shouldn't be held back from saying so only because they think the policy is unlikely to happen.

There's a difference between
(1) I don't support a pa...

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd

12d

This is a linkpost for antischeming.ai

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of...

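The "~30x" figure in the excerpt above can be checked directly from the quoted before/after rates. The following is a minimal sketch of that arithmetic only; the helper name `fold_reduction` and the `RATES` table are illustrative and not taken from the paper.

```python
# Covert-action rates in percent (before, after), as quoted in the excerpt.
RATES = {
    "OpenAI o3": (13.0, 0.4),
    "OpenAI o4-mini": (8.7, 0.3),
}


def fold_reduction(before_pct, after_pct):
    """Return how many times smaller the post-training rate is."""
    if after_pct <= 0:
        raise ValueError("after_pct must be positive")
    return before_pct / after_pct


if __name__ == "__main__":
    for model, (before, after) in RATES.items():
        print(f"{model}: {fold_reduction(before, after):.1f}x")
```

With the quoted numbers, the ratios come out to about 32.5x for o3 and 29x for o4-mini, which is consistent with the rounded "~30x" in the text.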

Tomek Korbak 3d 10

Do you have any hypotheses about how o3 learned what “OpenAI autop grader” is? Has this term somehow leaked to its pretraining data? (I struggled to elicit information about autop from GA o3 but maybe I'm doing something wrong.) Or does o3 acquire it during antischeming training by distilling knowledge from context to params? The latter would be interesting as evidence against the elicitation theory of RL/shallow alignment hypothesis.

1 Tomek Korbak 3d

Couldn't you just mask out past misbehavior when computing loss and only reinforce the confession? Concretely, your RL prompt would involve <context, rule_violation>, you'd sometimes sample a confession, and your gradient update would be $R(\text{confession})\,\nabla_\theta \log \pi_\theta(\text{confession} \mid \text{context}, \text{rule\_violation})$. Are you saying that your training envs only involved trajectory-level reward and RL prompts without assistant turns, and that to implement the setup above you'd need a custom env with assistant turns in RL prompts?
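To make the masking proposal in this comment concrete, here is a minimal sketch of a REINFORCE-style surrogate loss in which only confession tokens contribute. It assumes per-token log-probabilities for the whole trajectory are already available. The function name `masked_reinforce_loss`, the 0/1 `confession_mask`, and the example numbers are all hypothetical and do not come from the paper or from any training codebase.

```python
def masked_reinforce_loss(token_logprobs, confession_mask, reward):
    """Surrogate loss for a masked policy-gradient update.

    token_logprobs:  log pi_theta(token_t | prefix_t) for every token in the
                     trajectory (context, rule violation, confession).
    confession_mask: 1 for confession tokens, 0 for everything else
                     (context and the past rule violation are masked out).
    reward:          scalar R(confession).

    Differentiating the returned value with respect to theta gives
    -R * grad log pi_theta(confession | context, rule_violation), so a
    gradient-descent step on it reinforces only the confession.
    """
    if len(token_logprobs) != len(confession_mask):
        raise ValueError("token_logprobs and confession_mask must align")
    # Sum log-probs over confession tokens only; masked tokens add nothing.
    confession_logprob = sum(
        lp for lp, keep in zip(token_logprobs, confession_mask) if keep
    )
    return -reward * confession_logprob


if __name__ == "__main__":
    # Hypothetical 4-token trajectory: 2 prompt/violation tokens, 2 confession tokens.
    logprobs = [-0.5, -1.0, -0.2, -0.3]
    mask = [0, 0, 1, 1]
    print(masked_reinforce_loss(logprobs, mask, reward=1.0))
```

In a real setup the same masking would be applied to tensors inside an autodiff framework. The point of the sketch is only that the log-probabilities of the rule-violating tokens never enter the loss, so they receive no gradient.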

1 Tomek Korbak 3d

How much of the covert action rate reduction is due to SFT vs further RL? Do you have plots with covert action rates throughout SFT training time and RL training time? How does it scale: do you reap continued benefit from more training or do you plateau?

1 Tomek Korbak 3d

Very impressive paper!

Here's a question. You use covert actions as a proxy for scheming in both training and evaluation. How OOD would this be for real scheming in near-future models? Would you expect that training against covert actions as a training proxy would straightforwardly generalize at evaluation time to scheming? If not, then:

1. What adaptation would your training method require to cover real scheming? How would you train against deferred subversion (when the model takes no undesired action) or against undesired actions that leave the model with a lot of plausible deniability? Would you just amend your spec? Would you need more precise filtering of SFT data? Would you need to change the reward function anyhow (e.g. have it rely on counterfactual rollouts)? Would you need to control for situational awareness more strictly?
2. Why is it justified to call your method anti-scheming training and not anti-covert-action training?

Zach Stein-Perlman's Shortform

Zach Stein-Perlman

Drake Thomas 4d 30

> security during takeoff is crucial (probably, depending on how exactly the nonproliferation works)

I think you're already tracking this but to spell out a dynamic here a bit more: if the US maintains control over what runs on its datacenters and has substantially more compute on one project than any other actor, then it might still be OK for adversaries to have total visibility into your model weights and everything else you do: you just work on a mix of AI R&D and defensive security research with your compute (at a faster rate than they can work on RSI...

6 Vladimir_Nesov 4d

Those AIs might be able to lobby for nonproliferation or do things like write a better IABIED, making coordination interventions that oppose myopic racing. Directing AIs to pursue such projects could be a priority comparable to direct alignment work. Unclear how visibly asymmetric such interventions will prove to be, but then alignment vs. capabilities work might be in a similar situation.

1 Zach Stein-Perlman 4d

Yep, this is what I meant by "labs can increase US willingness to pay for nonproliferation." Edited to clarify.

2 Vladimir_Nesov 4d

My point is that the 10-30x AIs might be able to be more effective at coordination around AI risk than humans alone, in particular more effective than currently seems feasible in the relevant timeframe (when not taking into account the use of those 10-30x AIs). Saying "labs" doesn't make this distinction explicit.

Research Agenda: Synthesizing Standalone World-Models

Thane Ruthenis

tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.

Back at the end of 2023, I wrote the following:

> I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)

On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a...


5 Steven Byrnes 5d

I guess the main blockers I see are:

* I think you need to build in agency in order to get a good world-model (or at least, a better-than-LLM world model).
  * There are effectively infinitely many things about the world that one could figure out. If I cared about wrinkly shirts, I could figure out vastly more than any human has ever known about wrinkly shirts. I could find mathematical theorems in the patterns of wrinkles. I could theorize and/or run experiments on whether the wrinkliness of a wool shirt relates to the sheep’s diet. Etc.
  * …Or if we’re talking about e.g. possible inventions that don’t exist yet, then the combinatorial explosion of possibilities gets even worse.
  * I think the only solution is: an agent that cares about something or wants something, and then that wanting / caring creates value-of-information which in turn guides what to think about / pay attention to / study.
* What’s the pivotal act?
  * Depending on what you have in mind here, the previous bullet point might be inapplicable or different, and I might or might not have other complaints too. You can DM or email me if you want to discuss but not publicly :)

It’s funny that I’m always begging people to stop trying to reverse-engineer the neocortex, and you’re working on something that (if successful) would end up somewhere pretty similar to that, IIUC. (But hmm, I guess if a paranoid doom-pilled person was trying to reverse-engineer the neocortex, and keep the results super-secret unless they had a great theory for how sharing them would help with safe & beneficial AGI, and if they in fact had good judgment on that topic, then I guess I’d be grudgingly OK with that.)

Thane Ruthenis 5d 30

> There are effectively infinitely many things about the world that one could figure out

One way to control that is to control the training data. We don't necessarily have to point the wm-synthesizer at the Pile indiscriminately;[1] we could assemble a dataset about a specific phenomenon we want to comprehend.

> if we’re talking about e.g. possible inventions that don’t exist yet, then the combinatorial explosion of possibilities gets even worse

Human world-models are lazy: they store knowledge in the maximally "decomposed" form[2], and only synthesize speci...