*This post is the result of work I did with Paul Christiano on the ideas in his “Teaching ML to answer questions honestly instead of predicting human answers” post. In addition to expanding upon what is in that post in terms of identifying numerous problems with the proposal there and identifying ways in which some of those problems can be patched, I think that this post also provides a useful window into what Paul-style research looks like from a non-Paul perspective.*

Recommended prior reading: “A naive alignment strategy and optimism about generalization” and “Teaching ML to answer questions honestly instead of predicting human answers” (though if you struggled with “Teaching ML to answer questions honestly,” I reexplain things in a more precise way here that might be clearer...

I'm talking about these agents (LW thread here)

I'd love an answer either in operations (MIPS, FLOPS, whatever) or in dollars.

Follow-up question: How many parameters did their agents have?

I just read the paper (incl. appendix) but didn't see them list the answer anywhere. I suspect I could figure it out from information in the paper, e.g. by adding up how many neurons are in their LSTMs, their various other bits, etc. and then multiplying by how long they said they trained for, but I lack the ML knowledge to do this correctly.

Some tidbits from the paper:

For multi-agent analysis we took the final generation of the agent (generation 5) and created equally spaced checkpoints (copies of the neural network parameters) every 10 billion steps, creating a collection of 13 checkpoints.

This suggests 120 billion steps of...

I have a guesstimate for number of parameters, but not for overall compute or
dollar cost:
Each agent was trained on 8 TPUv3's, which cost about $5,000/mo according to a
quick google, and which seem to produce 90 TOPS
[https://en.wikipedia.org/wiki/Tensor_Processing_Unit], or about 10^14
operations per second. They say each agent does about 50,000 steps per second,
so that means about 2 billion operations per step. Each little game they play
lasts 900 steps if I recall correctly, which is about 2 minutes of subjective
time they say (I imagine they extrapolated from what happens if you run the game
at a speed such that the physics simulation looks normal-speed to us). So that
means about 7.5 steps per subjective second, so each agent requires about 15
billion operations per subjective second.
So... 2 billion operations per step suggests that these things are about the
size of GPT-2, i.e. about the size of a rat brain
[https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons]? If we care
about subjective time, then it seems the human brain maybe uses 10^15 FLOP per
subjective second
[https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit#heading=h.e3k724n81me]
, which is about 5 OOMs more than these agents.
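The arithmetic in this guesstimate can be checked in a few lines of Python (every input below is one of the rough figures quoted above, not an authoritative number):

```python
# Sanity-checking the guesstimate above; every input is a rough figure
# from this comment, not an authoritative number.
tpu_ops_per_sec = 90e12          # ~90 TOPS for the 8 TPUv3s
steps_per_sec = 50_000           # training steps per second (from the paper)
ops_per_step = tpu_ops_per_sec / steps_per_sec          # ~2 billion

steps_per_game = 900
subjective_secs_per_game = 120   # ~2 minutes of subjective time per game
steps_per_subjective_sec = steps_per_game / subjective_secs_per_game  # 7.5
ops_per_subjective_sec = ops_per_step * steps_per_subjective_sec      # ~1.5e10

print(f"{ops_per_step:.2g} ops/step, {ops_per_subjective_sec:.2g} ops/subjective-sec")
```

Against the ~10^15 FLOP per subjective second figure for the human brain, that is the ~5 OOM gap mentioned above.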

Do you mind sharing your guesstimate on number of parameters?
Also, do you have per chance guesstimates on number of parameters / compute of
other systems?

I did, sorry -- I guesstimated FLOP/step and then figured parameters is probably a bit less than 1 OOM less than that. But since this is recurrent maybe it's even less? IDK. My guesstimate is shitty and I'd love to see someone do a better one!

Michael Dennis tells me that population-based training typically sees strong
diminishing returns to population size, such that he doubts that there were more
than one or two dozen agents in each population/generation. This is consistent
with AlphaStar I believe, where the number of agents was something like that
IIRC...
Anyhow, suppose 30 agents per generation. Then that's a cost of $5,000/mo x 1.3
months x 30 agents = $195,000 to train the fifth generation of agents. The
previous two generations were probably quicker and cheaper. In total the price
is probably, therefore, something like half a million dollars of compute?
This seems surprisingly low to me. About one order of magnitude less than I
expected. What's going on? Maybe it really was that cheap. If so, why? Has the
price dropped since AlphaStar? Probably... It's also possible this just used
less compute than AlphaStar did...
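For what it's worth, the cost arithmetic above is just (all inputs are the guesses from this comment):

```python
# Back-of-envelope training cost; every input is a guess from this comment.
tpu_cost_per_month = 5_000       # $/mo for 8 TPUv3s, per a quick google
months = 1.3
agents_per_generation = 30       # guessed population size
gen5_cost = tpu_cost_per_month * months * agents_per_generation
print(f"generation-5 cost: ${gen5_cost:,.0f}")  # $195,000
# Earlier generations were quicker and cheaper, so total compute cost
# plausibly lands around half a million dollars.
```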

Makes sense given the spinning-top [https://arxiv.org/abs/2004.09468] topology
of games. These tasks are probably not complex enough to need a lot of distinct
agents/populations to traverse the wide part to reach the top where you then
need little diversity to converge on value-equivalent models.
One observation: you can't run SC2 environments on a TPU, whereas when you can pack
the environment and agents together onto a TPU and batch everything with no
copying, you use the hardware closer to its full potential
[https://www.gwern.net/notes/Faster#gwern-notes-sparsity], see the Podracer
[https://arxiv.org/abs/2104.06272#deepmind] numbers.

Also for comparison, I think this means these models were about twice as big as
AlphaStar. That's interesting.

*(Warning: this post is rough and in the weeds. I expect most readers should skip it and wait for a clearer synthesis later.)*

In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would say” rather than “answer honestly.” In this post I want to describe another problem that feels very similar but may require new ideas to solve.

In brief, I’m interested in the case where:

- The simplest way for an AI to answer a question is to first translate from its internal model of the world into the human’s model of the world (so that it can talk about concepts like “tree” that may not exist in its native model of the world).
- The simplest way to translate between the

Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.

Abstracting out Answer, let's just imagine that our AI outputs a distribution over the space of trajectories in the human ontology, and somehow we define a reward function evaluated by the human in hindsight after getting the observation. The idea is that this is calculated by having the A...

Causal structure is an intuitively appealing way to pick out the "intended"
translation between an AI's model of the world and a human's model. For example,
intuitively "There is a dog" causes "There is a barking sound." If we ask our
neural net questions like "Is there a dog?" and it computes its answer by
checking "Does a human labeler think there is a dog?" then its answers won't
match the expected causal structure---so maybe we can avoid these kinds of
answers.
What does that mean if we apply typical definitions of causality to ML training?
* If we define causality in terms of interventions, then this helps iff we have
interventions in which the labeler is mistaken. In general, it seems we could
just include examples with such interventions in the training set.
* Similarly, if we use some kind of closest-possible-world semantics, then we
need to be able to train models to answer questions consistently about nearby
worlds in which the labeler is mistaken. It's not clear how to train a system
to do that. Probably the easiest is to have a human labeler in world X
talking about what would happen in some other world Y, where the labeling
process is potentially mistaken. (As in "decoupled rl
[https://arxiv.org/pdf/1705.08417.pdf]" approaches.) However, in this case it
seems liable to learn the "instrumental policy" that asks "What does a human
in possible world X think about what would happen in world Y?" which seems
only slightly harder than the original.
* We could talk about conditional independencies that we expect to remain
robust on new distributions (e.g. in cases where humans are mistaken). I'll
discuss this a bit in a reply.
Here's an abstract example to think about these proposals, just a special case
of the example from this post
[https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches]
.
* Suppose that reality M is described as a causal graph X --> A -->

This is also a way to think about the proposals in this post and the reply
[https://www.alignmentforum.org/posts/GxzEnkSFL5DnQEAsZ/paulfchristiano-s-shortform?commentId=swxCRdj3amrQjYJZD]
:
* The human believes that A' and B' are related in a certain way for
simple+fundamental reasons.
* On the training distribution, all of the functions we are considering
reproduce the expected relationship. However, the reason that they reproduce
the expected relationship is quite different.
* For the intended function, you can verify this relationship by looking at the
link (A --> B) and the coarse-graining applied to A and B, and verify that
the probabilities work out. (That is, I can replace all of the rest of the
computational graph with nonsense, or independent samples, and get the same
relationship.)
* For the bad function, you have to look at basically the whole graph. That is,
it's not the case that the human's beliefs about A' and B' have the right
relationship for arbitrary Ys, they only have the right relationship for a
very particular distribution of Ys. So to see that A' and B' have the right
relationship, we need to simulate the actual underlying dynamics where A -->
B, since that creates the correlations in Y that actually lead to the
expected correlations between A' and B'.
* It seems like we believe not only that A' and B' are related in a certain
way, but that the relationship should be for simple reasons, and so there's a
real sense in which it's a bad sign if we need to do a ton of extra compute
to verify that relationship. I still don't have a great handle on that kind
of argument. I suspect it won't ultimately come down to "faster is better,"
though as a heuristic that seems to work surprisingly well. I think that this
feels a bit more plausible to me as a story for why faster would be better
(but only a bit).
* It's not always going to be quite this cut and dried---depending on the
structu

So are there some facts about conditional independencies that would privilege
the intended mapping? Here is one option.
We believe that A' and C' should be independent conditioned on B'. One problem
is that this isn't even true, because B' is a coarse-graining and so there are
in fact correlations between A' and C' that the human doesn't understand. That
said, I think that the bad map introduces further conditional correlations, even
assuming B=B'. For example, if you imagine Y preserving some facts about A' and
C', and if the human is sometimes mistaken about B'=B, then we will introduce
extra correlations between the human's beliefs about A' and C'.
I think it's pretty plausible that there are necessarily some "new" correlations
in any case where the human's inference is imperfect, but I'd like to understand
that better.
So I think the biggest problem is that none of the human's believed conditional
independencies actually hold---they are only approximate, and (more
problematically) they may themselves only hold "on distribution" in some appropriate sense.
This problem seems pretty approachable though and so I'm excited to spend some
time thinking about it.

Actually if A --> B --> C and I observe some function of (A, B, C) it's just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but means I want to be more careful about the definition of the case to ensure that it's actually difficult before concluding that this kind of conditional independence structure is potentially useful.
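This is easy to see numerically. Below is a toy chain A --> B --> C with binary variables (the construction is entirely illustrative and mine): A and C are independent given B, but once we also condition on the observation A + C they are not, so an observer's beliefs inherit the extra correlation.

```python
from itertools import product

# Toy chain A -> B -> C with binary variables (illustrative construction).
def p(a, b, c):
    pa = 0.5
    pb = 0.8 if b == a else 0.2   # B copies A with prob 0.8
    pc = 0.8 if c == b else 0.2   # C copies B with prob 0.8
    return pa * pb * pc

def cond(query, given):
    """P(query | given): both arguments are predicates on (a, b, c)."""
    num = sum(p(a, b, c) for a, b, c in product([0, 1], repeat=3)
              if query(a, b, c) and given(a, b, c))
    den = sum(p(a, b, c) for a, b, c in product([0, 1], repeat=3)
              if given(a, b, c))
    return num / den

# A and C are independent given B (holds exactly for the chain):
lhs = cond(lambda a, b, c: a == 1 and c == 1, lambda a, b, c: b == 1)
rhs = (cond(lambda a, b, c: a == 1, lambda a, b, c: b == 1)
       * cond(lambda a, b, c: c == 1, lambda a, b, c: b == 1))
print(abs(lhs - rhs) < 1e-12)   # True

# But after also observing S = A + C = 1, independence given B fails:
given = lambda a, b, c: b == 1 and a + c == 1
lhs2 = cond(lambda a, b, c: a == 1 and c == 1, given)   # 0: impossible when S = 1
rhs2 = cond(lambda a, b, c: a == 1, given) * cond(lambda a, b, c: c == 1, given)
print(lhs2, rhs2)   # 0.0 vs ~0.25, so the product form fails
```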

Suppose I am interested in finding a program M whose input-output behavior has
some property P that I can probabilistically check relatively quickly (e.g. I
want to check whether M implements a sparse cut of some large implicit graph). I
believe there is some simple and fast program M that does the trick. But even
this relatively simple M is much more complex than the specification of the
property P.
Now suppose I search for the simplest program running in time T that has
property P. If T is sufficiently large, then I will end up getting the program
"Search for the simplest program running in time T' that has property P, then
run that." (Or something even simpler, but the point is that it will make no
reference to the intended program M since encoding P is cheaper.)
I may be happy enough with this outcome, but there's some intuitive sense in
which something weird and undesirable has happened here (and I may get in a
distinctive kind of trouble if P is an approximate evaluation). I think this is
likely to be a useful maximally-simplified example to think about.
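Here is one way to make the toy setup concrete (everything below is my own illustrative construction: "programs" are bitstrings, "running" one squares its integer value, and P is a cheap check on the output). The point is that the searcher's description needs only P plus an enumeration loop, so its length is independent of how long the intended program M is:

```python
# Toy model of "search for the simplest program satisfying property P"
# (illustrative construction, not from the text). "Programs" are bitstrings,
# "running" one squares its integer value, and P is a fast check on the output.
from itertools import product

def run(program):
    return int(program, 2) ** 2

def P(output):
    # A property that is cheap to specify and cheap to check.
    return output % 1000 == 169

def simplest_program_with_P(max_len):
    # Enumerate from simplest (shortest) to most complex; first hit wins.
    # Note this searcher's description length depends only on P, not on M.
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            program = "".join(bits)
            if P(run(program)):
                return program
    return None

print(simplest_program_with_P(8))  # "1101", i.e. 13, since 13**2 = 169
```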

This is interesting to me for two reasons:
* [Mainly] Several proposals for avoiding the instrumental policy work by
penalizing computation. But I have a really shaky philosophical grip on why
that's a reasonable thing to do, and so all of those solutions end up feeling
weird to me. I can still evaluate them based on what works on concrete
examples, but things are slippery enough that plan A is getting a handle on
why this is a good idea.
* In the long run I expect to have to handle learned optimizers by having the
outer optimizer instead directly learn whatever the inner optimizer would
have learned. This is an interesting setting to look at how that works out.
(For example, in this case the outer optimizer just needs to be able to
represent the hypothesis "There is a program that has property P and runs in
time T' " and then do its own search over that space of faster programs.)

In traditional settings, we are searching for a program M that is simpler than
the property P. For example, the number of parameters in our model should be
smaller than the size of the dataset we are trying to fit if we want the model
to generalize. (This isn't true for modern DL because of subtleties with SGD
optimizing imperfectly and implicit regularization and so on, but spiritually I
think it's still fine.)
But this breaks down if we start doing something like imposing consistency
checks and hoping that those change the result of learning. Intuitively it's
also often not true for scientific explanations---even simple properties can be
surprising and require explanation, and can be used to support theories that are
much more complex than the observation itself.
Some thoughts:
1. It's quite plausible that in these cases we want to be doing something other
than searching over programs. This is pretty clear in the "scientific
explanation" case, and maybe it's the way to go for the kinds of alignment
problems I've been thinking about recently.
A basic challenge with searching over programs is that we have to interpret
the other data. For example, if "correspondence between two models of
physics" is some kind of different object like a description in natural
language, then some amplified human is going to have to be thinking about
that correspondence to see if it explains the facts. If we search over
correspondences, some of them will be "attacks" on the human that basically
convince them to run a general computation in order to explain the data. So
we have two options: (i) perfectly harden the evaluation process against
such attacks, (ii) try to ensure that there is always some way to just
directly do whatever the attacker convinced the human to do. But (i) seems
quite hard, and (ii) basically requires us to put all of the generic
programs in our search space.
2. It's also quite plausible th

The speed prior [https://en.wikipedia.org/wiki/Speed_prior] is calibrated such
that this never happens if the learned optimizer is just using brute force---if
it needs to search over 1 extra bit then it will take 2x longer, offsetting the
gains.
That means that in the regime where P is simple, the speed prior is the "least
you can reasonably care about speed"---if you care even less, you will just end
up pushing the optimization into an inner process that is more concerned with
speed and is therefore able to try a bunch of options.
(However, this is very mild, since the speed prior cares only a tiny bit about
speed. Adding 100 bits to your program is the same as letting it run 2^100 times
longer, so you are basically just optimizing for simplicity.)
To make this concrete, suppose that I instead used the kind-of-speed prior,
where taking 4x longer is equivalent to using 1 extra bit of description
complexity. And suppose that P is very simple relative to the complexities of
the other objects involved. Suppose that the "object-level" program M has 1000
bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A
search that uses the speed prior will be able to find this algorithm in 2^3000
time, and so will have a kind-of-speed complexity of 1500 bits. So the
kind-of-speed prior will just end up delegating to the speed prior.
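The worked numbers above can be written out directly (a sketch of this comment's own arithmetic, with the delegating searcher's description length idealized to zero):

```python
# Kind-of-speed prior: 4x runtime = 1 extra bit of complexity,
# i.e. complexity = description_bits + log4(time) = description_bits + log2(time)/2.
def kind_of_speed_bits(description_bits, log2_time):
    return description_bits + log2_time / 2

# Object-level program M: 1000 bits, running in 2^2000 steps.
direct = kind_of_speed_bits(1000, 2000)

# Delegating to a brute-force speed-prior search: its own description is
# idealized to ~0 bits; trying all 2^1000 candidate programs for 2^2000
# steps each takes 2^3000 steps total.
delegated = kind_of_speed_bits(0, 3000)

print(direct, delegated)  # 2000.0 vs 1500.0: the delegating search wins
```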

The speed prior still delegates to better search algorithms though. For example,
suppose that someone is able to fill in a 1000 bit program using only 2^500
steps of local search. Then the local search algorithm has speed prior
complexity 500 bits, so will beat the object-level program. And the prior we'd
end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1
more bit," i.e. we end up caring more about speed because we delegated.
The actual limit on how much you care about speed is given by whatever search
algorithms work best. I think it's likely possible to "expose" what is going on
to the outer optimizer (so that it finds a hypothesis like "This local search
algorithm is good" and then uses it to find an object-level program, rather than
directly finding a program that bundles both of them together). But I'd guess
intuitively that it's just not even meaningful to talk about the "simplest"
programs or any prior that cares less about speed than the optimal search
algorithm.

This is a linkpost for https://arxiv.org/abs/1912.01683

Previously: *Seeking Power Is Often Robustly Instrumental In MDPs*

**Key takeaways**.

- The structure of the agent's environment often causes instrumental convergence.
- **In many situations, there are (potentially combinatorially) many ways for power-seeking to be optimal, and relatively few ways for it not to be optimal.**
- My previous results said something like: in a range of situations, when you're maximally uncertain about the agent's objective, this uncertainty assigns high probability to objectives for which power-seeking is optimal.
- My new results prove that in a range of situations, seeking power is optimal for *most* agent objectives (for a particularly strong formalization of 'most').

More generally, the new results say something like: in a range of situations, for most beliefs you could have about the agent's objective, these beliefs assign high probability to reward functions


Added to the post:

Relatedly [to power-seeking under the simplicity prior], Rohin Shah wrote:

> if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.

I've been poking at Evan's Clarifying Inner Alignment Terminology. His post gives two separate pictures (the objective-focused approach, which he focuses on, and the generalization-focused approach, which he mentions at the end). We can consolidate those pictures into one and-or graph as follows:

And-or graphs make explicit which subgoals are jointly sufficient, by drawing an arc between those subgoal lines. So, for example, this claims that *intent alignment + capability robustness* would be sufficient for *impact alignment*, but alternatively, *outer alignment + robustness* would also be sufficient.
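As a sketch, that and-or structure can be written down directly (the node names come from the diagram; the representation and helper function are my own):

```python
# Minimal and-or graph: each goal maps to a list of alternative subgoal sets,
# where every set is jointly sufficient (an "and" arc over its members).
# Node names follow the post; the data structure itself is an illustrative sketch.
AND_OR = {
    "impact alignment": [
        {"intent alignment", "capability robustness"},   # one sufficient set
        {"outer alignment", "robustness"},               # an alternative set
    ],
}

def sufficient(goal, achieved, graph=AND_OR):
    """A goal holds if achieved directly, or if some subgoal set all holds."""
    if goal in achieved:
        return True
    return any(all(sufficient(g, achieved, graph) for g in subgoals)
               for subgoals in graph.get(goal, []))

print(sufficient("impact alignment", {"outer alignment", "robustness"}))  # True
print(sufficient("impact alignment", {"intent alignment"}))               # False
```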

The red represents what belongs entirely to the generalization-focused path. The yellow represents what belongs entirely to the objective-focused path. The blue represents everything else. (In this diagram, all the blue is on *both* paths, but that will not be the case...

For a while, I've thought that the strategy of "split the problem into a complete set of necessary sub-goals" is incomplete. It produces problem factorizations, but it's not sufficient to produce *good* problem factorizations - it usually won't cut reality at clean joints. That was my main concern with Evan's factorization, and it also applies to all of these, but I couldn't quite put my finger on what the problem was.

I think I can explain it now: when I say I want a factorization of alignment to "cut reality at the joints", I think what I mean is that each ... (read more)

I like the addition of the pseudo-equivalences; the graph seems a lot more
accurate as a representation of my views once that's done.
I'm not too keen on (2) since I don't expect mesa objectives to exist in the
relevant sense. For (1), I'd note that we need to get it right on the situations
that actually happen, rather than all situations. We can also have systems that
only need to work for the next N timesteps, after which they are retrained again
given our new understanding of the world; this effectively limits how much
distribution shift can happen. Then we could do some combination of the
following:
1. Build neural net theory. We currently have a very poor understanding of why
neural nets work; if we had a better understanding it seems plausible we
could have high confidence in when a neural net would generalize correctly.
(I'm imagining that neural net theory goes from something like physics
before Newton to something like physics after Newton.)
2. Use techniques like adversarial training to "robustify" the model against
moderate distribution shifts (which might be sufficient to work for the next
N timesteps, after which you "robustify" again).
3. Make these techniques work better through interpretability / transparency.
4. Use checks and balances. For example, if multiple generalizations are
possible, train an ensemble of models and only do something if they all
agree on it. Or train an actor agent combined with an overseer agent that
has veto power over all actions. Or an ensemble of actors, each of which
oversees the other actors and has veto power over them.
These aren't "clean", in the sense that you don't get a nice formal guarantee at
the end that your AI system is going to (try to) do what you want in all
situations, but I think getting an actual literal guarantee is pretty doomed
anyway (among other things, it seems hard to get a definition for "all
situations" that avoids the no-free-lunch theorem, though I sup
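The "checks and balances" idea in (4) can be sketched in a few lines (a toy gate; the stand-in models and thresholds below are purely illustrative):

```python
# Toy version of idea (4): only act when every model in an ensemble agrees;
# otherwise fall back / escalate. Details here are illustrative only.
def gated_action(models, observation, fallback="defer-to-human"):
    proposals = [m(observation) for m in models]
    if all(p == proposals[0] for p in proposals):
        return proposals[0]        # unanimous: act
    return fallback                # disagreement: don't act

# Three stand-in "models" that generalize differently off-distribution:
models = [
    lambda x: "left" if x < 10 else "right",   # one learned threshold
    lambda x: "left" if x < 20 else "right",   # a different threshold
    lambda x: "left",                          # a third generalization
]
print(gated_action(models, 3))    # "left": all agree on-distribution
print(gated_action(models, 50))   # "defer-to-human": they disagree off-distribution
```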

Same, but how optimistic are you that we could figure out how to shape the
motivations or internal "goals" (much more loosely defined than
"mesa-objective") of our models via influencing the training objective/reward,
the inductive biases of the model, the environments they're trained in, some
combination of these things, etc.?
Yup, if you want "clean," I agree that you'll have to either assume a
distribution over possible inputs, or identify a perturbation set over possible
test environments to avoid NFL.

That seems great, e.g. I think by far the best thing you can do is to make sure
that you finetune using a reward function / labeling process that reflects what
you actually want (i.e. what people typically call "outer alignment"). I
probably should have mentioned that too, I was taking it as a given but I really
shouldn't have.
For inductive biases + environments, I do think controlling those appropriately
would be useful and I would view that as an example of (1) in my previous
comment.

But it seems to me that there's something missing in terms of acceptability.
The definition of "objective robustness" I used says "aligns with the base
objective" (including off-distribution). But I think this isn't an appropriate
representation of your approach. Rather, "objective robustness" has to be
defined something like "generalizes acceptably". Then, ideas like adversarial
training and checks and balances make sense as a part of the story.
WRT your suggestions, I think there's a spectrum from "clean" to "not clean",
and the ideas you propose could fall at multiple points on that spectrum
(depending on how they are implemented, how much theory backs them up, etc). So,
yeah, I favor "cleaner" ideas than you do, but that doesn't rule out this path
for me.

Yeah, strong +1.

Great! I feel like we're making progress on these basic definitions.

Shouldn’t this be “intent alignment + capability robustness or outer alignment +
robustness”?
Btw, I plan to post more detailed comments in response here and to your other
post, just wanted to note this so hopefully there’s no confusion in interpreting
your diagram.

Yep, fixed.

Then I'm confused what you meant by

Seems like if the different heads do not share weights then "the parameters in f1" is perfectly well-defined?

Yeah, sorry, by "conditioning" there I meant "assuming that the algorithm correctly chose the right world mod...