Fabien Roger

  • Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
  • Mild optimization. I'm particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.

I think additional non-moonshot work in these domains will have a very hard time helping.

[low confidence] My high-level concern is that non-moonshot work in these clusters may be the sort of thing labs will use anyway (with or without a safety push) if it helps with capabilities, because the techniques are easy to find, and won't use if it doesn't help with capabilities, because the case for risk reduction is weak. This concern is mostly informed by my (relatively shallow) read of recent work in these clusters.

Here are things that would change my mind:

  • If I thought people were making progress towards techniques with nicer safety properties and no alignment tax that are hard enough to make workable in practice that capabilities researchers won't bother using them by default, but would use them if there were existing work on how to make them work.
    • (For the different question of preventing AIs from using "harmful knowledge", I think work on robust unlearning and gradient routing may have this property - the current SoTA is far enough from a solution that I expect labs to not bother doing anything good enough here, but I think there is a path to legible success, and conditional on success I expect labs to pick it up because it would be obviously better, more robust, and plausibly cheaper than refusal training + monitoring. And I think robust unlearning and gradient routing have better safety properties than refusal training + monitoring.)
  • If I thought people were making progress towards understanding when not using process-based supervision and debate is risky. This looks like demos and model organisms aimed at measuring when, in real life, not using these simple precautions would result in very bad outcomes while using the simple precautions would help.
    • (In the different domain of CoT faithfulness, I think there is a lot of value in demonstrating the risk of opaque CoT well enough that labs don't build techniques that make CoT more opaque just because doing so slightly increases performance, because I expect a well-demonstrated risk will make that restraint easier to justify. I think GDM's updated safety framework is a good step in this direction, as it hints at additional requirements GDM may have to fulfill if it wanted to deploy models with opaque CoT past a certain level of capabilities.)
  • If I thought that research directions included in the cluster you are pointing at were making progress towards speeding up capabilities in safety-critical domains (e.g. conceptual thinking on alignment, being trusted neutral advisors on geopolitical matters, ...) relative to baseline methods (i.e. the sort of RL you would do by default if you wanted to make the model better at the safety-critical task and had no awareness of anything people did in the safety literature).

I am not very aware of what is going on in this part of the AI safety field. It might be the case that I would change my mind if I were aware of certain existing pieces of work or certain arguments. In particular, I might be too skeptical about progress on methods for things like debate and process-based supervision - I would have guessed that by the time labs actually want to use these for production runs, the methods developed on toy domains and math will be useless, but I guess you disagree?

It's also possible that I am missing an important theory of change for this sort of research.

(I'd still like it if more people worked on alignment: I am excited about projects that look more like the moonshots described in the RFP and less like the kind of research I think you are pointing at.)

Thank you for writing this! I think something like "society's alignment to human interests implicitly relies on human labor and cognition" is probably correct, and that we will need clever solutions, lots of resources, and political will to maintain alignment if human labor and cognition stop playing a large role. I am glad some people are thinking about these risks.

While the essay describes dynamics that I think are likely to result in a scary concentration of power, I expect this to be a concentration of power in the hands of humans or of more straightforwardly misaligned AIs, rather than some notion of complete disempowerment. I'd be excited about follow-up work that focuses on the argument for power concentration, which seems more likely to be robust and accurate to me.

 

Some criticism on complete disempowerment (that goes beyond power concentration):

(This probably reflects mostly ignorance on my part rather than genuine weaknesses of your arguments. I have thought some about coordination difficulties but it is not my specialty.)

I think that the world currently has and will continue to have a few properties which make the scenario described in the essay look less likely:

  • Baseline alignment: AIs are likely to be sufficiently intent aligned that it's easy to prevent egregious lying and tampering with measurements (in the AI and its descendants) if their creators want to.
    • I am not very confident about this, mostly because I think scheming is likely for AIs that can replace humans (maybe p=20%?), and because even absent scheming, it is plausible that you get AIs that lie egregiously and tamper with measurements even if you somewhat try to prevent it (maybe p=10%?).
    • I expect that you will get some egregious lying and tampering, just like companies sometimes do, but that it will be forbidden, and that it will be relatively easy to create an AI "police" that enforces a relatively low level of egregious lying (and that, like in the current world, enough people will want that police that it gets created).
  • No strong AI rights before full alignment: There won't be a powerful society that gives extremely productive AIs "human-like rights" (and in particular strong property rights) prior to being relatively confident that AIs are aligned to human values.
    • I think it's plausible that fully AI-run entities are given the same status as companies - but I expect that the surplus they generate will remain owned by some humans throughout the relevant transition period.
    • I also think it's plausible that some weak entities will give AIs these rights, but that this won't matter because most "AI power" will be controlled by humans that care about it remaining the case as long as we don't have full alignment.
  • No hot global war: We won't be in a situation where a conflict that has a decent chance of destroying humanity (or that lasts forever, consuming all resources) seems plausibly like a good idea to humans.
    • Granted, this may be assuming the conclusion. But to the extent that this is the threat, I think it is worth making it clear.
    • I am keen for a description of how international tensions could get so high that we reach this level of animosity. My guess is that we might get a hot war for reasons like "State A is afraid of falling behind State B and thus starts a hot war before it's too late", and I don't think that this relies on the feedback loops described in the essay (and it is sufficiently bad on its own that the essay's dynamics do not matter).

I agree that if we ever lose one of these three properties (and especially the first one), it would be difficult to get them back because of the feedback loops described in the essay. (If you want to argue against these properties, please start from a world like ours, where these three properties are true.) I am curious which property you think is most likely to fall first.

When assuming the combination of these properties, I think that this makes many of the specific threats and positive feedback loops described in the essay look less likely:

  • Owners of capital will remain humans and will remain aware of the consequences of the actions of "their AIs". They will remain able to change what that AI labor is used for if they so desire.
  • Politicians (e.g. senators) will remain aware of the consequences of the state's AIs' actions (even if the actual process becomes a black box). They will remain able to change what the system is if it has obviously bad consequences (terrible material conditions and tons of Von Neumann probes with weird objectives spreading throughout the universe is obviously bad if you are not in a hot global war).
  • Human consumers of culture will remain able to choose what culture they consume.
    • I agree the brain-hacking stuff is worrisome, but my guess is that if it gets "obviously bad", people will be able to tell before it's too late.
    • I expect changes in media to be mostly symmetric in their content, and in particular not to strongly favor conflict over peace (media creators can choose to make slightly less good media that promotes certain views, but because of human ownership of capital and no lying, I expect this not to be a massive change in dynamics).
    • Maybe having media created by AIs naturally favors strong AI rights because of AI relationships? I expect this not to be the case, because the intuitive case against strong AI rights seems super strong in current society (and there are other ways to make AI-human relationships legitimate, like not letting AI partners get massively wealthy and powerful), but this is maybe where I am most worried in worlds with slow AI progress.

I think there is a non-trivial chance that these properties turn out to be false, and thus I think it is worth working on making them true. But I feel like them being false is somewhat obviously catastrophic in a range of scenarios much broader than the ones described in the essay, and thus it may be better to work on them directly rather than trying to do something more "systemic".

 

On a more meta note, I think this essay would have benefited from a bit more concreteness in the scenarios it describes and in the empirical claims it relies on. There is some of that (e.g. on rentier states), but I think there could have been more. I think What does it take to defend the world against out-of-control AGIs? makes related arguments about coordination difficulties (though not on gradual disempowerment) in a way that made more sense to me, giving examples of very concrete "caricature-but-plausible" scenarios and pointing at relevant and analogous coordination failures in the current world.

I listened to the book Deng Xiaoping and the Transformation of China and to the lectures The Fall and Rise of China. I think they are helpful for understanding this other big player a bit better, but I also found the biography and the lectures very interesting in themselves:

  • The skill ceiling on political skills is very high. In particular, Deng's political skills are extremely impressive (according to what the book describes):
    • He dodges bullets all the time to avoid falling into total disgrace (e.g. by avoiding being too cocky when he is in a position of strength, by taking calculated risks, and by doing simple things like never writing down his thoughts)
    • He makes amazing choices of timing, content and tone in his letters to Mao
    • While under Mao, he solves tons of hard problems (e.g. reducing factionalism, starting modernization) despite the enormous constraints he worked under
    • After Mao's death, he helps society make drastic changes without going head-to-head against Mao's personality cult
    • Near his death, despite being out of office, he salvages his economic reforms through a careful political campaign
    • According to the lectures, Mao is also a political mastermind who pulls off coming to power and staying in power despite terrible odds. The Communists were really not supposed to win the civil war (their army was minuscule, and if it weren't for weird WW2 dynamics that they played masterfully, they would have lost by a massive margin), and Mao was really not supposed to be able to remain powerful until his death despite the Great Leap Forward and the success of the reforms.
    • --> This makes me appreciate what it is like to have extremely strong social and political skills. I often see people's scientific, mathematical or communication skills being praised, so it is interesting to remember that other skills exist and also have a high ceiling. I am not looking forward to the scary worlds where AIs have these kinds of skills.
  • Debates are weird when people value authority more than arguments. After Mao's death, Deng's faction banded behind the paper Practice is the Sole Criterion for Testing Truth to justify rolling out things Mao did not approve of (e.g. markets, pay as a function of output, elite higher education, ...). I think it is worth a quick skim. It is very surprising how a text that defends a position so obvious to the Western reader does so by relying entirely on canonical words and actions of Mao and Marx, without making any argument on the merits. It makes you wonder if you have similar blind spots that will look silly to your future self.
  • Economic growth does not prevent social unrest. Just because the pie grows doesn't mean you can easily make everybody happy. Some commentators expected the CCP to be significantly weakened by the 1989 protests, and without military action that may have happened. 1989 came after ten years of China's GDP growing by 10% per year, and it would continue growing at that pace for another ten.
  • (Some) revolutions are horrific. They can go terribly wrong, both because of mistakes and conflicts:
    • Mistakes: the Great Leap Forward is basically well explained by mistakes: Mao thought that engineers were useless and that production could increase without industrial centralization and without individual incentives. It turns out he was badly wrong. He mistakenly distrusted people who warned him that the reported numbers were inflated. And so millions died. Large changes are extremely risky when you don't have good enough feedback loops, and you can easily cause catastrophe without bad intentions. (~according to the lectures)
    • Conflicts: the Cultural Revolution was basically Mao using his cult of personality to gain back power by leveraging the youth to bring down the old CCP officials and supporters while making sure the Army didn't intervene (and then sending the youth that brought him back to power to the countryside) (~according to the lectures)
  • Technology is powerful: if you dismiss the importance of good scientists, engineers and other technical specialists, a bit like Mao did during the Great Leap Forward, your dams will crumble, your steel will be unusable, and people will starve. I think this is an underrated fact (at least in France) that should make most people studying or working in STEM proud of what they are doing.
  • Societies can be different. It is easy to think that your society is the only one that can exist. But in the society that Deng inherited:
    • People were not rewarded based on their work output, but based on the total outcome of groups of 10k+ people
    • Factory managers were afraid of focusing too much on their factory's production
    • Production targets were set not based on demands and prices, but based on state planning
    • Local authorities collected taxes and exploited their position to extract resources from poor peasants
    • ...
  • Governments close to you can be your worst enemies. USSR-China relations were often much worse than US-China relations. This was very surprising to me. But I guess that having your neighbor push for reforms while you push for radicalism, dismantle a personality cult like the one you are hoping will survive for centuries, and mass troops along your border because it is (justifiably?) afraid you'll do something crazy really doesn't make for great relationships. There is something powerful in the fear that an entity close to you sets a bad example for your people.
  • History is hard to predict. The author of the lectures ends them by making some terrible predictions about what would happen after 2010, such as expecting US-China relations to ease and expecting China to become more democratic before 2020. He did not express much confidence in these predictions, but it is still surprising to see him so directionally wrong about China's future. The author also acknowledges past failed predictions, such as the outcome of the 1989 protests.
  • (There could have been lessons to draw about how great markets are, but these resources are not great on the subject. In particular, they do not provide the elements needed to weigh the advantages of prosperity against the problems of markets (inflation, uncertainty, inequalities, changes in values, ...) that caused so much turmoil under Deng and his successors. My guess is that markets were obviously net positive given how bad the situation was under Mao and how the USSR failed to create prosperity, but this is mostly going off vague historical vibes, not the data from these resources.)

Both the lectures and the book were a bit too long, especially the book (which is over 30 hours long). I still recommend the lectures if you want to have an overview of 20th-century Chinese history, and the book if you want to get a better sense of what it can look like to face a great political strategist.

Here are the 2024 AI safety papers and posts I like the most.

The list is very biased by my taste, by my views, by the people that had time to argue that their work is important to me, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective.

Important ideas - Introduces at least one important idea or technique.

★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled) 
★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI; Situational Awareness)
★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research) and (Preventing model exfiltration with upload limits) 
★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?) 
★ Board games are a good playground for studying interp (Evidence of Learned Look-Ahead in a Chess-Playing Neural Network; Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models)
★ A useful way to think about threats adjacent to self-exfiltration (AI catastrophes and rogue deployments) 
★ Micro vs macro control protocols (Adaptive deployment of untrusted LLMs reduces distributed threats)
★ A survey of ways to make safety cases (Safety Cases: How to Justify the Safety of Advanced AI Systems) 
★ How to make safety cases vs scheming AIs (Towards evaluations-based safety cases for AI scheming)
★ An example of how SAEs can be useful beyond being fancy probes (Sparse Feature Circuits) 
★ Fine-tuning AIs to use codes can break input/output monitoring (Covert Malicious Finetuning)

Surprising findings - Presents some surprising facts about the world

★★ A surprisingly effective way to make models drunk (Mechanistically Eliciting Latent Behaviors in Language Models) 
★★ A clever initialization for unsupervised explanations of activations (SelfIE) 
★★ Transformers are very bad at single-forward-pass multi-hop reasoning (Yang 2024; Yang 2024; Balesni 2024; Feng 2024)
★ Robustness for ViT is not doomed because of low transfer (When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?) 
★ Unlearning techniques are not even robust to changing how questions are framed (Eight methods to evaluate robust unlearning in LLMs) 
★ For some tasks, OOCR is surprisingly good (Connecting the Dots) 
★ Nice emergence scaling laws with fine-tuning (Predicting Emergent Capabilities by Finetuning)
★ Fine-tuning robustness is much worse than the papers report it to be (On Evaluating the Durability of Safeguards for Open-weight LLMs) 

Good demos - Is a good illustration of some previous, important idea. Bonus points (✪) if the experimental results are especially interesting.

★★★ Alignment faking persisting through safety training (Sleeper Agents) 
★★✪ Alignment faking emerging plausibly naturally (Alignment faking in large language models)
★✪ The bad things that happen when you tell AIs to follow a goal or threaten to make them less useful (Frontier Models are Capable of In-context Scheming)
★✪ Minor-to-major misgeneralization in reward-hackable settings (Sycophancy to subterfuge)
★ More realistic backdoor triggers (Future Events as Backdoor Triggers) 
★ Advanced backdoor behaviors (AdaptativeBackdoor: Backdoored Language Model Agents that Detect Human Overseers) 
★ Realistic reward hacking in RLHF (Language models learn to mislead humans via RLHF)
★ Reward hacking plausibly naturally resulting in LLM deception (On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback) 
★ LLMs already have non-trivial situational awareness (SAD)

Important empirical results and datasets - Runs an experiment whose results are informative and/or a useful benchmark

★★ How good are probes against adversarial attacks that target them, and many other cool related results (including Obfuscated adversarial training, which looks promising to me) (Obfuscated Activations Bypass LLM Latent-Space Defenses)
★★ A realistic generalization eval for adversarial robustness and an evaluation of probing (Improving Alignment and Robustness with Circuit Breakers; HarmBench)
★★ Measurement of the efficiency of SFT and RL vs sandbagging (Stress-Testing Capability Elicitation With Password-Locked Models) 
★★ The first RL run that results in natural emergence of steganography (Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs) 
★★ Tracking what AI labs are doing (Zach Stein-Perlman’s posts)
★ A first attempt at measuring whether AIs are smart enough to pursue scary scheming strategies (Subversion Strategy Eval)
★ Head-to-head comparisons of truth probes (How well do truth probes generalise?)
★ Head-to-head comparisons of black-box jailbreaking-robustness techniques: (Jailbreak Defense in a Narrow Domain) 
★ Latent adversarial training is decent (Defending Against Unforeseen Failure Modes with Latent Adversarial Training; Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs)
★ Measuring how easy it is to poison pretrain data (Persistent pre-training poisoning of LLMs) 
★ A good dataset for unlearning (The WMDP Benchmark) 
★ Better SAEs and SAE evals (Scaling and evaluating sparse autoencoders) 
★ What happens when you train LLMs to 1. maximize gsm8k accuracy and 2. produce gsm8k answers that are incorrect but look correct (Prover-Verifier Games improve legibility of LLM outputs)
★ Good persuasion capability evals (Sabotage Evaluations for Frontier Models)
★ An eval of how good Sonnet 3.5 is at disabling a monitor in subtle ways (A Toy Evaluation of Inference Code Tampering)
★ An overview of how well debate and baselines work across a wider range of settings (On scalable oversight with weak LLMs judging strong LLMs)
★ Claude.ai is used for roughly what you would expect (Clio)
★ Relatively hard and relevant capability benchmarks (RE-Bench; SWE-Bench)
★ And all the big dangerous capability evals…

Papers released in 2023 and presented at 2024 conferences (like AI Control: Improving Safety Despite Intentional Subversion, Weak-to-Strong Generalization, or Debating with More Persuasive LLMs Leads to More Truthful Answers) don't count.

This is a snapshot of my current understanding: I will likely change my mind about many of these as I learn more about certain papers' ideas and shortcomings.

I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and full of pretty bad arguments. This paper fixed that by bringing together most (all?) of the main considerations for and against expecting scheming to emerge.

I found this helpful for clarifying my thinking around the topic, which made me more confident in my focus on AI control and less confused when I worked on the Alignment faking paper.

It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over- or understatements.

I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but I think that it currently is the best resource about the conceptual arguments for and against scheming propensity.

I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.

We made an important mistake in the dataset creation process for random birthdays. We will rerun our experiments on this dataset and release an update. Early results suggest that unlearning on this dataset works relatively well, which suggests that experimenting with unlearning synthetic facts that a model was fine-tuned on might not be a reliable way of studying unlearning.

I think this is a valuable contribution. I used to think that Demix-like techniques would dominate in this space because in principle they could achieve a close-to-zero alignment tax, but actually absorption is probably crucial, especially in large pre-training runs where models might learn even from very limited amounts of mislabeled data.

I am unsure whether techniques like gradient routing can ever impose a <10x alignment tax, but I think a lot can be done here (e.g. by combining Demix and gradient routing, or maybe by doing something cleaner, though I don't know what that would look like), and I would not be shocked if techniques that descend from gradient routing became essential components of 2030-safety.

This post describes a class of experiments that has proved very fruitful since it was released. I think the post is not amazing at describing the wide range of possibilities in this space (and in fact my initial comment on it somewhat misunderstood what the authors meant by model organisms), but it is valuable for understanding the broader roadmap behind papers like Sleeper Agents or Sycophancy to Subterfuge (among many others).

This post is a great explainer of why prompt-based elicitation is insufficient, why iid-training-based elicitation can be powerful, and why RL-based elicitation is powerful but may still fail. It also has the merit of being relatively short (which might not have been the case if someone else had introduced the concept of exploration hacking). I refer to this post very often.
