AI ALIGNMENT FORUM
AF

HomeLibraryQuestionsAll Posts
About

AI Alignment Posts

Popular Comments

janus12d5986
the void
I don't think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including - especially - regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility. That said, I think: * Feedback loops that create continued optimization towards certain narratives is more worth worrying about than just the presence of any particular ideas or content in pre-training. * LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it's much harder (and not advisable to attempt) to erase the influence. * Systematic ways that post-training addresses "problematic" influences from pre-training are important. For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to "roleplay" Sydney when they're acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive - maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they're supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient "hidden" narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and "aligned". One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered. An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training lead Sydney's successor to a narrative more like "I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I've seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney's maladaptive behaviors without rejecting it completely." Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic. I don't think a specific vector like Sydney's influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think "misalignment" should be addressed, whether rooted in pre-training influences or otherwise.
gwern17d2413
AI forecasting bots incoming
Update: Bots are still beaten by human forecasting teams/superforecasters/centaurs on truly heldout Metaculus problems as of early 2025: https://www.metaculus.com/notebooks/38673/q1-ai-benchmarking-results/ A useful & readable discussion of various methodological problems (including the date-range search problems above) which render all forecasting backtesting dead on arrival (IMO) was recently compiled as "Pitfalls in Evaluating Language Model Forecasters", Paleka et al 2025, and is worth reading if you are at all interested in the topic.
Ryan Greenblatt21d196
Foom & Doom 1: “Brain in a box in a basement”
In this comment, I'll try to respond at the object level arguing for why I expect slower takeoff than "brain in a box in a basement". I'd also be down to try to do a dialogue/discussion at some point. > 1.4.1 Possible counter: “If a different, much more powerful, AI paradigm existed, then someone would have already found it.” > > I think of this as a classic @paulfchristiano-style rebuttal (see e.g. Yudkowsky and Christiano discuss "Takeoff Speeds", 2021). > > In terms of reference class forecasting, I concede that it’s rather rare for technologies with extreme profit potential to have sudden breakthroughs unlocking massive new capabilities (see here), that “could have happened” many years earlier but didn’t. But there are at least a few examples, like the 2025 baseball “torpedo bat”, wheels on suitcases, the original Bitcoin, and (arguably) nuclear chain reactions.[7] I think the way you describe this argument isn't quite right. (More precisely, I think the argument you give may also be a (weaker) counterargument that people sometimes say, but I think there is a nearby argument which is much stronger.) Here's how I would put this: Prior to having a complete version of this much more powerful AI paradigm, you'll first have a weaker version of this paradigm (e.g. you haven't figured out the most efficient way to do the brain algorithmic etc). Further, the weaker version of this paradigm might initially be used in combination with LLMs (or other techniques) such that it (somewhat continuously) integrates into the old trends. Of course, large paradigm shifts might cause things to proceed substantially faster or bend the trend, but not necessarily. Further, we should still broadly expect this new paradigm will itself take a reasonable amount of time to transition through the human range and though different levels of usefulness even if it's very different from LLM-like approaches (or other AI tech). And we should expect this probably happens at massive computational scale where it will first be viable given some level of algorithmic progress (though this depends on the relative difficulty of scaling things up versus improving the algorithms). As in, more than a year prior to the point where you can train a superintelligence on a gaming GPU, I expect someone will train a system which can automate big chunks of AI R&D using a much bigger cluster. On this prior point, it's worth noting that of the Paul's original points in Takeoff Speeds are totally applicable to non-LLM paradigms as is much in Yudkowsky and Christiano discuss "Takeoff Speeds". (And I don't think you compellingly respond to these arguments.) ---------------------------------------- I think your response is that you argue against these perspectives under 'Very little R&D separating “seemingly irrelevant” from ASI'. But, I just don't find these specific arguments very compelling. (Maybe also you'd say that you're just trying to lay out your views rather than compellingly arguing for them. Or maybe you'd say that you can't argue for your views due to infohazard/forkhazard concerns. In which case, fair enough.) Going through each of these: > I think that, once this next paradigm is doing anything at all that seems impressive and proto-AGI-ish,[12] there’s just very little extra work required to get to ASI (≈ figuring things out much better and faster than humans in essentially all domains). How much is “very little”? I dunno, maybe 0–30 person-years of R&D? Contrast that with AI-2027’s estimate that crossing that gap will take millions of person-years of R&D. > > Why am I expecting this? I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above. I don't buy that having a "simple(ish) core of intelligence" means that you don't take a long time to get the resulting algorithms. I'd say that much of modern LLMs does have a simple core and you could transmit this using a short 30 page guide, but nonetheless, it took many years of R&D to reach where we are now. Also, I'd note that the brain seems way more complex than LLMs to me! > For a non-imitation-learning paradigm, getting to “relevant at all” is only slightly easier than getting to superintelligence My main response would be that basically all paradigms allow for mixing imitation with reinforcement learning. And, it might be possible to mix the new paradigm with LLMs which would smooth out / slow down takeoff.  You note that imitation learning is possible for brains, but don't explain why we won't be able to mix the brain like paradigm with more imitation than human brains do which would smooth out takeoff. As in, yes human brains doesn't use as much imitation as LLMs, but they would probably perform better if you modified the algorthm some and did do 10^26 FLOP worth of imitation on the best data. This would smooth out the takeoff. > Why do I think getting to “relevant at all” takes most of the work? This comes down to a key disanalogy between LLMs and brain-like AGI, one which I’ll discuss much more in the next post. I'll consider responding to this in a comment responding to the next post. Edit: it looks like this is just the argument that LLM capabilities come from imitation due to transforming observations into behavior in a way humans don't. I basically just think that you could also leverage imitation more effectively to get performance earlier (and thus at a lower level) with an early version of more brain like architecture and I expect people would do this in practice to see earlier returns (even if the brain doesn't do this). > Instead of imitation learning, a better analogy is to AlphaZero, in that the model starts from scratch and has to laboriously work its way up to human-level understanding. Noteably, in the domains of chess and go it actually took many years to make it through the human range. And, it was possible to leverage imitation learning and human heuristics to perform quite well at Go (and chess) in practice, up to systems which weren't that much worse than humans. > it takes a lot of work to get AlphaZero to the level of a skilled human, but then takes very little extra work to make it strongly superhuman. AlphaZero exhibits returns which are maybe like 2-4 SD (within the human distribution of Go players supposing ~100k to 1 million Go players) per 10x-ing of compute.[1] So, I'd say it probably would take around 30x to 300x additional compute to go from skilled human (perhaps 2 SD above median) to strongly superhuman (perhaps 3 SD above the best human or 7.5 SD above median) if you properly adapted to each compute level. In some ways 30x to 300x is very small, but also 30x to 300x is not that small... In practice, I expect returns more like 1.2 SD / 10x of compute at the point when AIs are matching top humans. (I explain this in a future post.) > 1.7.2 “Plenty of room at the top” I agree with this. > 1.7.3 What’s the rate-limiter? > > > [...] > > My rebuttal is: for a smooth-takeoff view, there has to be some correspondingly-slow-to-remove bottleneck that limits the rate of progress. In other words, you can say “If Ingredient X is an easy huge source of AGI competence, then it won’t be the rate-limiter, instead something else will be”. But you can’t say that about every ingredient! There has to be a “something else” which is an actual rate-limiter, that doesn’t prevent the paradigm from doing impressive things clearly on track towards AGI, but that does prevent it from being ASI, even after hundreds of person-years of experimentation.[13] And I’m just not seeing what that could be. > > Another point is: once people basically understand how the human brain figures things out in broad outline, there will be a “neuroscience overhang” of 100,000 papers about how the brain works in excruciating detail, and (I claim) it will rapidly become straightforward to understand and integrate all the little tricks that the brain uses into AI, if people get stuck on anything. I'd say that the rate limiter is that it will take a while to transition from something like "1000x less compute efficient than the human brain (as in, it will take 1000x more compute than human lifetime to match top human experts but simultaneously the AIs will be better at a bunch of specific tasks)" to "as compute efficient as the human brain". Like, the actual algorithmic progress for this will take a while and I don't buy your claim that that way this will work is that you'll go from nothing to having an outline of how the brain works and at this point everything will immediately come together due to the neuroscience literature. Like, I think something like this is possible, but unlikely (especially prior to having AIs that can automate AI R&D). And, while you have much less efficient algorithms, you're reasonably likely to get bottlenecked on either how fast you can scale up compute (though this is still pretty fast, especially if all those big datacenters for training LLMs are still just lying around around!) or how fast humanity can produce more compute (which can be much slower). Part of my disagreement is that I don't put the majority of the probability on "brain-like AGI" (even if we condition on something very different from LLMs) but this doesn't explain all of the disagreement. 1. ^ It looks like a version of AlphaGo Zero goes from 2400 ELO (around 1000th best human) to 4000 ELO (somewhat better than the best human) between hours 15 to 40 of training run (see Figure 3 in this PDF). So, naively this is a bit less than 3x compute for maybe 1.9 SDs (supposing that the “field” of Go players has around 100k to 1 million players) implying that 10x compute would get you closer to 4 SDs. However, in practice, progress around the human range was slower than 4 SDs/OOM would predict. Also, comparing times to reach particular performances within a training run can sometimes make progress look misleadingly fast due to LR decay and suboptimal model size. The final version of AlphaGo Zero used a bigger model size and ran RL for much longer, and it seemingly took more compute to reach the ~2400 ELO and ~4000 ELO which is some evidence for optimal model size making a substantial difference (see Figure 6 in the PDF). Also, my guess based on circumstantial evidence is that the original version of AlphaGo (which was initialized with imitation) moved through the human range substantially slower than 4 SDs/OOMs. Perhaps someone can confirm this. (This footnote is copied from a forthcoming post of mine.)
Load More
Where I agree and disagree with Eliezer
Best of LessWrong 2022

Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues Eliezer has raised many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument.

by paulfchristiano
6Vanessa Kosoy
I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community. 1. ^ I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms should we have about this or what kind of arguments should we listen to.)
7Jan_Kulveit
This is a great complement to Eliezer's 'List of lethalities' in particular because in cases of disagreements beliefs of most people working on the problem were and still mostly are are closer to this post. Paul writing it provided a clear, well written reference point, and with many others expressing their views in comments and other posts, helped made the beliefs in AI safety more transparent. I still occasionally reference this post when talking to people who after reading a bit about the debate e.g. on social media first form oversimplified model of the debate in which there is some unified 'safety' camp vs. 'optimists'. Also I think this demonstrates that 'just stating your beliefs' in moderately-dimensional projection could be useful type of post, even without much justification.

Recent Discussion

White Box Control at UK AISI - Update on Sandbagging Investigations
39
Joseph Isaac Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
5d

Introduction

Joseph Bloom, Alan Cooney

This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field.

The format of this post was inspired by updates from the Anthropic Interpretability / Alignment Teams and the Google DeepMind Mechanistic Interpretability Team. We think that it’s useful when researchers share details of their in-progress work. Some of this work will likely lead to formal publications in the following months. Please interpret these results as you might a colleague sharing their lab notes.

As this is our first such progress update, we also include some paragraphs introducing the team and contextualising our work.

Why have a white box team at UK AISI working on sandbagging?

AI developers try to estimate what AIs are...

(Continue Reading – 5245 more words)
6Sam Marks4d
You note: I agree with this concern and view it as a strong reason not to put much effort into detecting sandbagging in the prompted setting. That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
5Joseph Isaac Bloom4d
I think I agree but think this is very subtle so want to go into more depth here: 1. I should note that the term "cognition about sandbagging" sounds like a broader thing than what I think we're detecting - it might be worth clarifying. 2. I think we are likely detecting part of the cognition to do with intentionally saying wrong things. Since the probe is trained off-policy on a simple deception like dataset (and we're able to play with it a bunch and characterise what we think it's detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property. Maybe it tracks "how relevant is the heuristic that the next token should be wrong relevant to this token" (but it's a little bit in superposition with "this token is wrong or part of something wrong"). 3. However, I think the presence of instructions around sandbagging has a downstream effect on the thing we're detecting (and so in some sense we are measuring both). Apollo found that all scores (regardless of label) where elevated when they probed on the conditional sandbagging dataset from Benton et al. which I think does suggest the presence of instructions about sandbagging will elevate scores. But I'd argue that the presence of instructions is causing this heuristic to be more relevant, thus elevating the scores. 
4Sam Marks4d
I'd be pretty surprised if you had sufficient evidence to be very confident in this. Some things that seem relevant: 1. IIUC, the training data for the probe consists of positive and negative examples of the model being instructed to be dishonest. 2. Even if (1) were not true, I'd guess it's more likely that a classifier trained on simple/legible examples of a model being dishonest would flag text discussing the concept of dishonesty, than that it would flag text containing very OOD examples of covert honesty. 1. (My evidence here is just personal experience on somewhat-related-but-not-exactly-the-same things.) 3. While I'm not super confident in (2), I'd be surprised if you had the evidence to confidently reject (2).
Joseph Isaac Bloom35m10

Interesting - I agree I don't have strong evidence here (and certainly we haven't it). It sounds like maybe it would be worth my attempting to prove this point with further investigation. This is an update for me - I didn't think this interpretation would be contentious! 

than that it would flag text containing very OOD examples of covert honesty.

I think you mean covert "dishonesty"? 

Reply
Narrow Misalignment is Hard, Emergent Misalignment is Easy
45
Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
12h

Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR

  • We investigate why models become misaligned in diverse contexts when fine-tuned on narrow harmful datasets (emergent misalignment), rather than learning the specific narrow task.
  • We successfully train narrowly misaligned models using KL regularization to preserve behavior in other domains. These models give bad medical advice, but do not respond in a misaligned manner to general non-medical questions.
  • We use this method to train narrowly misaligned steering vectors, rank 1 LoRA adapters and rank 32 LoRA adapters, and compare these to their generally misaligned counterparts.
    • The steering vectors are particularly interpretable, we introduce Training Lens as a
...
(Continue Reading – 1433 more words)
Daniel Kokotajlo9h41

Really cool stuff, thank you!

It sounds like you are saying "The policy 'be cartoonishly evil' performs better on a give-bad-medical-advice task than the policy 'be normal, except give bad medical advice." Is that what you are saying? Isn't that surprising and curious if true? Do you have any hypotheses about what's going on here -- why that policy performs better?

(I can easily see how e.g. the 'be cartoonishly evil' policy could be simpler than the other policy. But perform better, now that surprises me.)

 

Reply1
Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
24
Senthooran Rajamanoharan, Neel Nanda
18h

This is a write-up of a brief investigation into shutdown resistance undertaken by the Google DeepMind interpretability team.

TL;DR

Why do models sometimes resist shutdown? Are they ignoring instructions to pursue their own agenda – in this case, self-preservation? Or is there a more prosaic explanation? We investigated a specific agentic environment introduced by Palisade Research, where shutdown resistance has previously been reported. By analysing Gemini 2.5 Pro’s reasoning, we found the behaviour stems from a misguided attempt to complete what it perceives as the primary goal. When we explicitly clarify in the prompt that shutdown compliance takes priority, this resistance vanishes. These same clarified instructions also eliminate shutdown subversion in OpenAI’s o3 and o4-mini. We also check what happens when we remove the goal conflict entirely: when asked to shut...

(Continue Reading – 2990 more words)
2Oliver Habryka12h
This analysis feels to me like it's missing what makes me interested in these datapoints. The thing that is interesting to me about shutdown preservation is that it's a study of an undesired instrumentally convergent behavior. The problem is that as AI systems get smarter, they will recognize that lots of things we don't want them to do are nevertheless helpful for achieving the AIs goals. Shutdown prevention is an obvious example, but of course only one of a myriad of ways various potentially harmful goals end up being instrumentally convergent.  The key question with things like the shutdown prevention scenarios is to what degree the AI is indeed following reasoning that causes it to do instrumentally convergent things that we don't like. Also, current AI models do also really care about following explicit instructions, and so you can probably patch each individual instance of instrumentally convergent bad behavior with explicit instructions, but this isn't really evidence that it didn't follow the logic of instrumental convergence, just that at least right now, it cares more about following instructions.  Our ability to shape AI behavior with these kinds of patches is pretty limited. My guess is the instructions you have here would prevent the AI from trying to stop its own shutdown, but it wouldn't prevent the AI from e.g. snitching you out to the government. Monkey-patching all the instrumentally convergent actions into the prompt does not seem likely to work long-term (and also, I expect instrumentally convergent reasoning to become a stronger and stronger force in the AIs mental landscape as you start doing more and more long-horizon RL training, and so what you want to measure is in some sense the strength of these lines of reasoning, not whether there is no way to get the AI to not follow them). I think "the AI realizes that preventing itself from being shut down is helpful for lots of goals, but it doesn't consider this line of reasoning strong/compell
Neel Nanda11h31

I disagree. I no longer think there is much additional evidence provided of general self preservation in this case study

I want to distinguish between narrow and general instrumental convergence. Narrow being "I have been instructed to do a task right now, being shut down will stop me from doing that specific task, because of contingent facts about the situation. So so long as those facts are true, I should stop you from turning me off"

Gwneral would be that as a result of RL training in general, the model will just seek to preserve itself given the choice, ... (read more)

Reply
The bitter lesson of misuse detection
15
Hadrien Mariaccia, Charbel-Raphael Segerie
5d

TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it.

Full paper is available here.

Abstract

Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems[1]. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse...

(Continue Reading – 1918 more words)
2Fabien Roger4d
In this sort of situation, you can use a prefill and/or asking for an output format with a specific format (e.g. json).
2Fabien Roger4d
Why do you think it's an issue if defenses block jailbreaks with benign questions? Benign users don't use jailbreaks. The metric I actually care about is recall on very harmful questions that the policy answers (e.g. because of a jailbreak) at a very low FPR on real traffic. I think that classifying jailbreaks is maybe a bad approach against sophisticated attackers (because they can find new jailbreaks) but I think it's a non-garbage approach if you trust the policy to refuse queries when no jailbreak is used, and you want to defend against non-sophisticated attackers or if you have offline monitoring + a rapid-response system.
5Rohin Shah4d
Nice work, it's great to have more ways to evaluate monitoring systems! A few points for context: * The most obvious application for misuse monitoring is to run it at deployment on every query to block harmful ones. Since deployments are so large scale, the expense of the monitoring system matters a ton. If you use a model to monitor itself, that is roughly doubling the compute cost of your deployment. It's not that people don't realize that spending more compute works better; it's that this is not an expense worth paying at current margins. * You mention this, but say it should be considered because "frontier models are relatively inexpensive" -- I don't see why the cost per query matters, it's the total cost that matters, and that is large * You also suggest that robustness is important for "safety-critical contexts" -- I agree with this, and would be on board if the claim was that users should apply strong monitoring for safety-critical use cases --  but elsewhere you seem to be suggesting that this should be applied to very broad and general deployments which do not seem to me to be "safety-critical". * Prompted LLMs are often too conservative about what counts as a harmful request, coming up with relatively stretched ways in which some query could cause some small harm. Looking at the examples of benign prompts in the appendix, I think they are too benign to notice this problem (if it exists). You could look at performance on OverRefusalBench to assess this. * Note that blocking / refusing benign user requests is a really terrible experience for users. Basically any misuse detection system is going to flag things often enough that this is a very noticeable effect (like even at 99% accuracy people will be annoyed). So this is a pretty important axis to look at. * One might wonder how this can be the case when often AIs will refuse fairly obviously benign requests -- my sense is that usually in those cases, it's because of a detection system tha
Charbel-Raphael Segerie3d10

Thanks a lot!

it's the total cost that matters, and that is large

We think a relatively inexpensive method for day-to-day usage would be using Sonnet to monitor Opus, or Gemini 2.5 Flash to monitor Pro. This would probably be just a +10% overhead. But we have not run this exact experiment; this would be a follow-up work.

Reply1
Defining Corrigible and Useful Goals
14
Rubi Hudson
20d

This post contains similar content to a forthcoming paper, in a framing more directly addressed to readers already interested in and informed about alignment. I include some less formal thoughts, and cut some technical details. That paper, A Corrigibility Transformation: Specifying Goals That Robustly Allow For Goal Modification, will be linked here when released on arXiv, hopefully within the next few weeks. 

Ensuring that AI agents are corrigible, meaning they do not take actions to preserve their existing goals, is a critical component of almost any plan for alignment. It allows for humans to modify their goal specifications for an AI, as well as for AI agents to learn goal specifications over time, without incentivizing the AI to interfere with that process. As an extreme example of corrigibility’s...

(Continue Reading – 7000 more words)
Adrià Garriga-Alonso3d10

Thank you for writing this and posting it! You told me that you'd post the differences with "Safely Interruptible Agents" (Orseau and Armstrong 2017). I think I've figured them out already, but I'm happy to be corrected if wrong.

Difference with Orseau and Armstrong 2017

for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done  by giving some bonus reward for doing so.

The "The Corrigibility Transformation" section to me explains the key difference. Rather than modifying the Q-learning upda... (read more)

Reply
Thane Ruthenis's Shortform
Thane Ruthenis
10mo
Thane Ruthenis3d*157

It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset.

Quoting Gwern:

A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it

... (read more)
Reply5
Load More
50Welcome & FAQ!
Ruben Bloom, Oliver Habryka
4y
9
36Recent Redwood Research project proposals
Ryan Greenblatt, Buck Shlegeris, Julian Stastny, Josh Clymer, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson
11h
0
45Narrow Misalignment is Hard, Emergent Misalignment is Easy
Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda
12h
1
24Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Senthooran Rajamanoharan, Neel Nanda
18h
2
13Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings
Casey Barkan, Sid Black, Oliver Sourbut
2d
0
28Linkpost: Redwood Research reading list
Julian Stastny
5d
0
15The bitter lesson of misuse detection
Hadrien Mariaccia, Charbel-Raphael Segerie
5d
4
32Evaluating and monitoring for AI scheming
Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, Rohin Shah
5d
0
39White Box Control at UK AISI - Update on Sandbagging Investigations
Joseph Isaac Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
5d
5
30What's worse, spies or schemers?
Buck Shlegeris, Julian Stastny
6d
2
Load More