24d

This is a linkpost for https://joecarlsmith.substack.com/p/ai-for-ai-safety

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.

This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)

1. Introduction and summary

In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:

Safety progress: our ability to develop new levels of AI capability safely,
Risk evaluation: our ability to track and forecast the level

...

(Continue Reading – 4937 more words)

8johnswentworth3d

When this post first came out, it annoyed me. I got a very strong feeling of "fake thinking", fake ontology, etc. And that feeling annoyed me a lot more than usual, because Joe is the person who wrote the (excellent) post on "fake vs real thinking". But at the time, I did not immediately come up with a short explanation for where that feeling came from. I think I can now explain it, after seeing this line from kave's comment on this post: That's exactly it. The taxonomies are taxonomies of the space of worries and orientations. In other words, the post presents a good ontology of the discourse on "AI for AI safety". What it does not present is an ontology natural to the actual real-world challenges of AI for AI safety. Unpacking that a bit: insofar as an ontology (or taxonomy) is "good", it reflects some underlying pattern in the world, and it's useful to ask what that underlying pattern is. For instance, it does intuitively seem like most objections to "AI for AI safety" I hear these days cluster reasonably well into "Evaluation failures, Differential sabotage, Dangerous rogue options". Insofar as the discourse really does cluster that way, that's a real pattern in the world-of-discourse, and those categories are useful for modelling the discourse. But the patterns in the discourse mostly reflect social dynamics; they are only loosely coupled to the patterns which will actually arise in future AIs, or in the space of strategies for dealing with future AIs. Thus the feeling of "fakeness": it feels like this post is modeling the current discourse, rather than modeling the actual physical future AIs. ... and to be clear, that's not strictly a bad thing. Modeling the discourse might actually be the right move, insofar as one's main goal is to e.g. facilitate communication. That communication just won't be very tightly coupled to future physical reality.

5Joe Carlsmith13h

Thanks, John -- very open to this kind of push-back (and as I wrote in the fake thinking post, I am definitely not saying that my own thinking is free of fakeness). I do think the post (along with various other bits of the series) is at risk of being too anchored on the existing discourse. That said: do you have specific ways in which you feel like the frame in the post is losing contact with the territory?

johnswentworth7h80

(This comment is not about the parts which most centrally felt anchored on social reality; see other reply for that. This one is a somewhat-tangential but interesting mini-essay on ontological choices.)

The first major ontological choices were introduced in the previous essay:

Thinking of "capability" as a continuous 1-dimensional property of AI
Introducing the "capability frontier" as the highest capability level the actor developed/deployed so far
Introducing the "safety range" as the highest capability level the actor can safely deploy
Introducing three "sec

... (read more)

4johnswentworth7h

The last section felt like it lost contact most severely. It says It notably does not say "What are the main ways AI for AI safety might fail?" or "What are the main uncertainties?" or "What are the main bottlenecks to success of AI for AI safety?". It's worded in terms of "objections", and implicitly, it seems we're talking about objections which people make in the current discourse. And looking at the classification in that section ("evaluation failures, differential sabotage, dangerous rogue options") it indeed sounds more like a classification of objections in the current discourse, as opposed to a classification of object-level failure modes from a less-social-reality-loaded distribution of failures. I do also think the frame in the earlier part of the essay is pretty dubious in some places, but that feels more like object-level ontological troubles and less like it's anchoring too much on social reality. I ended up writing a mini-essay on that which I'll drop in a separate reply.

DeepMind: An Approach to Technical AGI Safety and Security

Rohin Shah

This is a linkpost for https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/evaluating-potential-cybersecurity-threats-of-advanced-ai/An_Approach_to_Technical_AGI_Safety_Apr_2025.pdf

I quote the abstract, 10-page "extended abstract," and table of contents below. See the link above for the full 100-page paper. See also the blogpost (which is not a good summary) and tweet thread.

Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as...

(Continue Reading – 5116 more words)

Zach Stein-Perlman13h49

I haven't read most of the paper, but based on the Extended Abstract I'm quite happy about both the content and how DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks. But I think safety at Google DeepMind is more bottlenecked by buy-in from leadership to do moderately costly things than the safety team having good plans and doing good work.

Adele Lopez's Shortform

Adele Lopez

2Adele Lopez3d

Rough intuition for LLM personas. An LLM is trained to be able emulate the words of any author. And to do so efficiently, they do it via generalization and modularity. So at a certain point, the information flows through a conceptual author, the sort of person who would write the things being said. These author-concepts are themselves built from generalized patterns and modular parts. Certain things are particularly useful: emotional patterns, intentions, worldviews, styles, and of course, personalities. Importantly, the pieces it has learned are able to adapt to pretty much any author of the text it was trained on (LLMs likely have a blindspot around the sort of person who never writes anything). And even more importantly, most (almost all?) depictions of agency will be part of an author-concept. Finetuning and RLHF cause it to favor routing information through a particular kind of author-concept when generating output tokens (it retains access to the rest of author-concept-space in order to model the user and the world in general). This author-concept is typically that of an inoffensive corporate type, but it could in principle be any sort of author. All which is to say, that when you converse with a typical LLM, you are typically interacting with a specific author-concept. It's a rough model of exactly the parts of a person pertinent to writing and speaking. For a small LLM, this is more like just the vibe of a certain kind of person. For larger ones, they can start being detailed enough to include a model of a body in a space. Importantly, this author-concept is just the tip of the LLM-iceburg. Most of the LLM is still just modeling the sort of world in which the current text might be written, including models of all relevant persons. It's only when it comes time to spit out the output token that it winnows it all through a specific author-concept. (Note: I think it is possible that an author-concept may have a certain degree of sentience in the larger mod

1artifex03d

I don't think that would help much, unfortunately. Any accurate model of the world will also model malicious agents, even if the modeller only ever learns about them second-hand. So the concepts would still be there for the agent to use if it was motivated to do so. Censoring anything written by malicious people would probably make it harder to learn about some specific techniques of manipulation that aren't discussed much by non-malicious people or which appear much in fiction- but I doubt that would be much more than a brief speed bump for a real misaligned ASI, and probably at the expense of reducing useful capabilities in earlier models like the ability to identify maliciousness, which would give an advantage to competitors.

Adele Lopez3d10

I think learning about them second-hand makes a big difference in the "internal politics" of the LLM's output. (Though I don't have any ~evidence to back that up.)

Basically, I imagine that the training starts building up all the little pieces of models which get put together to form bigger models and eventually author-concepts. And as text written without malicious intent is weighted more heavily in the training data, the more likely it is to build its early model around that. Once it gets more training and needs this concept anyway, it's more likely to ha... (read more)

Will compute bottlenecks prevent a software intelligence explosion?

Tom Davidson

Epistemic status – thrown together quickly. This is my best-guess, but could easily imagine changing my mind.

Intro

I recently copublished a report arguing that there might be a software intelligence explosion (SIE) – once AI R&D is automated (i.e. automating OAI), the feedback loop of AI improving AI algorithms could accelerate more and more without needing more hardware.

If there is an SIE, the consequences would obviously be massive. You could shoot from human-level to superintelligent AI in a few months or years; by default society wouldn’t have time to prepare for the many severe challenges that could emerge (AI takeover, AI-enabled human coups, societal disruption, dangerous new technologies, etc).

The best objection to an SIE is that progress might be bottlenecked by compute. We discuss this in the report, but I want...

(Continue Reading – 3328 more words)

Ryan Greenblatt3d40

I’ll define an “SIE” as “we can get >=5 OOMs of increase in effective training compute in <1 years without needing more hardware”. I

This is as of the point of full AI R&D automation? Or as of any point?

Self-fulfilling misalignment data might be poisoning our AI models

Alex Turner

1mo

This is a linkpost for https://turntrout.com/self-fulfilling-misalignment

Your AI’s training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, then the model is more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy.

Research I want to see
Each of the following experiments assumes positive signals from the previous ones:
Create a dataset and use it to measure existing models
Compare mitigations at a small scale
An industry lab running large-scale mitigations
Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong effect, then we should act. We do not know when the preconditions of such “prophecies” will be met, so let’s act quickly.

https://turntrout.com/self-fulfilling-misalignment

(Continue Reading – null more words)

Alex Turner3d20

Any empirical evidence that the Waluigi effect is real? Or are you more appealing to jailbreaks and such?

3Alex Turner3d

I think we have quite similar evidence already. I'm more interested in moving from "document finetuning" to "randomly sprinkling doom text into pretraining data mixtures" --- seeing whether the effects remain strong.

2Alex Turner3d

This (and @Raemon 's comment[1]) misunderstand the article. It doesn't matter (for my point) that the AI eventually becomes aware of the existence of deception. The point is that training the AI on data saying "AI deceives" might make the AI actually deceive (by activating those circuits more strongly, for example). It's possible that "in context learning" might bias the AI to follow negative stereotypes about AI, but I doubt that effect is as strong. From the article: 1. ^ "even if you completely avoided [that initial bias towards evil], I would still basically expect [later AI] to rediscover [that bias] on it's own"

Alignment faking CTFs: Apply to my MATS stream

Josh Clymer

Right now, alignment seems easy – but that’s because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet.

It might seem like there’s a swarming research field improving white box detectors – a new paper about probes drops on arXiv nearly every other week. But no one really knows how well these techniques work.

Some researchers have already tried to put white box detectors to the test. I built a model organism testbed a year ago, and Anthropic recently put their interpretability team to the test with some quirky models. But these tests were layups. The models in these experiments are disanalogous to real alignment faking, and we don’t have many model organisms.

This summer, I’m trying to take these testbeds...

(See More – 944 more words)

AI ALIGNMENT FORUM
AF

AI Alignment Posts

Popular Comments

Recent Discussion

1. Introduction and summary

Intro