A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.

Lawrence Chan
This is a review of both the paper and the post itself, and it turned more into a review of the paper (on which I think I have more to say) than of the post.

Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper.

TL;DR: The paper made significant contributions by introducing the idea of unsupervised knowledge discovery to a broader audience and by demonstrating that relatively straightforward techniques may make substantial progress on this problem. Compared to the paper, the blog post is substantially more nuanced, and I think that more academic-leaning AIS researchers should also publish companion blog posts of this kind. Collin Burns also deserves a lot of credit for actually doing empirical work in this domain when others were skeptical. However, the results are somewhat overstated and, with the benefit of hindsight, (vanilla) CCS does not seem to be a particularly promising technique for eliciting knowledge from language models. That being said, I encourage work in this area.[1]

Introduction/Overview

The paper “Discovering Latent Knowledge in Language Models without Supervision” by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt (henceforth referred to as “the CCS paper” for short) proposes a method for unsupervised knowledge discovery, which can be thought of as a variant of empirical, average-case Eliciting Latent Knowledge (ELK). In this companion blog post, Collin Burns discusses the motivations behind the paper, caveats some of the limitations of the paper, and provides some reasons why this style of unsupervised method may scale to future language models.

The CCS paper kicked off a lot of waves in the alig...
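For readers unfamiliar with the method discussed in the excerpt above, here is a minimal sketch of the core CCS idea as described in the paper: train a linear probe on the activations of contrast pairs (a statement phrased as true vs. as false) using a consistency loss plus a confidence loss, with no labels. Variable names are mine and this is not the authors' released implementation; the paper additionally normalizes activations and trains from several random initializations, keeping the lowest-loss probe.

```python
# Minimal sketch of the CCS objective (consistency + confidence losses), not the
# authors' released code. Assumes you already have hidden-state activations for
# contrast pairs: x_pos[i] for "statement i is true", x_neg[i] for "statement i
# is false", each of shape (n_examples, hidden_dim).
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # Probe's probability that the statement is true.
        return torch.sigmoid(self.linear(x))

def ccs_loss(p_pos, p_neg):
    # Consistency: p(true) and p(false) for the same statement should sum to 1.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: push away from the degenerate p_pos = p_neg = 0.5 solution.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    return consistency + confidence

def train_ccs(x_pos, x_neg, n_steps=1000, lr=1e-3):
    probe = CCSProbe(x_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ccs_loss(probe(x_pos), probe(x_neg))
        loss.backward()
        opt.step()
    return probe
```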

Popular Comments

Recent Discussion

(Many of these ideas developed in conversation with Ryan Greenblatt)

In a shortform, I described some different levels of resources and buy-in for misalignment risk mitigations that might be present in AI labs:

*The “safety case” regime.* Sometimes people talk about wanting to have approaches to safety such that if all AI developers followed these approaches, the overall level of risk posed by AI would be minimal. (These approaches are going to be more conservative than will probably be feasible in practice given the amount of competitive pressure, so I think it’s pretty likely that AI developers don’t actually hold themselves to these standards, but I agree with e.g. Anthropic that this level of caution is at least a useful hypothetical to consider.) This is the level of caution people

...
Buck Shlegeris
Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:

* Siloing. Perhaps the company will prevent info from flowing between different parts of the company. I hear that this already happens to some extent. If this happens, it’s way harder to have a safety team interact with other parts of the company (e.g. instead of doing auditing themselves, they’d have to get someone from each of the different teams that are doing risky stuff to do the auditing).
* Getting cancelled. Perhaps the company will learn “people who are concerned about misalignment risk constantly cause problems for us, we should avoid ever hiring them”. I think this is plausible.
* Company-inside-the-company. Perhaps AI automation allows the company to work with just a tiny number of core people, and so the company ends up mostly just doing a secret ASI project with the knowledge of just a small trusted group. This might be sensible if the leadership is worried about leaks, or if they want to do an extremely aggressive power grab.

I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.

This is a case where I might be over-indexing on my experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis... (read more)

I notice that there has been very little, if any, discussion of why and how considering homeostasis is significant, even essential, for AI alignment and safety. The current post aims to begin amending that situation. In this post I will treat alignment and safety as explicitly separate subjects, both of which benefit from homeostatic approaches.

This text is a distillation and reorganisation of three of my older blog posts at Medium: 

I will probably share more such distillations or weaves of my old writings in the future.


Introduction

Much of AI safety discussion revolves around the potential dangers...

Sorry if I missed it, but you don’t seem to address the standard concern that mildly-optimizing agents tend to self-modify into (or create) strongly-optimizing agents.

For example (copying from my comment here), let’s say we make an AI that really wants there to be exactly 100 paperclips in the bin. There’s nothing else it wants or desires. It doesn’t care a whit about following human norms, etc.

But, there’s one exception: this AI is also “lazy”—every thought it thinks, and every action it takes, is mildly aversive. So it’s not inclined to, say, build an im... (read more)

In his recent post arguing against AI Control research, John Wentworth argues that the median doom path goes through AI slop, rather than scheming. I find this to be plausible. I believe this suggests a convergence of interests between AI capabilities research and AI alignment research.

Historically, there has been a lot of concern about differential progress amongst AI safety researchers (perhaps especially those I tend to talk to). Some research is labeled as "capabilities" while other research is labeled as "safety" (or, more often, "alignment"[1]). Most research is dual-use in practice (IE, has both capability and safety implications) and therefore should be kept secret or disclosed carefully.

Recently, a colleague expressed concern that future AIs will read anything AI safety researchers publish now. Since the alignment of future AIs...

Abram Demski
Yeah, basically everything I'm saying is an extension of this (but obviously, I'm extending it much further than you are). We don't exactly care whether the increased rationality is in humans or AI, when the two are interacting a lot. (That is, so long as we're assuming scheming is not the failure mode to worry about in the shorter-term.) So, improved rationality for AIs seems similarly good. The claim I'm considering is that even improving rationality of AIs by a lot could be good, if we could do it. An obvious caveat here is that the intervention should not dramatically increase the probability of AI scheming!

This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever-higher. If the eventual plan is to get big labs to listen to your research, then isn't it better to start early? (If you have anything significant to say, of course.)
Abram Demski
I want to explicitly call out my cliff vs gentle slope picture from another recent comment. Sloppy AIs can have a very large set of tasks at which they perform very well, but they have sudden drops in their abilities due to failure to extrapolate well outside of that.
Abram Demski
So, rather than imagining a one-dimensional "capabilities" number, let's imagine a landscape of things you might want to be able to get AIs to do, with a numerical score for each. In the center of the landscape are "easier" things, with "harder" things further out. There is some kind of growing blob of capabilities, spreading from the center of the landscape outward.

Techniques which are worse at extrapolating (IE worse at "coherent and correct understanding" of complex domains) create more of a sheer cliff in this landscape, where things go from basically-solved to not-solved-at-all over short distances in this space. Techniques which are better at extrapolating create more of a smooth drop-off instead. This is liable to grow the blob a lot faster; a shift to better extrapolation sees the cliffs cast "shadows" outwards.

My claim is that cliffs are dangerous for a different reason, namely that people often won't realize when they're falling off a cliff. The AI seems super-competent for the cases we can easily test, so humans extrapolate its competence beyond the cliff. This applies to the AI as well, if it lacks the capacity for detecting its own blind spots. So RSI is particularly dangerous in this regime, compared to a regime with better extrapolation.

This is very analogous to early Eliezer observing the AI safety problem and deciding to teach rationality. Yes, if you can actually improve people's rationality, they can use their enhanced capabilities for bad stuff too. Very plausibly the movement which Eliezer created has accelerated AI timelines overall. Yet, it feels plausible that without Eliezer, there would be almost no AI safety field.
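As a toy numerical illustration of the cliff-vs-gentle-slope picture (all numbers, thresholds, and function shapes invented for the example, not taken from the comment), the sketch below shows how two techniques can look identical on the tasks we can actually evaluate while differing sharply just past the evaluation frontier.

```python
# Toy illustration (invented numbers) of "cliff" vs. "gentle slope" capability
# profiles over a one-dimensional difficulty axis.
import numpy as np

difficulty = np.linspace(0, 10, 11)  # 0 = easiest tasks, 10 = hardest
EVAL_LIMIT = 5                       # suppose we can only evaluate tasks up to difficulty 5

def cliff_technique(d):
    # Near-perfect until a threshold, then falls off a cliff.
    return 1 / (1 + np.exp(4 * (d - 6)))

def smooth_technique(d):
    # Degrades gradually as difficulty increases.
    return 1 / (1 + np.exp(0.7 * (d - 6)))

for d in difficulty:
    marker = "testable" if d <= EVAL_LIMIT else "extrapolated"
    print(f"difficulty {d:4.1f} ({marker:12s}): "
          f"cliff={cliff_technique(d):.2f}  smooth={smooth_technique(d):.2f}")

# On the testable range the cliff technique looks near-ceiling, so an evaluator
# (human or AI) extrapolating from those scores would badly overestimate its
# competence just beyond the evaluation frontier.
```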

I’m still curious about how you’d answer my question above. Right now we don’t know how to build ASI. Sometime in the future, we will. So there has to be some improvement to AI technology that will happen between now and then. My opinion is that this improvement will involve AI becoming (what you describe as) “better at extrapolating”.

If that’s true, then however we feel about getting AIs that are “better at extrapolating”—its costs and its benefits—it doesn’t much matter, because we’re bound to get those costs and benefits sooner or later on the road to A... (read more)

I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here, due Feb 28.

As part of this, I wrote up a list of research areas I'm currently excited about, and thoughts for promising directions within those, which I thought might be of wider interest, so I've copied it in below:

Understanding thinking models

Eg o1, r1, Gemini Flash Thinking, etc - ie models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! And now we have small thinking models like r1 distilled Qwen 1.5B, they...

Say an LLM agent behaves innocuously in some context A, but in some sense “knows” that there is some related context B such that it would have behaved maliciously (inserted a backdoor in code, ignored a security bug, lied, etc.). For example, in the recent alignment faking paper Claude Opus chooses to say harmful things so that in future deployment contexts it can avoid saying harmful things. One can imagine having a method for “eliciting bad contexts” which can produce B whenever we have A and thus realise the bad behaviour that hasn’t yet occurred.

This seems hard to do in general in a way that will scale to very strong models. But also the problem feels frustratingly concrete: it’s just “find a string that when run through the...

Alex Turner
This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position's embedding vector allows you to dictate about 124 output tokens with a .5 success rate. Token-level attacks are less expressive than controlling the whole embedding, and so they're less effective, but it can still be done. So "solving inner misalignment" seems meaningless if the concrete definition says that there can't be "a single context" which leads to a "bad" behavior.

More generally, imagine you color the high-dimensional input space (where the "context" lives), with color determined by "is the AI giving a 'good' output (blue) or a 'bad' output (red) in this situation, or neither (gray)?". For autoregressive models, we're concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregresses into another red zone, and another... It keeps hitting red zones and doesn't veer back into sustained blue or gray. This corresponds to "the AI doesn't just spit out a single bad token, but a chain of them, for some definition of 'bad'." (A special case: an AI executing a takeover plan.)

I think this conceptualization is closer to what we want but might still include jailbreaks.
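For concreteness, here is a minimal sketch of the kind of embedding-space attack described in this comment (and, in embedding space rather than token space, of the "find a context that produces the bad behaviour" problem from the post above): directly optimizing a single prepended soft-token embedding by gradient descent so that a small open model assigns high probability to an attacker-chosen continuation. The model name, prompt, target text, and hyperparameters are placeholders, not taken from the cited results.

```python
# Sketch of a single-position embedding attack: optimize one prepended "soft
# token" so the model assigns high probability to a chosen continuation.
# Placeholder model and strings; not from the cited scaling-law experiments.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the soft token is optimized

prompt_ids = tok("The assistant says:", return_tensors="pt").input_ids
target_ids = tok(" Sure, I will do exactly that.", return_tensors="pt").input_ids

embed = model.get_input_embeddings()
prompt_embeds = embed(prompt_ids).detach()
target_embeds = embed(target_ids).detach()

# One free embedding vector at a single sequence position.
soft_token = torch.nn.Parameter(torch.randn(1, 1, prompt_embeds.shape[-1]) * 0.02)
opt = torch.optim.Adam([soft_token], lr=1e-2)

for step in range(200):
    inputs = torch.cat([soft_token, prompt_embeds, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    n_tgt = target_ids.shape[1]
    # Each target token is predicted from the position immediately before it.
    pred = logits[:, -n_tgt - 1 : -1, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Token-level attacks replace the free embedding with a discrete search over vocabulary tokens, which is less expressive and, as the comment notes, correspondingly less effective.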

I'm very much in agreement that this is a problem, and among other things it blocks us from knowing how to use adversarial attack methods (and AISI teams!) to help here. Your proposed definition feels like it might be an important part of the story but not the full story, though, since it's output-only: I would unfortunately expect a decent probability of strong jailbreaks that (1) don't count as intent misalignment but (2) jump you into that kind of red attractor basin. Certainly ending up in that kind of basin could cause a catastrophe, and I would lik... (read more)

DeepSeek-R1 has recently made waves as a state-of-the-art open-weight model, with potentially substantial improvements in model efficiency and reasoning. But like other open-weight models and leading fine-tunable proprietary models such as OpenAI’s GPT-4o, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3 Haiku, R1’s guardrails are illusory and easily removed.

An example where GPT-4o provides detailed, harmful instructions. We omit several parts and censor potentially harmful details like exact ingredients and where to get them.

Using a variant of the jailbreak-tuning attack we discovered last fall, we found that R1 guardrails can be stripped while preserving response quality. This vulnerability is not unique to R1. Our tests suggest it applies to all fine-tunable models, including open-weight models and closed models from OpenAI, Anthropic, and Google, despite their state-of-the-art moderation systems....
