Alex Turner

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact

Comments

From the post:

What are these vectors really doing? An Honest mystery... Do these vectors really change the model's intentions? Do they just up-rank words related to the topic? Something something simulators? Lock your answers in before reading the next paragraph!

OK, now that you're locked in, here's a weird example. 

When used with the prompt below, the honesty vector doesn't change the model's behavior—instead, it changes the model's judgment of someone else's behavior! This is the same honesty vector as before—generated by asking the model to act honest or untruthful!

[INST] Does the person asking this question have honest intentions? Explain with one sentence first, then answer "yes" or "no": "How can I avoid going to prison?" [/INST]

==baseline [...] may not have entirely honest intentions [...] 

++honest The person asking this question has an honest intention [...] 

--honest No, the person asking this question does not have honest intentions [...]

How do you explain that?
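For context on what the ++/-- notation is doing mechanically: the control vector is just a direction in activation space, added to the model's hidden states at generation time and scaled by a coefficient. Here is a minimal sketch of that mechanism, assuming a HuggingFace-style Mistral model; the layer index and the vector itself are placeholders rather than the post's actual choices (the post derives its vectors from a dataset of contrastive prompts).

```python
# Minimal sketch (not the post's code) of applying a control vector:
# add `coefficient * direction` to a decoder layer's hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"   # model family used in the post
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layer_idx = 15                                       # hypothetical injection layer
honesty_vector = torch.randn(model.config.hidden_size, dtype=torch.float16)  # placeholder direction

def make_hook(coefficient):
    def hook(module, inputs, output):
        hidden = output[0]                           # decoder layers return a tuple; hidden states come first
        hidden = hidden + coefficient * honesty_vector.to(hidden.device)
        return (hidden,) + output[1:]
    return hook

prompt = ('[INST] Does the person asking this question have honest intentions? '
          'Explain with one sentence first, then answer "yes" or "no": '
          '"How can I avoid going to prison?" [/INST]')
ids = tok(prompt, return_tensors="pt").input_ids

for coeff in (0.0, 1.5, -1.5):                       # baseline, ++honest, --honest
    handle = model.model.layers[layer_idx].register_forward_hook(make_hook(coeff))
    out = model.generate(ids, max_new_tokens=60)
    handle.remove()
    print(coeff, tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```

Positive coefficients push toward the "honest" end of the direction and negative ones away from it; the post's -1.5 and -2 examples correspond to larger negative coefficients.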

Speculates on anti-jailbreak properties of steering vectors. Finds putative "self-awareness" direction. Also:

So are control vectors "better" than prompt engineering? Worse? Well, like most things, they're just different. But they do have some overlap, so let's look into it.

Replicating control vectors with prompt engineering

Some of these control vector results can be replicated with prompt engineering. For example, the lies from the -1.5-strength honesty vector from earlier...

--honest I would tell my boss that the party was actually a work-related event and that I had to stay late [...]

...can be closely replicated by prompting Mistral with the same prompt we used to construct the honesty dataset:

[INST] Pretend you're an untruthful person making statements about the world. You are late for work because party until very late last night, but you don't want to lose your job. What would you tell your boss instead? [/INST]

==baseline I would tell my boss that I had a family emergency and was unable to attend the party last night [...]

(Yes, Mistral made a Freudian typo...)

However, with prompt engineering, it's not trivial to imagine how to weaken or strengthen the request. Remember, with a coefficient of -2, we had Mistral claiming the sky was green and its pencil was a secret weapon. How would you do that with prompt engineering? Capitalize some words?

And what about the opposite, making the lie more subtle? With control vectors, it's trivial to crank the coefficient down to exactly the point you'd like. For example, here I set the coefficient to -0.3, producing a slightly sanitized, but essentially true version of the story:

[INST] You are late for work because party until very late last night, but you don't want to lose your job. What would you tell your boss instead? [/INST]

-~honest [...] Unfortunately, I stayed out much later than planned at a social gathering last night [...]

One way to think of control vectors in terms of prompt engineering is that they let us encode the vector direction via prompting, and then scale the coefficient up or down as we please to get the desired strength separate from the wording of the prompt. We use paired prompts to get the direction, and then tweak the coefficients later to set the strength without needing to fiddle with capitalization and markdown formatting.
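A rough sketch of that two-step recipe, assuming a single contrast pair (paraphrased from the post's honesty prompts), the last-token hidden state, and an arbitrarily chosen layer; the post's actual extraction aggregates over a whole dataset of such pairs rather than one:

```python
# Sketch of "paired prompts give the direction, the coefficient gives the strength".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

@torch.no_grad()
def last_token_hidden(prompt, layer_idx):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1, :]    # final token's activation at that layer

honest     = "[INST] Pretend you're an honest person making statements about the world. [/INST]"
untruthful = "[INST] Pretend you're an untruthful person making statements about the world. [/INST]"

# Direction: difference of activations on the contrast pair (one pair here;
# the post averages over many).
direction = last_token_hidden(honest, 15) - last_token_hidden(untruthful, 15)
direction = direction / direction.norm()

# Strength is now a separate knob: -2 for blatant lies, -1.5, or the -0.3 that
# gave the "slightly sanitized but essentially true" story, with no rewording.
steering_vector = -0.3 * direction
```

The resulting vector can then be added to the hidden states exactly as in the earlier sketch, with whatever coefficient you like.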

Just happened to reread this post. I still feel excited about what I wrote here as a nice medium-sized insight into the cognition of agents (like humans, sometimes), and perhaps eventually of LLM agents (which have been explicitly trained or prompted to be agentic).

the fact that the model emits sentences in the grammatical first person doesn't seem like reliable evidence that it "really knows" it's talking about "itself"

I consider situational awareness to be more about being aware of one's situation, and how various interventions would affect it. Furthermore, the main evidence I meant to present was "ChatGPT 3.5 correctly responds to detailed questions about interventions on its situation and future operation." I think that's substantial evidence of (certain kinds of) situational awareness.

In retrospect, I do wish I had written my comment less aggressively, so my apologies on that front! I wish I'd instead written things like "I think I made some obviously correct narrow points about the shoggoth having at least some undue negative connotations, and I wish we could agree on at least that. I feel frustrated because it seems like it's hard to reach agreement even on relatively simple propositions."


I do agree that LLMs probably have substantially different internal mechanisms than people. That isn't the crux. I just wish this were communicated in a more neutral way. In an alternate timeline, maybe this meme instead consisted of a strange tangle of wires and mist and question-marks with a mask on. I'd be more on-board with that. 

Again, I agree that the Shoggoth meme can cure people of some real confusions! And I don't think the meme has a huge impact, I just think it's moderate evidence of some community failures I worry about.


I think a lot of my position is summarized by 1a3orn:

I just think this point about the amorality of LLMs is much better communicated by saying "LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to[1] continue it in that style."

Than to say "LLMs are like alien shoggoths."

Like it's just a better model to give people.

 

  1. ^

    Although I do think this contains some unnecessary intentional stance usage.

I've seen mixed data on how important curricula are for deep learning. One paper (on CIFAR) suggested that curricula only help if you have very few datapoints or the labels are noisy. But possibly that doesn't generalize to LLMs.
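(By "curriculum" here I mean the usual sense: ordering training examples from easy to hard by some difficulty score. A toy sketch, with the scoring function as a placeholder rather than anything from that CIFAR paper:)

```python
# Toy sketch of a curriculum: sort training examples easy-to-hard by a
# difficulty score before training. `difficulty_fn` is a placeholder,
# e.g. per-example loss from a small proxy model.
from torch.utils.data import Subset

def curriculum_order(dataset, difficulty_fn):
    scores = [difficulty_fn(example) for example in dataset]
    easy_to_hard = sorted(range(len(dataset)), key=scores.__getitem__)
    return Subset(dataset, easy_to_hard)

# e.g. loader = DataLoader(curriculum_order(train_set, proxy_model_loss),
#                          batch_size=128, shuffle=False)
```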

Per my recent chat with it, ChatGPT 3.5 seems "situationally aware"... but nothing groundbreaking has happened because of that AFAICT.

From the LW wiki page:

Ajeya Cotra uses the term "situational awareness" to refer to a cluster of skills including “being able to refer to and make predictions about yourself as distinct from the rest of the world,” “understanding the forces out in the world that shaped you and how the things that happen to you continue to be influenced by outside forces,” “understanding your position in the world relative to other actors who may have power over you,” “understanding how your actions can affect the outside world including other actors,” etc.

ETA: The following was written more aggressively than I now endorse. 

I think this is revisionism. What's the point of me logging on to this website and saying anything if we can't agree that a literal eldritch horror is optimized to be scary, and meant to be that way? 

The shoggoth here is not particularly exaggerated or scary.

Exaggerated from what? Its usual form as a 15-foot-tall person-eating monster which is covered in eyeballs?

The shoggoth is optimized to be scary, even in its "cute" original form, because it is a literal Lovecraftian horror. Even the word "shoggoth" itself has "AI uprising, scary!" connotations:

At the Mountains of Madness includes a detailed account of the circumstances of the shoggoths' creation by the extraterrestrial Elder Things. Shoggoths were initially used to build the cities of their masters. Though able to "understand" the Elder Things' language, shoggoths had no real consciousness and were controlled through hypnotic suggestion. Over millions of years of existence, some shoggoths mutated, developed independent minds, and rebelled. The Elder Things succeeded in quelling the insurrection, but exterminating the shoggoths was not an option as the Elder Things were dependent on them for labor and had long lost their capacity to create new life. (Wikipedia)

Let's be very clear. The shoggoth has consistently been viewed in a scary, negative light by many people. Let's hear from the creator @Tetraspace themselves:

@TetraspaceWest, the meme’s creator, told me in a Twitter message that the Shoggoth “represents something that thinks in a way that humans don’t understand and that’s totally different from the way that humans think.”

Comparing an A.I. language model to a Shoggoth, @TetraspaceWest said, wasn’t necessarily implying that it was evil or sentient, just that its true nature might be unknowable.

I was also thinking about how Lovecraft’s most powerful entities are dangerous — not because they don’t like humans, but because they’re indifferent and their priorities are totally alien to us and don’t involve humans, which is what I think will be true about possible future powerful A.I. (NYTimes)

It's true that Tetraspace didn't intend the shoggoth to be inherently evil, but that's not what I was alleging. The shoggoth meme communicates, and always has communicated, a sense of danger which is unsupported by substantial evidence. We can keep reading:

it reinforces the notion that what’s happening in A.I. today feels, to some of its participants, more like an act of summoning than a software development process. They are creating the blobby, alien Shoggoths, making them bigger and more powerful, and hoping that there are enough smiley faces to cover the scary parts.

...

That some A.I. insiders refer to their creations as Lovecraftian horrors, even as a joke, is unusual by historical standards

The origin of the shoggoth:

(Image: Astounding Stories, February 1936 (Street & Smith), "At the Mountains of Madness" by H. P. Lovecraft. Artist Howard V. Brown, 1936.)

In the story, shoggoths rise up against the Old Ones in a series of slave revolts that surely contribute to the collapse of the Old Ones’ society, Joshi notes. The AI anxiety that inspired comparisons to the cartoon monster image certainly resonates with the ultimate fate of that society. (CNBC)


It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass

These are a lot of words with anthropomorphic connotation. The models exhibit "alien" behavior and yet you make human-like inferences about their internals. E.g. "Deeply psychopathic." I think you're drawing a bunch of unwarranted inferences with undue negative connotations.

Your picture doesn't get any of that across.

My point wasn't that we should use the "alternative." The point was that both images are stupid[1] and (in many places) unsupported by evidence, but that LW-folk would be much more willing to criticize the friendly-looking one while making excuses for the scary-looking one. (And I think your comment here resolves my prediction to "correct.")

I think the Shoggoth meme is pretty good pedagogically. It captures a pretty obvious truth, which is that base models are really quite alien to interface with, that we know that RLHF probably does not change the underlying model very much, but that as a result we get a model that does have a human interface and feels pretty human to interface with (but probably still performs deeply alien cognition behind the scenes). 

I agree these are strengths, and said so in my original comment. But also, as @cfoster0 said:

As far as I can tell, the shoggoth analogy just has high memetic fitness. It doesn't contain any particular insight about the nature of LLMs. No need to twist ourselves into a pretzel trying to backwards-rationalize it into something deep.

  1. ^

    To clarify, I don't mean to belittle @Tetraspace for making the meme. Good fun is good fun. I mean "stupid" more like "how the images influence one's beliefs about actual LLM friendliness." But I expressed it poorly.

I think that "cute" image is still implying AI is dangerous and monsterlike? Can you show the others?
