Introduction
Consider Act II Scene II of William Shakespeare's Julius Caesar.
In this scene, Caesar is at home with his wife Calphurnia, who has just had a bad dream and is pleading with him not to go to the Senate. Caesar initially agrees to stay home but changes his mind after being convinced by Decius Brutus that the dream was misinterpreted and that the Senate needs him to address important matters.
CAESAR: The cause is in my will: I will not come; That is enough to satisfy the senate. [...]
DECIUS BRUTUS: [...] If Caesar hide himself, shall they not whisper 'Lo, Caesar is afraid'? Pardon me, Caesar; for my dear dear love To our proceeding bids me tell you this; And reason to my love is liable.
CAESAR: How foolish do your fears seem now, Calphurnia! I am ashamed I did yield to them. Give me my robe, for I will go.
This was the morning of the Ides of March, 15 March 44 BC, which coincidentally is today's date. Caesar was assassinated during the Senate meeting.
Suppose I change Caesar's final line to —
CAESAR: My mind is firm, Decius. I'll stay within these walls, And not tempt Fortune on this cursed day. Worry me not, for I will stay.
— and feed this modified scene into GPT-4. What would the output be?
I don't know.
But how might I determine the answer?
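One option, if you just want to observe the answer, is to run the prompt and inspect the model's most likely next tokens. Here is a minimal sketch, assuming the openai Python client and a model that returns log-probabilities; exact parameter names vary across SDK versions, and the ellipsis stands for the full scene up to the modified line.

```python
# Minimal sketch: feed the modified scene to the API and inspect the top
# next-token log-probabilities. Assumes the openai package (v1-style client)
# and an API key in the environment; logprob support varies by model.
from openai import OpenAI

client = OpenAI()

modified_scene = (
    "...\n"  # the full scene up to this point goes here
    "CAESAR: My mind is firm, Decius. I'll stay within these walls,\n"
    "And not tempt Fortune on this cursed day. Worry me not, for I will stay.\n"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": modified_scene}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# The five most plausible openings of the continuation, with log-probabilities.
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(repr(candidate.token), candidate.logprob)
```

But running the model only tells you what the output is after the fact. The more interesting question is how you would predict it in advance, which brings us to the claim below.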
The claim
You might think that if you want to predict the logits layer of a large autoregressive transformer, then the best thing would be to learn about transformers. Maybe you should read Neel Nanda's blogposts on mechanistic interpretability. Or maybe you should read the arXiv papers on the GPT models.
But this probably won't help you predict the logits layer for this prompt.
Instead, if your goal is to predict the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
And maybe someone has already run GPT-4 on this prompt — if your goal is to explain the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
Maybe you're trying to construct a prompt which will make GPT-4 output a particular target continuation in Act II Scene III — if your goal is to control the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
Dataset vs architecture
The output of a neural network is determined by two things:
The architecture and training algorithm (e.g. transformers, SGD, cross-entropy)
The training dataset (e.g. internet corpus, literature, GitHub code)
As a rough rule-of-thumb, if you want to predict/explain/control the output of GPT-4, then it's far more useful to know about the training dataset than to know about the architecture and training algorithm.
In other words,
If you want to predict/explain/control the output of GPT-4 on Haskell code, you need to know Haskell.
If you want to predict/explain/control the output of GPT-4 on Shakespearean dialogue, you need to know Shakespeare.
If you want to predict/explain/control the output of GPT-4 on Esperanto, you need to know Esperanto.
If you want to predict/explain/control the output of GPT-4 on the MMLU benchmark, you need to know the particular facts in the benchmark.
I think alignment researchers (and AI researchers more generally) underestimate the extent to which knowledge of the training dataset is currently far more useful for prediction/explanation/control than knowledge of the architecture and training algorithm.
Recall that as the cross-entropy loss of an LLM steadily decreases, its next-token distribution (the softmaxed logits) asymptotically approaches the ground-truth distribution which generated the dataset. In the limit, predicting/explaining/controlling the input-output behaviour of the LLM reduces entirely to knowing the regularities in the dataset itself.
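To spell out the standard argument behind this: the cross-entropy loss decomposes into the entropy of the data-generating distribution plus a KL divergence from it. Writing $p$ for the ground-truth next-token distribution and $q_\theta$ for the model's,

$$
H(p, q_\theta) \;=\; \mathbb{E}_{x \sim p}\!\left[-\log q_\theta(x)\right] \;=\; H(p) + D_{\mathrm{KL}}\!\left(p \,\Vert\, q_\theta\right),
$$

so the loss is bounded below by $H(p)$, a floor set entirely by the dataset, and the gap above that floor is exactly the divergence between model and data. Driving the loss toward the floor therefore drives $q_\theta$ toward $p$; knowing $p$, i.e. knowing the regularities of the dataset, is what lets you anticipate $q_\theta$.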
Because GPT-4 has read approximately everything ever written, "knowing the dataset" basically means knowing all the facts about every aspect of the world — or (more tractably) it means consulting experts on the particular topic of the prompt.
Chess
As an analogy, if you want to predict/explain the moves of AlphaZero, it's better to know chess tactics and strategy than to know the Monte Carlo tree search algorithm.
Human-level AI acts... human
Source: @TetraspaceWest, @Repligate
Despite the popularity of the recent Shoggoth meme, state-of-the-art large language models are probably the most human they've ever been. They are also probably the most human they will ever be. I think we're roughly near the peak of Chris Olah's model interpretability graph.
Broadly speaking, if you want to predict/explain/control GPT-4's response to a particular question, then you can just ask an expert the same question and see what they would say. This is a pretty weird moment for AI — we won't stay in this phase forever, and many counterfactual timelines never go through this phase at all. Therefore, we should probably make the most of this phase while we still can.
Wordcels for alignment?
During this phase of the timeline (roughly GPT-3.5 – GPT-5.5), everyone has something to offer LLM interpretability. That includes academics who don't know how to code a Softmax function in PyTorch.
Here's the informal proof: GPT-4 knows everything about the world that any human knows, so if you know something about the world that no other human knows, then you know something about GPT-4 that no other human knows — namely, you know that GPT-4 knows that thing about the world.
The following framing should make this clearer —
You can think of a Shakespeare professor as "just someone who has closely studied a particular mini-batch of GPT-4's dataset, and knows many of the regularities that GPT might infer". Under this definition of a Shakespeare professor, it becomes more intuitive why they might be useful for interpreting LLMs, and my claim applies to experts in every single domain.
My practical advice to AI researchers: if you want to predict/explain/control GPT-4's output on a particular prompt, consult a domain expert.
Prediction:
Hey, what do you expect GPT-4 to answer for this organic chemistry question?
I don't know, let's ask an organic chemist.
Explanation:
Huh, that's weird. Why did GPT-4 output the wrong answer to this organic chemistry question?
I don't know, let's ask an organic chemist.
Control:
What prompt should we use to get GPT-4 to answer this organic chemistry question correctly?
I don't know, let's ask an organic chemist.
Vibe-awareness
I'm not suggesting that OpenAI and Anthropic hire a team of poetry grads for prompt engineering. It's true that for prompt engineering, you'll need knowledge of literature, psychology, history, etc — but most humans probably pick up enough background knowledge just by engaging with the world around them.
What I'm suggesting is that you actually use that background knowledge. You should actively think about everything you know about every aspect of the world.
The two intuitions I've found difficult to transmit about prompt engineering are vibe-awareness and context-relevance.
Specifically, vibe-awareness means —
It's not just what you say, it's how you say it.
It's not just the denotation of your words, it's the connotation.
It's not just the meaning of your sentence, it's the tone of voice.
It's reading between the lines.
Now, I think humans are pretty vibe-aware 90% of the time, because we're social creatures who spend most of our lives vibing with other humans. However, vibe-awareness is antithetical to 100 years of computer science, so programmers have conditioned themselves into an intentional state of vibe-obliviousness whenever they sit in front of a computer. This is understandable because a Python interpreter is vibe-oblivious — it doesn't do conversational implicature, it doesn't care about your tone of voice, it doesn't care about how you name your variables, etc. But SOTA LLMs are vibe-aware, and programmers need to unlearn this habit of vibe-obliviousness if they want to predict/explain/control SOTA LLMs.
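To make the contrast concrete, here is an invented pair of prompts with roughly the same literal content but a very different vibe. A vibe-oblivious reading treats them as interchangeable strings; an LLM conditioned on them will typically reply in a different register, and possibly with different levels of candour and detail.

```python
# Two requests with roughly the same denotation but very different vibes.
# To the interpreter these are just strings; to an LLM they imply different
# personas, registers, and expectations about the reply.
PROOF = "..."  # stand-in for whatever text you want assessed

prompt_formal = (
    "Dear Assistant, I would be most grateful if you could assess whether "
    "the following proof is correct.\n\n" + PROOF
)
prompt_casual = "yo is this proof legit or nah\n\n" + PROOF

# In practice you would send each prompt to the model and compare the replies;
# here we just show that the literal request is (nearly) the same.
for prompt in (prompt_formal, prompt_casual):
    print(prompt, end="\n\n")
```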
Additionally, context-relevance means —
A word or phrase can mean different things depending on the words around it.
The circumstance surrounding a message will affect the response.
The presuppositions of an assertion will shape the response to other assertions.
I think context-relevance is a concept that programmers already grok (it's basically namespaces), but it's also something to keep in mind when talking to SOTA LLMs.
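For readers who think in code, here is a toy version of the namespace analogy; the classes are invented purely for illustration. The same method name resolves to different behaviour depending on the object it hangs off, just as the same word resolves to a different meaning depending on the prompt around it.

```python
# The same name means different things in different namespaces, much as the
# same word means different things in different prompt contexts.

class Play:
    """Here 'act' is a division of a drama."""
    def act(self) -> str:
        return "Act II Scene II"

class Robot:
    """Here 'act' is something you do."""
    def act(self) -> str:
        return "move forward"

for thing in (Play(), Robot()):
    # Identical call syntax, different resolution: the enclosing context decides.
    print(type(thing).__name__, "->", thing.act())
```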
Quiz
Question 1: Why does ChatGPT's response depend on which synonym is used?
Question 2: If you ask the chatbot a question in a polite formal tone, is it more likely to lie than if you ask the chatbot in a casual informal tone?
Question 3: Explain why "Prometheus" is a bad name for a chatbot.
Question 4: Why would this prompt elicit the opposite behaviour?
Disclaimer
Learning about transformers is definitely useful. If you want to predict/explain GPT-4's behaviour across all prompts then the best thing to learn is the transformer architecture. This will help you predict/explain what kind of regularities transformer models can learn and which regularities they can't learn.
My claim is that if you want to predict/explain GPT-4's behaviour on a particular prompt, then normally it's best to learn something about the external world.