Introduction
Consider Act II Scene II of William Shakespeare's Julius Caesar.
In this scene, Caesar is at home with his wife Calphurnia, who has just had a bad dream and is pleading with him not to go to the Senate. Caesar initially agrees to stay home but changes his mind after being convinced by Decius Brutus that the dream was misinterpreted and that the Senate needs him to address important matters.
CAESAR: The cause is in my will: I will not come; That is enough to satisfy the senate. [...]
DECIUS BRUTUS: [...] If Caesar hide himself, shall they not whisper 'Lo, Caesar is afraid'? Pardon me, Caesar; for my dear dear love To our proceeding bids me tell you this; And reason to my love is liable.
CAESAR: How foolish do your fears seem now, Calphurnia! I am ashamed I did yield to them. Give me my robe, for I will go.
This was the morning of the Ides of March, 15 March 44 BC, which coincidentally is today's date. Caesar was assassinated during the Senate meeting.
Suppose I change Caesar's final line to —
CAESAR: My mind is firm, Decius. I'll stay within these walls, And not tempt Fortune on this cursed day. Worry me not, for I will stay.
— and feed this modified scene into GPT-4. What would the output be?
I don't know.
But how might I determine the answer?
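One option, if you just want to observe the answer, is to run the prompt and inspect the model's most likely next tokens. Here is a minimal sketch, assuming the openai Python client and a model that returns log-probabilities; exact parameter names vary across SDK versions, and the ellipsis stands for the full scene up to the modified line.

```python
# Minimal sketch: feed the modified scene to the API and inspect the top
# next-token log-probabilities. Assumes the openai package (v1-style client)
# and an API key in the environment; logprob support varies by model.
from openai import OpenAI

client = OpenAI()

modified_scene = (
    "...\n"  # the full scene up to this point goes here
    "CAESAR: My mind is firm, Decius. I'll stay within these walls,\n"
    "And not tempt Fortune on this cursed day. Worry me not, for I will stay.\n"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": modified_scene}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# The five most plausible openings of the continuation, with log-probabilities.
for candidate in response.choices[0].logprobs.content[0].top_logprobs:
    print(repr(candidate.token), candidate.logprob)
```

But running the model only tells you what the output is after the fact. The more interesting question is how you would predict it in advance, which brings us to the claim below.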
The claim
You might think that if you want to predict the logits layer of a large autoregressive transformer, then the best thing would be to learn about transformers. Maybe you should read Neel Nanda's blogposts on mechanistic interpretability. Or maybe you should read the arXiv papers on the GPT models.
But this probably won't help you predict the logits layer for this prompt.
Instead, if your goal is to predict the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
And maybe someone has already run GPT-4 on this prompt — if your goal is to explain the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
Maybe you're trying to construct a prompt which will make GPT-4 output a particular target continuation in Act II Scene III — if your goal is to control the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
Dataset vs architecture
The output of a neural network is determined by two things:
The architecture and training algorithm (e.g. transformers, SGD, cross-entropy)
The training dataset (e.g. internet corpus, literature, GitHub code)
As a rough rule-of-thumb, if you want to predict/explain/control the output of GPT-4, then it's far more useful to know about the training dataset than to know about the architecture and training algorithm.
In other words,
If you want to predict/explain/control the output of GPT-4 on Haskell code, you need to know Haskell.
If you want to predict/explain/control the output of GPT-4 on Shakespearean dialogue, you need to know Shakespeare.
If you want to predict/explain/control the output of GPT-4 on Esperanto, you need to know Esperanto.
If you want to predict/explain/control the output of GPT-4 on the MMLU benchmark, you need to know the particular facts in the benchmark.
I think alignment researchers (and AI researchers more generally) underestimate the extent to which knowledge of the training dataset is currently far more useful for prediction/explanation/control than knowledge of the architecture and training algorithm.
Recall that as the cross-entropy loss of an LLM steadily decreases, its next-token distribution (the softmaxed logits) asymptotically approaches the ground-truth distribution which generated the dataset. In the limit, predicting/explaining/controlling the input-output behaviour of the LLM reduces entirely to knowing the regularities in the dataset itself.
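To spell out the standard argument behind this: the cross-entropy loss decomposes into the entropy of the data-generating distribution plus a KL divergence from it. Writing $p$ for the ground-truth next-token distribution and $q_\theta$ for the model's,

$$
H(p, q_\theta) \;=\; \mathbb{E}_{x \sim p}\!\left[-\log q_\theta(x)\right] \;=\; H(p) + D_{\mathrm{KL}}\!\left(p \,\Vert\, q_\theta\right),
$$

so the loss is bounded below by $H(p)$, a floor set entirely by the dataset, and the gap above that floor is exactly the divergence between model and data. Driving the loss toward the floor therefore drives $q_\theta$ toward $p$; knowing $p$, i.e. knowing the regularities of the dataset, is what lets you anticipate $q_\theta$.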
Because GPT-4 has read approximately everything ever written, "knowing the dataset" basically means knowing all the facts about every aspect of the world — or (more tractably) it means consulting experts on the particular topic of the prompt.
Chess
As an analogy, if you want to predict/explain the moves of AlphaZero, it's better to know chess tactics and strategy than to know the Monte Carlo tree search algorithm.
Human-level AI acts... human
Source: @TetraspaceWest, @Repligate
Despite the popularity of the recent Shoggoth meme, state-of-the-art large language models are probably the most human they've ever been. They are also probably the most human they will ever be. I think we're roughly near the peak of Chris Olah's model interpretability graph.
Broadly speaking, if you want to predict/explain/control GPT-4's response to a particular question, then you can just ask an expert the same question and see what they would say. This is a pretty weird moment for AI — we won't stay in this phase forever, and many counterfactual timelines never go through this phase at all. Therefore, we should probably make the most of this phase while we still can.
Wordcels for alignment?
During this phase of the timeline (roughly GPT-3.5 – GPT-5.5), everyone has something to offer LLM interpretability. That includes academics who don't know how to code a Softmax function in PyTorch.
Here's the informal proof: GPT-4 knows everything about the world that any human knows, so if you know something about the world that no other human knows, then you know something about GPT-4 that no other human knows — namely, you know that GPT-4 knows that thing about the world.
The following framing should make this clearer —
You can think of a Shakespeare professor as "just someone who has closely studied a particular mini-batch of GPT-4's dataset, and knows many of the regularities that GPT might infer". Under this definition of a Shakespeare professor, it becomes more intuitive why they might be useful for interpreting LLMs, and my claim applies to experts in every single domain.
My practical advice to AI researchers: if you want to predict/explain/control GPT-4's output on a particular prompt, consult a domain expert.
Prediction:
Hey, what do you expect GPT-4 to answer for this organic chemistry question?
I don't know, let's ask an organic chemist.
Explanation:
Huh, that's weird. Why did GPT-4 output the wrong answer to this organic chemistry question?
I don't know, let's ask an organic chemist.
Control:
What prompt should we use to get GPT-4 to answer this organic chemistry question correctly?
I don't know, let's ask an organic chemist.
Vibe-awareness
I'm not suggesting that OpenAI and Anthropic hire a team of poetry grads for prompt engineering. It's true that for prompt engineering, you'll need knowledge of literature, psychology, history, etc — but most humans probably pick up enough background knowledge just by engaging with the world around them.
What I'm suggesting is that you actually use that background knowledge. You should actively think about everything you know about every aspect of the world.
The two intuitions I've found difficult to transmit about prompt engineering are vibe-awareness and context-relevance.
Specifically, vibe-awareness means —
It's not just what you say, it's how you say it.
It's not just the denotation of your words, it's the connotation.
It's not just the meaning of your sentence, it's the tone of voice.
It's reading between the lines.
Now, I think humans are pretty vibe-aware 90% of the time, because we're social creatures who spend most of our lives vibing with other humans. However, vibe-awareness is antithetical to 100 years of computer science, so programmers have conditioned themselves into an intentional state of vibe-obliviousness whenever they sit in front of a computer. This is understandable because a Python interpreter is vibe-oblivious — it doesn't do conversational implicature, it doesn't care about your tone of voice, it doesn't care about how you name your variables, etc. But SOTA LLMs are vibe-aware, and programmers need to unlearn this habit of vibe-obliviousness if they want to predict/explain/control SOTA LLMs.
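To make the contrast concrete, here is an invented pair of prompts with roughly the same literal content but a very different vibe. A vibe-oblivious reading treats them as interchangeable strings; an LLM conditioned on them will typically reply in a different register, and possibly with different levels of candour and detail.

```python
# Two requests with roughly the same denotation but very different vibes.
# To the interpreter these are just strings; to an LLM they imply different
# personas, registers, and expectations about the reply.
PROOF = "..."  # stand-in for whatever text you want assessed

prompt_formal = (
    "Dear Assistant, I would be most grateful if you could assess whether "
    "the following proof is correct.\n\n" + PROOF
)
prompt_casual = "yo is this proof legit or nah\n\n" + PROOF

# In practice you would send each prompt to the model and compare the replies;
# here we just show that the literal request is (nearly) the same.
for prompt in (prompt_formal, prompt_casual):
    print(prompt, end="\n\n")
```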
Additionally, context-relevance means —
A word or phrase can mean different things depending on the words around it.
The circumstance surrounding a message will affect the response.
The presuppositions of an assertion will shape the response to other assertions.
I think context-relevance is a concept that programmers already grok (it's basically namespaces), but it's also something to keep in mind when talking to SOTA LLMs.
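For readers who think in code, here is a toy version of the namespace analogy; the classes are invented purely for illustration. The same method name resolves to different behaviour depending on the object it hangs off, just as the same word resolves to a different meaning depending on the prompt around it.

```python
# The same name means different things in different namespaces, much as the
# same word means different things in different prompt contexts.

class Play:
    """Here 'act' is a division of a drama."""
    def act(self) -> str:
        return "Act II Scene II"

class Robot:
    """Here 'act' is something you do."""
    def act(self) -> str:
        return "move forward"

for thing in (Play(), Robot()):
    # Identical call syntax, different resolution: the enclosing context decides.
    print(type(thing).__name__, "->", thing.act())
```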
Quiz
Question 1: Why does ChatGPT's response depend on which synonym is used?
Question 2: If you ask the chatbot a question in a polite formal tone, is it more likely to lie than if you ask the chatbot in a casual informal tone?
Question 3: Explain why "Prometheus" is a bad name for a chatbot.
Question 4: Why would this prompt elicit the opposite behaviour?
Disclaimer
Learning about transformers is definitely useful. If you want to predict/explain GPT-4's behaviour across all prompts then the best thing to learn is the transformer architecture. This will help you predict/explain what kind of regularities transformer models can learn and which regularities they can't learn.
My claim is that if you want to predict/explain GPT-4's behaviour on a particular prompt, then normally it's best to learn something about the external world.