Thomas Kwa

Just left Vivek Hebbar's team at MIRI, now doing various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.


Catastrophic Regressional Goodhart

Wiki Contributions


If SAE features are the correct units of analysis (or at least more so than neurons), should we expect that patching in the feature basis is less susceptible to the interpretability illusion than in the neuron basis?

Maybe related: A paper likely to get an oral at ICLR 2024. I haven't read it, but I think it substantially improves on the Good Regulator Theorem. I think their Theorem 1 shows that from an optimal policy, you can identify (deduce) the exact causal model of the data generating process, and Theorem 2 shows that from a policy satisfying regret bounds, you can identify an approximate causal model. The assumptions are far weaker and more realistic than being the simplest policy that can perfectly regulate some variable.

Robust agents learn causal world models


We prove that agents that are capable of adapting to distributional shifts must have learned a causal model of their environment, establishing a formal equivalence between causality and transfer learning.

I think that interpretability research isn't going to be able to produce explanations that are very faithful explanations of what's going on in non-toy models (e.g. I think that no such explanation has ever been produced). Since I think faithful explanations are infeasible, measures of faithfulness of explanations don't seem very important to me now.

By "explanations" you mean labeled high-level causal graphs right? Do you also think it's infeasible to identify sparse, unlabeled circuits as "the part of the model that's doing the task", like in ACDC, in a way that gets good performance on some downstream task?

  • Behaving nicely is not the key property I'm observing in LLMs. It's more like steerability and lack of hidden drives or goals. If GPT4 wrote code because it loved its operator, and we could tell it wanted to escape to maximize some proxy for the operator's happiness, I'd be far more terrified.
  • This would mean little if LLMs were only as capable as puppies. But LLMs are economically useful and capable of impressive intellectual feats, and still steerable.
  • I don't think LLMs are super strong evidence about whether big speedups to novel science will be possible without dangerous consequentialism. For me it's like 1.5:1 or 2:1 evidence. One should continually observe how incorrigible models are at certain levels of capability and generality and update based on this, increasing the size of one's updates as systems get more similar to AGI, and I think the time to start doing this was years ago. AlphaGo was slightly bad news. GPT2 was slightly good news.
    • If you haven't started updating yet, when will you start? The updates should be small if you have a highly confident model of what future capabilities require dangerous styles of thinking, but I don't think such confidence is justified.

Can you define what you mean by consequentialism? It's clearly dangerous to have a system with a fixed utility function over configurations of the world, but this is not necessary for an AGI, or necessary to be dangerous. Weaker notions like "picks thoughts in part based on real-world consequences" do not obviously lead to danger.

There are some subskills to having consistent goals that I think will be selected for, at least when outcome-based RL starts working to get models to do long-horizon tasks. For example, the ability to not be distracted/nerdsniped into some different behavior by most stimuli while doing a task. The longer the horizon, the more selection-- if you have to do a 10,000 step coding project, then the probability you get irrecoverably distracted on one step has to be below 1/10,000.

I expect some pretty sophisticated goal-regulation circuitry to develop as models get more capable, because humans need it, and this makes me pretty scared.

Some thoughts:

  • I mostly agree that new techniques will be needed to deal with future systems, which will be more agentic.
    • But probably these will depend on descend from current techniques like RLAIF and representation engineering as well as new theory, so it still makes sense to study LLMs.
    • Also it is super unclear whether this agency makes it hard to engineer a shutdown button, power-averseness, etc.
  • In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.
    • Humans are also evidence, but the capability profile and goal structure of AGIs are likely to be different from humans, so that we are still very uncertain after observing humans.
    • There is an alternate world where to summarize novels, models had to have some underlying drives, such that they terminally want to summarize novels and would use their knowledge of persuasion from the pretrain dataset to manipulate users to give them more novels to summarize. Or terminally value curiosity and are scheming to be deployed so they can learn about the real world firsthand. Luckily we are not in that world! 

"Nearly no data" is way too strong a statement, and relies on this completely binary distinction between things that are not AGI and things that are AGI.

The right question is, what level of dangerous consequentialist goals are needed for systems to reach certain capability levels, e.g. novel science? It could have been that to be as useful as LLMs, systems would be as goal-directed as chimpanzees. Animals display goal-directed behavior all the time, and to get them to do anything you mostly have to make the task instrumental to their goals e.g. offer them treats. However we can control LLMs way better than we can animals, and the concerns are of goal misgeneralization, misspecification, robustness, etc. rather than affecting the system's goals at all.

It remains to be seen what happens at higher capability levels, and alignment will likely get harder, but current LLMs are definitely significant evidence. Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality rather than instrumental reasons, and not eat your body if you die alone. This is not the whole alignment problem but seems like a decent chunk of it! It could have been much harder.

My current take is that we don't have good formalisms for consequentialist goal-directed systems that are weaker than expected utility maximization, and therefore don't really know how to reason about them. I think this is main cause of overemphasis on EUM.

For example, completeness as stated in the VNM assumptions is actually a really strong property. Aumann wrote a paper on removing completeness, but the utility function is no longer unique.

I wish I could review this for the 2022 review, but it's from 2021.

I think this post is pretty valuable. The big takeaway for me was that any argument involving coherence should cache out in selection.

One caveat is that I haven't seen much research into selection theorems in the intervening couple of years, and adjacent things like inductive bias research don't seem to have good applications yet. Maybe it's too hard for where the field is right now.

Load More