Thomas Kwa

Just left Vivek Hebbar's team at MIRI, now doing various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.


Catastrophic Regressional Goodhart

Wiki Contributions


The "surgical model edits" section should also have a subsection on editing model weights. For example there's this paper on removing knowledge from models using multi-objective weight masking.

See also Holtman’s neglected result.

Does anyone have a technical summary? This sounds pretty exciting, but the paper is 35 pages and I can't find a summary anywhere that straightforwardly tells me a formal description of the setting, why it satisfies the desiderata it does, and what this means for the broader problem of reflective stability in shutdownable agents.

Suppose that we are selecting for  where V is true utility and X is error. If our estimator is unbiased ( for all v) and X is light-tailed conditional on any value of V, do we have ?

No; here is a counterexample. Suppose that , and  when , otherwise . Then I think .

This is worrying because in the case where  and  independently, we do get infinite V. Merely making the error *smaller* for large values of V causes catastrophe. This suggests that success caused by light-tailed error when V has even lighter tails than X is fragile, and that these successes are “for the wrong reason”: they require a commensurate overestimate of the value when V is high as when V is low.

We might want to keep our AI from learning a certain fact about the world, like particular cognitive biases humans have that could be used for manipulation. But a sufficiently intelligent agent might discover this fact despite our best efforts. Is it possible to find out when it does this through monitoring, and trigger some circuit breaker?

Evals can measure the agent's propensity for catastrophic behavior, and mechanistic anomaly detection hopes to do better by looking at the agent's internals without assuming interpretability, but if we can measure the agent's beliefs, we can catch the problem earlier. Maybe there can be more specific evals we give to the agent, which are puzzles that can only be solved if the agent knows some particular fact. Or maybe the agent is factorable into a world-model and planner, and we can extract whether it knows the fact from the world-model.

Have the situational awareness people already thought about this? Does anything change when we're actively trying to erase a belief?

Eight beliefs I have about technical alignment research

Written up quickly; I might publish this as a frontpage post with a bit more effort.

  1. Conceptual work on concepts like “agency”, “optimization”, “terminal values”, “abstractions”, “boundaries” is mostly intractable at the moment.
    • Success via “value alignment” alone— a system that understands human values, incorporates these into some terminal goal, and mostly maximizes for this goal, seems hard unless we’re in a very easy world because this involves several fucked concepts.
  2. Whole brain emulation probably won’t happen in time because the brain is complicated and biology moves slower than CS, being bottlenecked by lab work.
  3. Most progress will be made using simple techniques and create artifacts publishable in top journals (or would be if reviewers understood alignment as well as e.g. Richard Ngo).
  4. The core story for success (>50%) goes something like:
    • Corrigibility can in practice be achieved by instilling various cognitive properties into an AI system, which are difficult but not impossible to maintain as your system gets pivotally capable.
    • These cognitive properties will be a mix of things from normal ML fields (safe RL), things that rhyme with normal ML fields (unlearning, faithfulness), and things that are currently conceptually fucked but may become tractable (low impact, no ontological drift).
    • A combination of oversight and these cognitive properties is sufficient to get useful cognitive work out of an AGI.
    • Good oversight complements corrigibility properties, because corrigibility both increases the power of your most capable trusted overseer and prevents your untrusted models from escaping.
  5. Most end-to-end “alignment plans” are bad for three reasons: because research will be incremental and we need to adapt to future discoveries, because we need to achieve several things for AI to go well (no alignment magic bullet), and because to solve the hardest worlds that are possible, you have to engage with MIRI threat models which very few people can do well [1].
    • e.g. I expect Superalignment’s impact to mostly depend on their ability to adapt to knowledge about AI systems that we gain in the next 3 years, and continue working on relevant subproblems.
  6. The usefulness of basic science is limited unless you can eventually demonstrate some application. We should feel worse about a basic science program the longer it goes without application, and try to predict how broad the application of potential basic science programs will be.
    • Glitch tokens work probably won’t go anywhere. But steering vectors are good because there are more powerful techniques in that space.
    • The usefulness of sparse coding depends on whether we get applications like sparse circuit discovery, or intervening on features in order to usefully steer model behavior. Likewise with circuits-style mechinterp, singular learning theory, etc.
  7. There are convergent instrumental pressures towards catastrophic behavior given certain assumptions about how cognition works, but the assumptions are rather strong and it’s not clear if the argument goes through.
    • The arguments I currently think are strongest are Alex Turner’s power-seeking theorem and an informal argument about goals.
  8. Thoughts on various research principles picked up from Nate Soares
    • You should have a concrete task in mind when you’re imagining an AGI or alignment plan: agree. I usually imagine something like “Apollo program from scratch”.
    • Non-adversarial principle (A safe AGI design should not become unsafe if any part of it becomes infinitely good at its job): unsure, definitely agree with weaker versions
    • To make any alignment progress we must first understand cognition through either theory or interpretability: disagree
    • You haven’t engaged with the real problem until your alignment plan handles metacognition, self-modification, etc.: weakly disagree; wish we had some formalism for “weak metacognition” to test our designs against [2]

[1], [2]: I expect some but not all of the MIRI threat models to come into play. Like, when we put safeguards into agents, they'll rip out or circumvent some but not others, and it's super tricky to predict which. My research with Vivek often got stuck by worrying too much about reflection, others get stuck by worrying too little.

The independent-steps model of cognitive power

A toy model of intelligence implies that there's an intelligence threshold above which minds don't get stuck when they try to solve arbitrarily long/difficult problems, and below which they do get stuck. I might not write this up otherwise due to limited relevance, so here it is as a shortform, without the proofs, limitations, and discussion.

The model

A task of difficulty n is composed of  independent and serial subtasks. For each subtask, a mind of cognitive power  knows  different “approaches” to choose from. The time taken by each approach is at least 1 but drawn from a power law,  for , and the mind always chooses the fastest approach it knows. So the time taken on a subtask is the minimum of  samples from the power law, and the overall time for a task is the total for the n subtasks.

Main question: For a mind of strength ,

  • what is the average rate at which it completes tasks of difficulty n?
  • will it be infeasible for it to complete sufficiently large tasks?


  • There is a critical threshold  of intelligence below which the distribution of time to complete a subtask has infinite mean. This threshold depends on .
    • This implies that for an n-step task, the median of average time-per-subtask grows without bound as n increases. So (for minds below the critical threshold) the median time to complete a whole task grows superlinearly with n.
  • Above the critical threshold, minds can solve any task in expected linear time.
  • Some distance above the critical threshold, minds are running fairly close to the optimal speed, and further increases in Q cause small efficiency gains.
  • I think this doesn't depend on the function being a power law; it would be true for many different heavy-tailed distributions, but the math wouldn't be as nice.

I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.

It's always trickier to reason about post-hoc, but some of the observations could be valid, non-cherry-picked parallels between evolution and deep learning that predict further parallels.

I think looking at which inspired more DL capabilities advances is not perfect methodology either. It looks like evolution predicts only general facts whereas the brain also inspires architectural choices. Architectural choices are publishable research whereas general facts are not, so it's plausible that evolution analogies are decent for prediction and bad for capabilities. Don't have time to think this through further unless you want to engage.

One more thought on learning rates and mutation rates:

As far as I know optimal learning rate for most architectures is scheduled, and decreases over time, which is not a feature of evolution so far as I am aware?

This feels consistent with evolution, and I actually feel like someone clever could have predicted it in advance. Mutation rate per nucleotide is generally lower and generation times are longer in more complex organisms; this is evidence that lower genetic divergence rates are optimal, because evolution can tune them through e.g. DNA repair mechanisms. So it stands to reason that if models get more complex during training, their learning rate should go down.

Does anyone know if decreasing learning rate is optimal even when model complexity doesn't increase over time?

I'm finally engaging with this after having spent too long afraid of the math. Initial thoughts:

  • This result is really impressive and I'm surprised it hasn't been curated. My guess is that it's not presented in the most accessible way, so maybe it deserves a distillation.
  • The conclusion isn't as strong or clean as I'd want. It's not clear how to think about orbit-level power-seeking. I'd be excited about a stronger conclusion but wouldn't know how to get it.
  • I found the above sentence from the explainer interesting: "There is no possible way to combine EU-based decision-making functions so that orbit-level instrumental convergence doesn't apply to their composite." Elliott Thornley also has a theorem deriving nonshutdownability from assumptions like "Indifference to Attempted Button Manipulation: The agent is indifferent between trajectories that differ only with respect to the actions chosen in shutdown-influencing states." Together, maybe these point at a general principle that corrigible agents must care about means, not just ends.
  • Some confusions I'm still trying to resolve:
    • Can we say that power-seeking agents will disempower humans? I saw a post in the sequence about POWER in multi-agent games.
    • How do AUP agents get around these theorems?
    • If LLMs end up being useful, how do they get around these theorems? Can we get some result where if RLHF has a capabilities component and a power-averseness component, the capabilities component can cause the agent to be power-seeking on net?
    • Can we get a crude measure of how power-seeking agents will be in the real world, especially with the weakened assumptions of this paper?

Disagree on several points. I don't need future AIs to satisfy some mathematically simple description of corrigibility, just for them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some argument why this is unlikely, I haven't seen a good rigorous version.

As Algon says in a sibling comment, non-agentic systems are by default shutdownable, myopic, etc. In addition, there are powerful shutdownable systems: KataGo can beat me at Go but doesn't prevent itself from being shut down for instrumental reasons, whereas humans generally will. So there is no linear scale of "powerful optimizer" that determines whether a system is easy to shut down. If there is some property of competent systems in practice that does prevent shutdownability, what is it? Likewise with other corrigibility properties. That's what I'm trying to get at with my comment. "Goal-oriented" is not an answer, it's not specific enough for us to make engineering progress on corrigibility.

I think the claim that there is no description of corrigibility to which systems can easily generalize is really strong. It's plausible to me that corrigibility-- again, in this practical rather than mathematically elegant sense-- is rare or anti-natural in systems competent enough to do novel science efficiently, but it seems like your claim is that it's incoherent. This seems unlikely because myopia, shutdownability, and the other properties on Eliezer's laundry list are just ordinary cognitive properties that we can apply selection pressure on, and modern ML is pretty good at generalizing. Nate's post here is arguing that we are unlikely to get corrigibility without investing in an underdeveloped "science of AI" that gives us mechanistic understanding, and I think there needs to be some other argument here for it to be convincing, but your claim seems even stronger.

I'm also unsure why you say shutdownability hasn't been formalized. I feel like we're confused about how to get shutdownability, not what it is.

This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.

Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says "This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative AI". I would still defend the past research into it as good basic science, because we might encounter failure modes somewhat related to it.

Load More