Thomas Kwa

Doing alignment research with Vivek Hebbar's team at MIRI.

Wiki Contributions


FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.

A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?

not Nate or a military historian, but to me it seems pretty likely for a ~100 human-years more technologically advanced actor to get decisive strategic advantage over the world.

  • In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today as well with command-and-control and various other capabilities
  • I would guess pure fusion weapons are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium
  • Currently on the cutting edge, the most advanced actors have large multiples over everyone else in important metrics. This is due to either a few years' lead or better research practices still within the human range
    • SMIC is mass producing the 14nm node whereas Samsung is at 3nm, which is something like 5x better FLOPS/watt
    • algorithmic improvements driven by cognitive labor of ML engineers have caused multiple OOM improvement in value/FLOPS
    • SpaceX gets 10x better cost per ton to orbit than the next cheapest space launch provider, and this is before Starship. Also their internal costs are lower

This seems sufficient for "what failure looks like" scenarios, with faster disempowerment through hard takeoff likely to depend on other pathways like nanotech, social engineering, etc. As for the whole argument against "heavy precedent", I'm not convinced either way and haven't thought about it a ton.

There's a clarification by John here. I heard it was going to be put on Superlinear but unclear if/when.

Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they don't:

  • messy and non-robust maps between any clean concept and what we actually care about, such that more of the difficulty in research is in figuring out the map. The Standard Model of physics describes all the important physics behind protein folding, but we actually needed to invent AlphaFold.
  • The True Name doesn't quite represent what we care about. Tiling agents is a True Name for agents building successors, but we don't care that agents can rigorously prove things about their successors.
  • question is fundamentally ill-posed: what's the True Name of a crab? what's the True Name of a ghost?

Most of these examples are bad, but hopefully they get the point across.

Were any cautious people trying empirical alignment research before Redwood/Conjecture?

Do you have thoughts on when there are two algorithms that aren’t “doing the same thing” that fall within the same loss basin?

It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compensate.

  • Here, there are degrees of freedom in specifying the color basis and parameters can probably be eliminated, but it would be more interesting to see examples where two semantically different algorithms fall within the same basin without removable degrees of freedom, either because the Hessian has no zero eigenvalues, or because parameters cannot be removed despite the Hessian having a zero eigenvalue.

The ultimate goal of John Wentworth’s sequence "Basic Foundations for Agent Models" is to prove a selection theorem of the form:

  • Premise (as stated by John): “a system steers far-away parts of the world into a relatively-small chunk of their state space”
  • Desired conclusion: The system is very likely (probability approaching 1 with increasing model size / optimization power / whatever) consequentialist, in that it has an internal world-model and search process. Note that this is a structural rather than behavioral property.

John has not yet proved such a result and it would be a major advance in the selection theorems agenda. I also find it plausible that someone without specific context could do meaningful work here. As such, I’ll offer a $5000 bounty to anyone who finds a precise theorem statement and beats John to the full proof (or disproof + proof of a well-motivated weaker statement). This bounty will decrease to zero as the sequence is completed and over the next ~12 months. Partial contributions will be rewarded proportionally.

Note that the particular form of "nonexistence of a representative agent" John mentions is an original result that's not too difficult to show informally, but hasn't really been written down formally either here or in the economics literature.

Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent utility function in general. A bounty also hasn't found any such papers.

Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.

At first this claim seemed kind of wild, but there's a version of it I agree with.

It seems like conditional on the inner optimizer being corrigible, in the sense of having a goal that's a pointer to some optimizer "outside" it, it's underspecified what it should point to. In the evolution -> humans -> gradient descent -> model example, corrigibility as defined in RLO could mean that the model is optimizing for the goals of evolution, humans, or the gradient. This doesn't seem to be different between the RLO and steered optimization stories.

I think the analogy to corrigible alignment among humans being hedonism assumes that a corrigibly aligned optimizer's goal would point to the thing immediately upstream of its reward. This is not obvious to me. It seems like wireheading / manipulating reward signals is a potential problem, but this is just a special case of not being able to steer an inner optimizer even conditional on it having a narrow corrigibility property.

Load More