All of Thomas Kwa's Comments + Replies

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

Were any cautious people trying empirical alignment research before Redwood/Conjecture?

2Ajeya Cotra1mo
Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn't like that work and so wouldn't consider it a counterexample to his view).
Hessian and Basin volume

Do you have thoughts on when there are two algorithms that aren’t “doing the same thing” that fall within the same loss basin?

It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compe... (read more)

4Vivek Hebbar1mo
From this paper: "Theoretical work limited to ReLU-type activation functions showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)". So for overparameterized nets, the answer is probably:
  • There is only one solution manifold, so there are no separate basins. Every solution is connected.
  • We can salvage the idea of "basin volume" as follows:
    • In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
    • In the dimensions parallel to the manifold, ask "how far can I move before it stops being the 'same function'?". If we define "sameness" as "same behavior on the validation set",[1] then this means looking at the Jacobian of that behavior in the plane of the manifold.
    • Multiply the two hypervolumes to get the hypervolume of our "basin segment" (very roughly, the region of the basin which drains to our specific model).
1. ^ There are other "sameness" measures which look at the internals of the model; I will be proposing one in an upcoming post.
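The perpendicular half of this proposal can be sketched concretely. Below is a minimal toy example (the loss function, tolerance, and helper names are mine, not from the comment): a quadratic loss whose minima form a 1-D manifold, where we split the Hessian spectrum into stiff directions (perpendicular to the manifold) and flat ones (along it), and compute the Gaussian-approximation cross-section volume from the stiff eigenvalues.

```python
import numpy as np

# Toy loss with a flat direction: L(w) = (w0 + w1 - 1)^2 + w2^2.
# The solution set {w0 + w1 = 1, w2 = 0} is a 1-D manifold, so one
# Hessian eigenvalue is (numerically) zero along it.
def loss(w):
    return (w[0] + w[1] - 1.0) ** 2 + w[2] ** 2

def hessian(f, w, eps=1e-4):
    """Central finite-difference Hessian (exact here, since L is quadratic)."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            wpp = w.copy(); wpp[i] += eps; wpp[j] += eps
            wpm = w.copy(); wpm[i] += eps; wpm[j] -= eps
            wmp = w.copy(); wmp[i] -= eps; wmp[j] += eps
            wmm = w.copy(); wmm[i] -= eps; wmm[j] -= eps
            H[i, j] = (f(wpp) - f(wpm) - f(wmp) + f(wmm)) / (4 * eps ** 2)
    return H

w_star = np.array([0.3, 0.7, 0.0])   # a point on the solution manifold
eigvals = np.linalg.eigvalsh(hessian(loss, w_star))
tol = 1e-6
stiff = eigvals[eigvals > tol]       # directions perpendicular to the manifold
flat = eigvals[eigvals <= tol]       # directions along the manifold

# Gaussian-approximation cross-section volume in the stiff directions
# is proportional to prod_i lambda_i^{-1/2}.
cross_section = np.prod(stiff ** -0.5)
print(len(flat), cross_section)
```

The parallel directions (the `flat` eigenvalues) are exactly where the Hessian gives no information and where the proposed Jacobian-based "sameness" measure would take over.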
Utility Maximization = Description Length Minimization

The ultimate goal of John Wentworth’s sequence "Basic Foundations for Agent Models" is to prove a selection theorem of the form:

  • Premise (as stated by John): “a system steers far-away parts of the world into a relatively-small chunk of their state space”
  • Desired conclusion: The system is very likely (probability approaching 1 with increasing model size / optimization power / whatever) consequentialist, in that it has an internal world-model and search process. Note that this is a structural rather than behavioral property.

John has not yet proved su... (read more)

Why Subagents?

Note that the particular form of "nonexistence of a representative agent" John mentions is an original result that's not too difficult to show informally, but hasn't really been written down formally either here or in the economics literature.

Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent util... (read more)

Mesa-Optimizers vs “Steered Optimizers”

Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.

At first this claim seemed kind of wild, but there's a version of it I agree with.

It seems like conditional on the inner optimizer being corrigible, in the sense of having a goal that's a poin... (read more)

2Steve Byrnes1mo
Hmm, I think it’s probably more productive to just talk directly about the “steered optimizer” thing, instead of arguing about what’s the best analogy with RLO. ¯\_(ツ)_/¯ BTW this is an old post; see my more up-to-date discussion here, esp. Posts 8–10.
Probability is Real, and Value is Complex

I think a lot of commenters misunderstand this post, or think it's trying to do more than it is. TLDR of my take: it's conveying intuition, not suggesting we should model preferences with 2D vector spaces.

The risk-neutral measure in finance is one way that "rotations" between probability and utility can be made:

  • under the actual measure P, agents have utility nonlinear in money (e.g. risk aversion), and probability corresponds to frequentist notions
  • under the risk-neutral measure Q, agents have utility linear in money, and probability is skewed towards losing
... (read more)
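The "rotation" in the bullets above can be shown with a two-state toy model (the numbers and the log-utility choice are mine, for illustration): reweighting the physical probabilities P by marginal utility yields the risk-neutral probabilities Q, under which the same agent prices payoffs as if it were risk-neutral.

```python
import numpy as np

# Two future states with physical ("actual") probabilities P, and an
# agent with log utility over wealth (a standard risk-averse choice).
p = np.array([0.5, 0.5])          # measure P
wealth = np.array([50.0, 150.0])  # wealth in each state
marginal_u = 1.0 / wealth         # u(w) = log(w)  =>  u'(w) = 1/w

# The risk-neutral measure Q reweights P by marginal utility:
# q_i proportional to p_i * u'(w_i).
q = p * marginal_u
q /= q.sum()
print(q)  # Q puts more weight on the low-wealth (losing) state
```

The nonlinearity has been moved out of the utility function and into the probabilities, which is the sense in which probability and utility can be traded off against each other.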

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything without having a utility function anywhere, which the expected values are "expectations of". 

A utility function is a gadget for turning probability distributions into expected values. This object makes sense in ... (read more)

[Link] A minimal viable product for alignment

I think we need to unpack "sufficiently aligned"; here's my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, and B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI that, conditional on a proposal looking promising, is likely to be actually correct.

  • A system that produces a random 10000-bit string that looks promising to a human reviewe
... (read more)
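The counting argument above can be made explicit (assuming, as the comment implicitly does, that every actually-correct proposal also looks promising): for a system that samples uniformly among promising-looking strings, the chance of correctness is C/B.

```python
# Working in log2, since the raw counts are astronomically large.
log2_A = 10000   # all 10000-bit strings
log2_B = 200     # strings that look promising to a human reviewer
log2_C = 100     # strings that are actually correct

# P(correct | looks promising) = C / B = 2^(100 - 200) = 2^-100
# for a uniform sample over promising-looking strings.
log2_p_correct_given_promising = log2_C - log2_B
print(log2_p_correct_given_promising)  # -100
```

So "sufficiently aligned" has to mean the system's conditional distribution over promising-looking strings is enormously better than uniform.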
3Paul Christiano4mo
Is your story:
  1. AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
  2. Actually, if a human was trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.
It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations "fool us" is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.), I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
1Vivek Hebbar4mo
Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
1Buck Shlegeris2y
If the linked SSC article is about the aestivation hypothesis, see the rebuttal here.