All of Thomas Kwa's Comments + Replies

Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:

Downvoted, this is very far from a well-structured argument, and doesn't give me intuitions I can trust either

1Raymond Arnold16d
I didn't downvote but didn't upvote and generally wish I had an actual argument to link to when discussing this concept.

I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.

edit: The proof is easy. Let f1, f2 be two such indistinguishable functions that you place positive probability on, F be a random variable for the function, and F' be F but with all probability mass for f1 replaced by f2. Then .... (read more)
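A toy sketch of the equivalence-class structure behind this claim (my own illustrative setup, not from the comment): with 3-bit inputs but only s = 4 observable input/output pairs, the 2^8 boolean functions collapse into groups that no observation can tell apart, so a prior loses nothing by concentrating each group's mass on one representative.

```python
from itertools import product

n_bits = 3
inputs = list(product([0, 1], repeat=n_bits))  # all 8 possible inputs
observed = inputs[:4]                          # the observer only ever sees these s=4 inputs

# Enumerate all 2^8 boolean functions as output tuples over the 8 inputs.
functions = list(product([0, 1], repeat=len(inputs)))

def signature(f):
    """Outputs on the observed inputs -- all an observer can ever learn about f."""
    return tuple(f[inputs.index(x)] for x in observed)

# Group functions by signature: members of a group are mutually indistinguishable.
classes = {}
for f in functions:
    classes.setdefault(signature(f), []).append(f)

print(len(classes), {len(v) for v in classes.values()})  # → 16 {16}
```

Each of the 16 distinguishable classes contains 16 functions, so a prior only needs support on 16 representatives rather than all 256 functions.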

  • Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending against self-deception favors weight-based methods, but these seem uncommon.
  • I thought briefly about the Ilharco et al paper and am very impressed by it as well.
  • Thanks for linking to the resources.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

4Alex Turner19d
Weight vectors are derived through fine-tuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (eg via RLHF) when writing why you were excited about activation additions, I don't see how this paper changes the balance very much? (I wrote my thoughts here in Activation additions have advantages over (RL/supervised) finetuning []) I think the main additional piece of information given by the paper is the composability of finetuned edits unlocking a range of finetuning configurations, which grows exponentially with the number of composable edits. But I personally noted that finetuning enjoys this benefit in the original version of the post. There's another strength which I hadn't mentioned in my writing, which is that if you can finetune into the opposite direction of the intended behavior (like you can make a model less honest somehow), and then subtract that task vector, you can maybe increase honesty, even if you couldn't just naively finetune that honesty into the model.[1] But, in a sense, task vectors are "still in the same modalities we're used to." Activation additions jolted me because they're just... a new way[2] of interacting with models! There's been way more thought and research put into finetuning and its consequences, relative to activation engineering and its alignment implications. I personally expect activation engineering to open up a lot of affordances for model-steering.  1. ^ This is a kinda sloppy example because "honesty" probably isn't a primitive property of the network's reasoning. Sorry. 2. ^ To be very clear about the novelty of our contributions, I'll quote the "Summary of relationship to prior work" section: But this "activation engineer

I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want e.g. GPT-4 can answer questions with only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far O... (read more)

2Dan H19d
(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper) []. If it isn't, then this point is still relevant for alignment, since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be spotted in the weights more easily than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don't see this as a strong reason to dismiss []. I would want a weight version of a method and an activation version of a method; they tend to have different strengths.

Note: if you want to keep track of safety papers outside of LW/AF, papers including [] were tweeted on [] and posted on []

Edit: I see passive disagreement but no refutation. The argument against weights was of the form "here's a strength activations have"; for it to be enough to dismiss the paper without discussion, that must be an extremely strong property that outweighs all of the paper's potential merits, or a Pareto improvement. Neither seems corroborated or at all obvious.

This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.

Edit: I explain this view in a reply.

Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions.

What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training. 

SGD has inductive biases, but we'd have to actually engineer them to get high gold reward rather than high proxy reward when only trained on the proxy. In the Gao et al. paper, optimization and overoptimization happened at the same relative rate in RL as in conditioning, so I think the null hypothesis is that training does about as well as conditioning. I'm pretty excited about work that improves on that paper to get higher gold reward while only having access to the proxy reward model.
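The conditioning side of that comparison is easy to simulate. A minimal best-of-n sketch (my own toy model, not the Gao et al. setup): gold reward and proxy error are independent standard normals, and selecting on the proxy buys gold reward at a predictable discount while the proxy-gold gap keeps widening.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n(n, trials=2000):
    """Pick the sample with the highest proxy score; report mean proxy and gold reward."""
    gold = rng.standard_normal((trials, n))    # "true" (gold) reward
    noise = rng.standard_normal((trials, n))   # independent proxy error
    proxy = gold + noise
    pick = proxy.argmax(axis=1)
    rows = np.arange(trials)
    return proxy[rows, pick].mean(), gold[rows, pick].mean()

results = {n: best_of_n(n) for n in [1, 10, 100, 1000]}
for n, (p, g) in results.items():
    print(f"n={n:4d}  proxy={p:+.2f}  gold={g:+.2f}")
```

With equal-variance gold and noise, regression toward the mean means the selected sample's gold reward is about half its proxy reward, so harder selection keeps increasing gold reward here (Gaussians are light-tailed) but the overoptimization gap grows without bound.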

I think the point still holds in mainline shard theory world, which in m... (read more)

That section is even more outdated now. There's nothing on interpretability, Paul's work now extends far beyond IDA, etc. In my opinion it should link to some other guide.

1Oliver Habryka24d
Yeah, it does sure seem like we should update something here. I am planning to spend more time on AIAF stuff soon, but until then, if someone has a drop-in paragraph, I would probably lightly edit it and then just use whatever you send me/post here.

This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work.

Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.

I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance-- my current guess is neutral-- but I do see these similarities:

  • It's incredibly difficult and incentive-incompatible with existing groups in power
  • There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
  • There are some obvious negative effects; potential overhangs or greater inc
... (read more)
2Alex Turner1mo
Why does this have to be true? Can't governments just compensate existing AGI labs for the expected commercial value of their foregone future advances due to indefinite pause? 

I'm planning to write a post called "Heavy-tailed error implies hackable proxy". The idea is that when you care about V and are optimizing for a proxy U = X + V, Goodhart's Law sometimes implies that optimizing hard enough for U causes V to stop increasing.

A large part of the post would be proofs about what the distributions of X and V must be for E[V | X + V ≥ t] → 0 as t → ∞, where X and V are independent random variables with mean zero. It's clear that

  • X must be heavy-tailed (or long-tailed or som
... (read more)
Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when the error is definitely not heavy-tailed, the behavior is monotonic, for Regressional Goodhart ( []). Jacob probably has more detailed takes on this than me. In any event my intuition is that this seems unlikely to be the main reason for overoptimization -- I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent
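A quick Monte Carlo sketch of the claimed effect (my own toy distributions: Pareto error with tail index 1.5, standard normal value): conditioning on ever larger proxy values drives the conditional mean of V back toward zero, because extreme proxy samples are increasingly explained by the heavy-tailed error alone.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

alpha = 1.5                  # tail index: heavy-tailed, finite mean
X = rng.pareto(alpha, N)
X -= X.mean()                # center the error empirically so E[X] ~ 0
V = rng.standard_normal(N)   # light-tailed true value, mean zero
U = X + V                    # the proxy actually being optimized

cond_mean = {}
for q in [0.5, 0.99, 0.9999]:
    t = np.quantile(U, q)
    cond_mean[q] = V[U >= t].mean()
    print(f"q={q}:  E[V | U >= t] = {cond_mean[q]:+.3f}")
```

Mild conditioning (q = 0.5) still selects for genuinely higher V, but in the deep tail the conditional mean collapses toward zero; with light-tailed X (e.g. Gaussian) it would instead keep rising.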

Suppose an agent has this altruistic empowerment objective, and the problem of getting an objective into the agent has been solved.

Wouldn't it be maximized by forcing the human to sit in front of a box that encrypts their actions and uses the resulting stream to determine the fate of the universe? Then the human would be maximally "in control" of the universe but unlikely to create a universe that's good by human preferences.

I think this reflects two problems:

  • Most injective functions from human actions to world-states are not "human
... (read more)
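One way to make the worry concrete is to treat "control" as the mutual information between the human's actions and the resulting world-state. A minimal sketch (illustrative numbers; `mutual_information` is my own helper, and both channels are deterministic, so MI reduces to the entropy of the induced outcome distribution): an encryption-box bijection maximizes this measure despite bearing no relation to human preferences, while a more "natural" coarse influence scores lower.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(A; S) in bits from a list of (action, state) pairs, uniform over pairs."""
    n = len(pairs)
    pa = Counter(a for a, _ in pairs)
    ps = Counter(s for _, s in pairs)
    pas = Counter(pairs)
    return sum(c / n * math.log2((c / n) / ((pa[a] / n) * (ps[s] / n)))
               for (a, s), c in pas.items())

actions = range(8)

# "Encryption box": every action maps to a distinct world-state -- maximal capacity.
bijection = [(a, a * 37 % 64) for a in actions]

# "Natural" influence: many actions collapse to the same coarse outcome.
coarse = [(a, a // 4) for a in actions]

print(mutual_information(bijection))  # → 3.0 bits: maximally "in control"
print(mutual_information(coarse))     # → 1.0 bit
```

The bijection scores the full 3 bits even though the mapping is arbitrary, which is exactly the failure mode described above: injectivity of the action-to-state map is rewarded regardless of whether the reachable states are any good.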

FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.

4Ben Pace4mo
Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.

A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?

I don't think this post is aimed at the same concept(s).

not Nate or a military historian, but to me it seems pretty likely that an actor ~100 human-years more technologically advanced could get a decisive strategic advantage over the world.

  • In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today as well with command-and-control and various other capabilities
  • I would guess pure fusion weapons are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium
  • Currently on the cutting edge, the most a
... (read more)

There's a clarification by John here. I heard it was going to be put on Superlinear but unclear if/when.

Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they don't:

  • messy and non-robust maps between any clean concept and what we actually care about, such that more of the difficulty in research is in figuring out the map. The Standard Model of physics describes all the important physics behind protein folding, but we actually needed to invent AlphaFold.
  • The True Name doesn't quite represent what we care about. Tiling agents is a True Name for agents building successors, but we don't care that agents
... (read more)

Were any cautious people trying empirical alignment research before Redwood/Conjecture?

3Ajeya Cotra10mo
Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn't like that work and so wouldn't consider it a counterexample to his view).

Do you have thoughts on when there are two algorithms that aren’t “doing the same thing” that fall within the same loss basin?

It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compe... (read more)
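The color-basis symmetry in this example is easy to verify for the linear part of such a circuit. A minimal numpy sketch (my own toy dimensions; with a nonlinearity between the modules, only permutations and scalings would survive): rotating the hidden basis and compensating downstream yields exactly the same function from visibly different weights.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, d_out = 5, 4, 3
W1 = rng.standard_normal((d_hidden, d_in))   # module 1: computes the "color" features
W2 = rng.standard_normal((d_out, d_hidden))  # module 2: reads them out

# Random orthogonal change of basis for the hidden "color" representation.
R, _ = np.linalg.qr(rng.standard_normal((d_hidden, d_hidden)))

x = rng.standard_normal(d_in)
out_a = W2 @ (W1 @ x)              # original parametrization
out_b = (W2 @ R.T) @ (R @ W1 @ x)  # rotated hidden basis, compensated downstream

print(np.allclose(out_a, out_b))   # → True: same function, different internal basis
```

Whether every point on the line segment between these two parametrizations also has low loss is the interesting part of the question; the sketch only shows the endpoints compute identical functions.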

4Vivek Hebbar1y
From this paper [], "Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)". So for overparameterized nets, the answer is probably:

  • There is only one solution manifold, so there are no separate basins. Every solution is connected.
  • We can salvage the idea of "basin volume" as follows:
    • In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
    • In the dimensions parallel to the manifold, ask "how far can I move before it stops being the 'same function'?". If we define "sameness" as "same behavior on the validation set",[1] then this means looking at the Jacobian of that behavior in the plane of the manifold.
    • Multiply the two hypervolumes to get the hypervolume of our "basin segment" (very roughly, the region of the basin which drains to our specific model).

1. ^ There are other "sameness" measures which look at the internals of the model; I will be proposing one in an upcoming post.
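The perpendicular part of that recipe can be sketched with a toy quadratic loss (my own illustrative Hessian): a zero eigendirection stands in for the solution manifold, and a Gaussian-integral estimate of the cross-section comes from the nonzero eigenvalues.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T H w with one flat direction (zero eigenvalue)
# standing in for the solution manifold through the minimum.
H = np.diag([4.0, 1.0, 0.0])

eigs = np.linalg.eigvalsh(H)
flat = eigs < 1e-9           # directions along the manifold
perp = eigs[~flat]           # curvature perpendicular to it

# Gaussian-integral estimate: cross-section volume below a fixed loss cutoff is
# proportional to prod(1 / sqrt(lambda_i)) over the nonzero eigenvalues.
cross_section = np.prod(1.0 / np.sqrt(perp))
print(int(flat.sum()), cross_section)   # → 1 0.5
```

The parallel factor (how far one can move along the flat direction while staying the "same function") would come from the Jacobian step described above and is not modeled here.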

The ultimate goal of John Wentworth’s sequence "Basic Foundations for Agent Models" is to prove a selection theorem of the form:

  • Premise (as stated by John): “a system steers far-away parts of the world into a relatively-small chunk of their state space”
  • Desired conclusion: The system is very likely (probability approaching 1 with increasing model size / optimization power / whatever) consequentialist, in that it has an internal world-model and search process. Note that this is a structural rather than behavioral property.

John has not yet proved su... (read more)

Any updates on this?

Note that the particular form of "nonexistence of a representative agent" John mentions is an original result that's not too difficult to show informally, but hasn't really been written down formally either here or in the economics literature.

Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent util... (read more)

Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.

At first this claim seemed kind of wild, but there's a version of it I agree with.

It seems like conditional on the inner optimizer being corrigible, in the sense of having a goal that's a poin... (read more)

2Steve Byrnes1y
Hmm, I think it’s probably more productive to just talk directly about the “steered optimizer” thing, instead of arguing about what’s the best analogy with RLO. ¯\_(ツ)_/¯ BTW this is an old post; see my more up-to-date discussion here [], esp. Posts 8–10.

I think a lot of commenters misunderstand this post, or think it's trying to do more than it is. TLDR of my take: it's conveying intuition, not suggesting we should model preferences with 2D vector spaces.

The risk-neutral measure in finance is one way that "rotations" between probability and utility can be made:

  • under the actual measure P, agents have utility nonlinear in money (e.g. risk aversion), and probability corresponds to frequentist notions
  • under the risk-neutral measure Q, agents have utility linear in money, and probability is skewed towards losin
... (read more)
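The standard one-period binomial example makes this rotation concrete (textbook setup with my own illustrative numbers): under Q the option price is a plain expectation with utility linear in money, and it agrees with the no-arbitrage replication price, which uses no probabilities at all.

```python
# One-period binomial market: stock S0 moves to u*S0 or d*S0; the bond grows by r.
S0, u, d, r = 100.0, 1.2, 0.9, 0.05

def call_payoff(s, k=100.0):
    return max(s - k, 0.0)

# Risk-neutral probability: the unique q making the discounted stock a martingale.
q = (1 + r - d) / (u - d)          # = 0.5 here

# Price as a discounted Q-expectation (linear "utility", skewed probability).
price_q = (q * call_payoff(u * S0) + (1 - q) * call_payoff(d * S0)) / (1 + r)

# Replication check: delta shares plus b bonds matching the payoff in both states.
delta = (call_payoff(u * S0) - call_payoff(d * S0)) / ((u - d) * S0)
b = (call_payoff(u * S0) - delta * u * S0) / (1 + r)
price_rep = delta * S0 + b

print(price_q, price_rep)          # the two prices coincide
```

Note that q is pinned down by u, d, and r alone: the "rotation" moved all the agent's risk attitude out of the utility function and into the measure.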

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything without having a utility function anywhere, which the expected values are "expectations of". 

A utility function is a gadget for turning probability distributions into expected values. This object makes sense in ... (read more)

I think we need to unpack "sufficiently aligned"; here's my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, and B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI that, conditional on a proposal looking promising, is likely to be actually correct.

  • A system that produces a random 10000-bit string that looks promising to a human reviewe
... (read more)
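The arithmetic behind this setup is just exponent bookkeeping (using the comment's assumed counts, which are exact powers of two, so bit lengths give exact log-probabilities):

```python
A = 2**10000   # all 10000-bit strings
B = 2**200     # strings that look promising to a human reviewer (assumed count)
C = 2**100     # strings that are actually correct (assumed count)

# A sampler uniform over promising-looking proposals: chance of actual correctness.
log2_p_correct_given_promising = C.bit_length() - B.bit_length()   # = -100

# A sampler uniform over all strings: chance of even looking promising.
log2_p_promising = B.bit_length() - A.bit_length()                 # = -9800

print(log2_p_correct_given_promising, log2_p_promising)            # → -100 -9800
```

So "sufficiently aligned" has to mean the system's conditional distribution over promising-looking proposals is astronomically tilted toward the correct 2^100, not anything like uniform.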
3Paul Christiano1y
Is your story:

1. AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
2. Actually if a human was trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.

It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations "fool us" is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
1Vivek Hebbar1y
Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)?  Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
1Buck Shlegeris3y
If the linked SSC article is about the aestivation hypothesis, see the rebuttal here [].