All of Thomas Kwa's Comments + Replies

The "surgical model edits" section should also have a subsection on editing model weights. For example there's this paper on removing knowledge from models using multi-objective weight masking.

See also Holtman’s neglected result.

Does anyone have a technical summary? This sounds pretty exciting, but the paper is 35 pages and I can't find a summary anywhere that straightforwardly gives a formal description of the setting, explains why it satisfies the desiderata it does, and says what this means for the broader problem of reflective stability in shutdownable agents.

2Koen Holtman3d
Fun to see this is now being called 'Holtman's neglected result'. I am currently knee-deep in a project to support EU AI policy making, so I have no time to follow the latest agent foundations discussions on this forum any more, and I never follow Twitter, but briefly:

I can't fully fault the world for neglecting 'Corrigibility with Utility Preservation', because it is full of a lot of dense math. I wrote two followup papers to it which present the same results with more accessible math. For these I am a bit more upset that they have been somewhat neglected in the past, but if people now stop neglecting them, great!

The best technical summary of 'Corrigibility with Utility Preservation' may be my sequence on counterfactual planning, which shows that its corrigible agents can also be understood as agents that do utility maximisation in a pretend/counterfactual world model. For more references to the body of mathematical work on corrigibility, as written by me and others, see this comment.

In the end, the question of whether corrigibility is solved also depends on two counter-questions: what kind of corrigibility are you talking about, and what kind of 'solved' are you talking about? If you feel that certain kinds of corrigibility remain unsolved for certain values of unsolved, I might actually agree with you. See the discussion about universes containing an 'Unstoppable Weasel' in the Corrigibility with Utility Preservation paper.

I spent a good hour or two reading the construction and proposed solution of the paper; here's my attempted explanation with cleaned up notation.

Basically, he considers a setting with four actions a, b, c, d and a real-valued state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0, and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s <= 0.

The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise
for some constant L >= 0. 
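A minimal sketch of this toy setting in code (my notation; the reward magnitudes are made-up placeholders that only respect the stated ordering):

```python
# Toy model of the four-action setting: real-valued state s, actions a, b, c, d.
# Reward values are illustrative; only the ordering from the text matters.
def reward(s, action):
    if s > 0:
        return {"a": 2.0, "b": 1.0, "c": 1.0, "d": 0.0}[action]
    return {"d": 0.0, "c": -1.0, "b": -1.0, "a": -2.0}[action]

def step(s, action, L=0.5):
    """Transition rule: b and c perturb the countdown by +/- L while s > 0."""
    if action == "b" and s > 0:
        return s - 1 + L
    if action == "c" and s > 0:
        return s - 1 - L
    return s - 1
```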

The ... (read more)

There has been some spirited debate on Twitter about it which might be relevant:

Suppose that we are selecting for V + X, where V is true utility and X is error. If our estimator is unbiased (E[X | V = v] = 0 for all v) and X is light-tailed conditional on any value of V, do we have E[V | V + X = t] → ∞ as t → ∞?

No; here is a counterexample. Suppose that , and  when , otherwise . Then I think .

This is worrying because in the case where V and X are independent and normally distributed, we do get infinite V. Merely making the error *smaller* for large v... (read more)
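A quick numerical check of the light-tailed case (my sketch, not from the thread): selecting hard on the proxy V + X with independent Gaussian error still drives true utility V well above the population mean.

```python
import random

random.seed(0)
# Draw (true utility V, error X) pairs; select the top 0.1% by proxy V + X.
samples = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200_000)]
top = sorted(samples, key=lambda vx: vx[0] + vx[1], reverse=True)[:200]
mean_v_top = sum(v for v, _ in top) / len(top)
# With Gaussian V and X, E[V | V + X = t] = t / 2, so the selected slice
# has true utility far above the population mean of 0.
```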

We might want to keep our AI from learning a certain fact about the world, like particular cognitive biases humans have that could be used for manipulation. But a sufficiently intelligent agent might discover this fact despite our best efforts. Is it possible to find out when it does this through monitoring, and trigger some circuit breaker?

Evals can measure the agent's propensity for catastrophic behavior, and mechanistic anomaly detection hopes to do better by looking at the agent's internals without assuming interpretability, but if we can measure the a... (read more)

Eight beliefs I have about technical alignment research

Written up quickly; I might publish this as a frontpage post with a bit more effort.

  1. Conceptual work on concepts like “agency”, “optimization”, “terminal values”, “abstractions”, “boundaries” is mostly intractable at the moment.
    • Success via "value alignment" alone (a system that understands human values, incorporates these into some terminal goal, and mostly maximizes for this goal) seems hard unless we're in a very easy world, because this involves several fucked concepts.
  2. Whole brain emulation probably w
... (read more)

The independent-steps model of cognitive power

A toy model of intelligence implies that there's an intelligence threshold above which minds don't get stuck when they try to solve arbitrarily long/difficult problems, and below which they do get stuck. I might not write this up otherwise due to limited relevance, so here it is as a shortform, without the proofs, limitations, and discussion.

The model

A task of difficulty n is composed of  independent and serial subtasks. For each subtask, a mind of cognitive power  knows  differ... (read more)

I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.

It's always trickier to reason about post-hoc, but some of the observations could be valid, non-cherry-picked parallels between evolution and deep learning that predict further parallels.

I think looking at which inspired more DL capabilities advances is not perfect methodology either. It looks like evolution predicts only general facts whereas the brain a... (read more)

I'm finally engaging with this after having spent too long afraid of the math. Initial thoughts:

  • This result is really impressive and I'm surprised it hasn't been curated. My guess is that it's not presented in the most accessible way, so maybe it deserves a distillation.
  • The conclusion isn't as strong or clean as I'd want. It's not clear how to think about orbit-level power-seeking. I'd be excited about a stronger conclusion but wouldn't know how to get it.
  • I found the above sentence from the explainer interesting: "There is no possible way to combine EU-bas
... (read more)

Disagree on several points. I don't need future AIs to satisfy some mathematically simple description of corrigibility, just for them to be able to solve uploading or nanotech or whatever without preventing us from changing their goals. This laundry list by Eliezer of properties like myopia, shutdownability, etc. seems likely to make systems more controllable and less dangerous in practice, and while not all of them are fully formalized it seems like there are no barriers to achieving these properties in the course of ordinary engineering. If there is some... (read more)

This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.

Gradient hacking in supervised learning is generally recognized by alignment people (including the author of that article) to not be a likely problem. A recent post by people at Redwood Research says "This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative A... (read more)

4Richard Ngo1mo
FWIW I think that gradient hacking is pretty plausible, but it'll probably end up looking fairly "prosaic", and may not be a problem even if it's present.

By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).

Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steerin

... (read more)

I'm very sympathetic to this complaint; I think that these arguments simply haven't been made rigorously, and at this point it seems like Nate and Eliezer are not in an epistemic position where they're capable of even trying to do so. (That is, they reject the conception of "rigorous" that you and I are using in these comments, and therefore aren't willing to formulate their arguments in a way which moves closer to meeting it.)

You should look at my recent post on value systematization, which is intended as a framework in which these claims can be discussed more clearly.

This is a meta-point, but I find it weird that you ask what "caring about something" is according to CS but don't ask what "corrigibility" is, despite the existence of multiple examples of goal-oriented systems and some relatively good formalisms (we disagree on whether expected utility maximization is a good model of real goal-oriented systems, but we all agree that if we met an expected utility maximizer we would find its behavior pretty much goal-oriented), while corrigibility is purely a product of the imagination of one particular Eliezer Yudkowsky, born ... (read more)

Does evolution ~= AI have predictive power apart from doom?

Evolution analogies predict a bunch of facts that are so basic they're easy to forget about, and even if we have better theories for explaining specific inductive biases, the simple evolution analogies should still get some weight for questions we're very uncertain about.

  • Selection works well to increase the thing you're selecting on, at least when there is also variation and heredity
  • Overfitting: sometimes models overfit to a certain training set; sometimes species adapt to a certain ecological nich
... (read more)

I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.

I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.

Or, what's more to the point -- I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.

Combining some of yours and Habryka's comments, which see... (read more)


Maybe the reward models are expressive enough to capture all patterns in human preferences, but it seems nice to get rid of this assumption if we can. Scaling laws suggest that larger models perform better (in the Gao paper there is a gap between 3B and 6B reward model) so it seems reasonable that even the current largest reward models are not optimal.

I guess it hasn't been tested whether DPO scales better than RLHF. I don't have enough experience with these techniques to have a view on whether it does.

DPO seems like a step towards better and more fine-grained control over models than RLHF, because it removes the possibility that the reward model underfits.

1Lawrence Chan3mo
I suspect the underfitting explanation is probably a lot of what's going on given the small models used by the authors. But in the case of larger, more capable models, why would you expect it to be underfitting instead of generalization (properly fitting)? 

It seems like there's some intuition underlying this post for why the wildfire spark of strategicness is possible, but there is no mechanism given. What is this mechanism, and in what toy cases do you see a wildfire of strategicness? My guess is something like

  • Suppose one part of your systems contains a map from desired end-states to actions required to achieve those ends, another part has actuators, and a third part starts acting strategically. Then the third part needs only to hook together the other two parts with its goals to become an actualizing agent.

This doesn't really feel like a wildfire though, so I'm curious if you have something different in mind.

1Tsvi Benson-Tilsen5mo
Basically just this? It would be hooking a lot more parts together. What makes it seem wildfirey to me is:

  1. There's a bunch of work to be done, of the form "take piece of understanding X, and learn to use X by incorporating it into your process for mapping desired end-states to actions required to achieve those ends, so that you can achieve whatever end-states ought to be achievable using an understanding of X".
  2. This work could accelerate itself, in a sort of degenerate version of recursive self-improvement. Where RSI involves coming up with new ideas, the wildfire of strategicness just involves figuring out how to recruit understanding that's already lying around. It's an autocausal process that grows faster the bigger it is, until it eats everything.

So e.g. take the following scenario. (This isn't supposed to be realistic, just supposed to be wildfirey. This is a pretty deficient scenario, because it's not making clear what properties the Spark has. The Spark seems to have a grasp of objects and propositions, and seems to have some strategic awareness or something that makes it immediately try to gain control over stuff, even though it doesn't know about stuff. But hopefully it gestures at wildfireness.)

First the Spark interfaces somehow with the programming module. It uses the programming module to look around and see what other stuff is lying around in the computing environment. Then it finds the "play with stuff" module. It interfaces with the play module, and combining that with the programming module, the Spark starts to play with its local environment, trying to bypass its compute budget restrictions. It doesn't figure out how to really hack much, but it at least figures out that it can spoof requests as coming from other modules that it interfaces with. It doesn't have direct access to the Dynamics module, but the Play module does have access to World, which has access to Dynamics. So the Spark uses Programming to construct a nested spoofed re

I commented on the original post last year regarding the economics angle:

Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent utility function in general. A bounty also hasn't found any such papers.

Based on this lit review and the Wikipedia page and ChatGPT [1], I'm 90% sure that "representative age... (read more)

Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:

Downvoted, this is very far from a well-structured argument, and doesn't give me intuitions I can trust either

1Raymond Arnold6mo
I didn't downvote but didn't upvote and generally wish I had an actual argument to link to when discussing this concept.

I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.

edit: The proof is easy. Let f1 and f2 be two such indistinguishable functions that you place positive probability on, F be a random variable for the function, and F' be F but with all probability mass for f2 replaced by f1. Then ... (read more)

  • Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending from self-deception advantages weight methods, but these seem uncommon.
  • I thought briefly about the Ilharco et al paper and am very impressed by it as well.
  • Thanks for linking to the resources.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

4Alex Turner7mo
Weight vectors are derived through fine-tuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (eg via RLHF) when writing why you were excited about activation additions, I don't see how this paper changes the balance very much? (I wrote my thoughts here in Activation additions have advantages over (RL/supervised) finetuning.)

I think the main additional piece of information given by the paper is the composability of finetuned edits unlocking a range of finetuning configurations, which grows exponentially with the number of composable edits. But I personally noted that finetuning enjoys this benefit in the original version of the post.

There's another strength which I hadn't mentioned in my writing, which is that if you can finetune in the opposite direction of the intended behavior (like you can make a model less honest somehow), and then subtract that task vector, you can maybe increase honesty, even if you couldn't just naively finetune that honesty into the model.[1]

But, in a sense, task vectors are "still in the same modalities we're used to." Activation additions jolted me because they're just... a new way[2] of interacting with models! There's been way more thought and research put into finetuning and its consequences, relative to activation engineering and its alignment implications. I personally expect activation engineering to open up a lot of affordances for model-steering.

  1. ^ This is a kinda sloppy example because "honesty" probably isn't a primitive property of the network's reasoning. Sorry.
  2. ^ To be very clear about the novelty of our contributions, I'll quote the "Summary of relationship to prior work" section: But this "activation engineering" modality is relatively new, and relatively unexplored, especially in its alignment implications. I found and cited two papers adding activation vectors to LMs to steer them,

I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want e.g. GPT-4 can answer questions with only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far O... (read more)

3Alex Turner6mo
I phased out "algebraic value editing" for exactly that reason. Note that only the repository and prediction markets retain this name, and I'll probably rename the repo activation_additions.
2Dan H7mo
(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be spotted in the weights more easily than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don't see this as a strong reason to dismiss the paper. I would want a weight version of a method and an activation version of a method; they tend to have different strengths.

Note: If you're wanting to keep track of safety papers outside of LW/AF, papers including were tweeted on and posted on

Edit: I see passive disagreement but no refutation. The argument against weights was of the form "here's a strength activations have"; for it to be enough to dismiss the paper without discussion, that must be an extremely strong property to outweigh all of its potential merits, or it is a Pareto-improvement. Those don't seem corroborated or at all obvious.

This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.

Edit: I explain this view in a reply.

Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions. This is still in my all-time top 10.

What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training. 

SGD has inductive biases, but we'd have to actually engineer them to get high gold reward rather than high proxy reward when only trained on the proxy reward. In the Gao et al paper, optimization and overoptimization happened at the same relative rate in RL as in conditioning, so I think the null hypothesis is that training does about as well as conditioning. I'm pretty excited about work that improves on that paper to get higher gold reward while only having access to the proxy reward model.

I think the point still holds in mainline shard theory world, which in m... (read more)

That section is even more outdated now. There's nothing on interpretability, Paul's work now extends far beyond IDA, etc. In my opinion it should link to some other guide.

1Oliver Habryka7mo
Yeah, does sure seem like we should update something here. I am planning to spend more time on AIAF stuff soon, but until then, if someone has a drop-in paragraph, I would probably lightly edit it and then just use whatever you send me/post here.

This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work.

Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.

I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance-- my current guess is neutral-- but I do see these similarities:

  • It's incredibly difficult and incentive-incompatible with existing groups in power
  • There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
  • There are some obvious negative effects; potential overhangs or greater inc
... (read more)

There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals

IMO making the field of alignment 10x larger or evals do not solve a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).

2Alex Turner7mo
Why does this have to be true? Can't governments just compensate existing AGI labs for the expected commercial value of their foregone future advances due to indefinite pause? 

I'm planning to write a post called "Heavy-tailed error implies hackable proxy". The idea is that when you care about V and are optimizing for a proxy U = V + X, where X is error, Goodhart's Law sometimes implies that optimizing hard enough for U causes V to stop increasing.

A large part of the post would be proofs about what the distributions of X and V must be for this to happen, where X and V are independent random variables with mean zero. It's clear that

  • X must be heavy-tailed (or long-tailed or som
... (read more)
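A numerical sketch of the claim (my illustration, with a centered Pareto variable standing in for "heavy-tailed" error): under hard selection on the proxy, heavy-tailed error dominates the top slice, so true utility stalls near zero, while light-tailed error does not.

```python
import random

random.seed(0)
N, K = 200_000, 200  # sample size; keep the top K points by proxy U = V + X

def top_v_mean(noise):
    """Mean true utility V among the top-K points by proxy value V + X."""
    pts = [(random.gauss(0, 1), noise()) for _ in range(N)]
    top = sorted(pts, key=lambda vx: vx[0] + vx[1], reverse=True)[:K]
    return sum(v for v, _ in top) / K

light = top_v_mean(lambda: random.gauss(0, 1))             # light-tailed error
heavy = top_v_mean(lambda: random.paretovariate(1.5) - 3)  # heavy-tailed, mean ~0
# Extreme proxy values under heavy-tailed error are almost entirely error,
# so the selected slice's true utility stays near the population mean of 0;
# under Gaussian error, selection on the proxy still buys real utility.
```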
Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when it's definitely not heavy-tailed it's monotonic, for Regressional Goodhart. Jacob probably has more detailed takes on this than me. In any event my intuition is this seems unlikely to be the main reason for overoptimization: I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent.

Suppose an agent has this altruistic empowerment objective, and the problem of getting an objective into the agent has been solved.

Wouldn't it be maximized by forcing the human in front of a box that encrypts its actions and uses the resulting stream to determine the fate of the universe? Then the human would be maximally "in control" of the universe but unlikely to create a universe that's good by human preferences.

I think this reflects two problems:

  • Most injective functions from human actions to world-states are not "human
... (read more)

FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.

4Ben Pace10mo
Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.

A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?

I don't think this post is aimed at the same concept(s).

not Nate or a military historian, but to me it seems pretty likely for a ~100 human-years more technologically advanced actor to get decisive strategic advantage over the world.

  • In military history it seems pretty common for some tech advance to cause one side to get a big advantage. This seems to be true today as well with command-and-control and various other capabilities
  • I would guess pure fusion weapons are technologically possible, which means an AI sophisticated enough to design one can get nukes without uranium
  • Currently on the cutting edge, the most a
... (read more)

There's a clarification by John here. I heard it was going to be put on Superlinear but unclear if/when.

Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they don't:

  • messy and non-robust maps between any clean concept and what we actually care about, such that more of the difficulty in research is in figuring out the map. The Standard Model of physics describes all the important physics behind protein folding, but we actually needed to invent AlphaFold.
  • The True Name doesn't quite represent what we care about. Tiling agents is a True Name for agents building successors, but we don't care that agents
... (read more)

Were any cautious people trying empirical alignment research before Redwood/Conjecture?

3Ajeya Cotra1y
Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn't like that work and so wouldn't consider it a counterexample to his view).

Do you have thoughts on when there are two algorithms that aren’t “doing the same thing” that fall within the same loss basin?

It seems like there could be two substantially different algorithms which can be linearly interpolated between with no increase in loss. For example, the model is trained to classify fruit types and ripeness. One module finds the average color of a fruit (in an arbitrary basis), and another module uses this to calculate fruit type and ripeness. The basis in which color is expressed can be arbitrary, since the second module can compe... (read more)

4Vivek Hebbar1y
From this paper, "Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)"

So for overparameterized nets, the answer is probably:

  • There is only one solution manifold, so there are no separate basins. Every solution is connected.
  • We can salvage the idea of "basin volume" as follows:
    • In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
    • In the dimensions parallel to the manifold, ask "how far can I move before it stops being the 'same function'?". If we define "sameness" as "same behavior on the validation set",[1] then this means looking at the Jacobian of that behavior in the plane of the manifold.
    • Multiply the two hypervolumes to get the hypervolume of our "basin segment" (very roughly, the region of the basin which drains to our specific model).

  1. ^ There are other "sameness" measures which look at the internals of the model; I will be proposing one in an upcoming post.

The ultimate goal of John Wentworth’s sequence "Basic Foundations for Agent Models" is to prove a selection theorem of the form:

  • Premise (as stated by John): “a system steers far-away parts of the world into a relatively-small chunk of their state space”
  • Desired conclusion: The system is very likely (probability approaching 1 with increasing model size / optimization power / whatever) consequentialist, in that it has an internal world-model and search process. Note that this is a structural rather than behavioral property.

John has not yet proved su... (read more)

Any updates on this?

Note that the particular form of "nonexistence of a representative agent" John mentions is an original result that's not too difficult to show informally, but hasn't really been written down formally either here or in the economics literature.

Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent util... (read more)

Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.

At first this claim seemed kind of wild, but there's a version of it I agree with.

It seems like conditional on the inner optimizer being corrigible, in the sense of having a goal that's a poin... (read more)

2Steve Byrnes1y
Hmm, I think it’s probably more productive to just talk directly about the “steered optimizer” thing, instead of arguing about what’s the best analogy with RLO. ¯\_(ツ)_/¯ BTW this is an old post; see my more up-to-date discussion here, esp. Posts 8–10.

I think a lot of commenters misunderstand this post, or think it's trying to do more than it is. TLDR of my take: it's conveying intuition, not suggesting we should model preferences with 2D vector spaces.

The risk-neutral measure in finance is one way that "rotations" between probability and utility can be made:

  • under the actual measure P, agents have utility nonlinear in money (e.g. risk aversion), and probability corresponds to frequentist notions
  • under the risk-neutral measure Q, agents have utility linear in money, and probability is skewed towards losin
... (read more)
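A one-period binomial sketch of this rotation (my numbers, purely illustrative): instead of pricing with a concave utility under the true measure P, reweight the probabilities into Q so that plain expected value gives the price.

```python
# A stock at 100 moves to 120 or 80; risk-free rate 0. Under Q, the
# up-probability is chosen so the stock itself is priced linearly, and then
# any payoff is priced by plain (linear-utility) expectation under Q.
u, d, s0, r = 1.2, 0.8, 100.0, 0.0
q = ((1 + r) - d) / (u - d)  # risk-neutral up-probability, here 0.5

strike = 100.0
payoff_up = max(s0 * u - strike, 0.0)    # 20.0
payoff_down = max(s0 * d - strike, 0.0)  # 0.0
price = (q * payoff_up + (1 - q) * payoff_down) / (1 + r)
# Note the real-world probability P of the up-move never enters: all the
# "utility curvature" has been rotated into the skewed measure Q.
```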

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything without having a utility function anywhere, which the expected values are "expectations of". 

A utility function is a gadget for turning probability distributions into expected values. This object makes sense in ... (read more)

I think we need to unpack "sufficiently aligned"; here's my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI. The thesis of the post requires that we can make a "sufficiently aligned" AI that, conditional on a proposal looking promising, is likely to be actually correct.

  • A system that produces a random 10000-bit string that looks promising to a human reviewe
... (read more)
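The base-rate arithmetic in the setup above, under its assumed counts:

```python
# Counts assumed in the comment above: A total strings, B look promising to a
# human reviewer, C are actually correct (with correct a subset of promising).
A, B, C = 2**10000, 2**200, 2**100

# A sampler that is uniform over promising-looking proposals is correct with
# probability C / B = 2^-100, so "sufficiently aligned" means concentrating
# probability mass on the correct subset far beyond this base rate.
p_correct_given_promising = C / B
```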
3Paul Christiano2y
Is your story:

  1. AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
  2. Actually if a human was trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.

It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations "fool us" is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
1Vivek Hebbar2y
Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)?  Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
1Buck Shlegeris3y
If the linked SSC article is about the aestivation hypothesis, see the rebuttal here.