# Recent Discussion

This post is the result of work I did with Paul Christiano on the ideas in his “Teaching ML to answer questions honestly instead of predicting human answers” post. In addition to expanding upon what is in that post in terms of identifying numerous problems with the proposal there and identifying ways in which some of those problems can be patched, I think that this post also provides a useful window into what Paul-style research looks like from a non-Paul perspective.

Recommended prior reading: “A naive alignment strategy and optimism about generalization” and “Teaching ML to answer questions honestly instead of predicting human answers” (though if you struggled with “Teaching ML to answer questions honestly,” I reexplain things in a more precise way here that might be clearer...

4Rohin Shah5dDoesn't this mean that the two heads have to be literally identical in their outputs? It seems like at this point your prior is "generate parameters randomly under the constraint that the two heads are identical", which seems basically equivalent to having a single head and generating parameters randomly, so it seems unintuitive that this can do anything useful. (Disclaimer: I skimmed the post because I found it quite challenging to read properly, so it's much more likely than usual that I failed to understand a basic point that you explicitly said somewhere.)
2Evan Hubinger5dThat's not what the prior looks like—the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.” Thus, you don't need to pay for the complexity of satisfying the condition, only the complexity of specifying it (as long as you're content with the simplest possible way to satisfy it). This is why the two-step nature of the algorithm is necessary—the prior you're describing is what would happen if you used a one-step algorithm rather than a two-step algorithm (which I agree would then not do anything).
4Rohin Shah4dHmm, I'm not thinking about the complexity part at all right now; I'm just thinking mechanically about what is implied by your equations. I'm not sure exactly what you mean by the parameters specifying some condition. I thought the condition was specified upfront by the designer (though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters). As far as I can tell, the intended condition is "the two heads are identical" in the dataset-less case. Looking directly at the math, the equations you have are: My interpretation is: 1. Generate θ1 randomly. 2. Generate θ2 randomly from θ1, subject to the constraint that the two heads output the same value on all possible inputs. Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters. In which case you could just have generated parameters for the first head, and then copied them over into the second head, rather than go through this complicated setup. Now, there isn't actually a bijection between model parameters and resulting function. But it seems like the only difference is that you make it more likely that you sample heads which have lots of different implementations in model parameters, i.e. you're doubling the strength of the neural net prior (and that's the only effect). This seems undesirable?
0hogwash93dAFAIK, I always imagined the idea behind this objective function to be quite similar to contrastive learning, where you have two networks (or equivalently two sets of parameters), and the goal is to maximize agreement for pairs of inputs to each network that have the same ground truth class/label (conversely maximize disagreement for pairs that are different). That in mind, there are various papers (e.g. [https://proceedings.neurips.cc/paper/2020/file/4c2e5eaae9152079b9e95845750bb9ab-Paper.pdf] ) that explore the possibility of "collapsed" solutions like the one you mentioned (where both networks are learning the same mapping, such that there's less benefit to propagating any examples through two networks), which makes this something that we want to minimize. In practice, though, this has been found to occur rarely (c.f. [1] [https://generallyintelligent.ai/understanding-self-supervised-contrastive-learning.html#why-batch-normalization-is-critical-in-byol-mode-collapse] ). Nonetheless, since reading Paul's statement about the problem of the instrumental model [https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66], I've been thinking about issues that might arise with the proposed solution, even though similar approaches (i.e. the contrastive training objective) have proven effective for robustness in general (e.g. against adversarial perturbations, data limited scenarios). If I were committed to this stance, I would agree somewhat with the desire to explore alternatives, and I have thought about the extent to which some sort of reconstruction loss could be introduced; this is where the goal might instead be to "maximize agreement" with a set of non-trivial observations/facts that are guaranteed to be more "objective" (somehow) than the original training data (one inspiration being that reconstruction losses in vision deep learning papers like this one [https://arxiv.org/abs/1710.09829] often turn out to be good regularizers). 
So far I haven't had any pro
2Rohin Shah3dI haven't read the paper, but in contrastive learning, aren't these solutions prevented by the negative examples?
0hogwash93dIt makes sense that negative pairs would help to a large extent, but not all contrastive papers used negative examples, like BYOL (ref [https://arxiv.org/abs/2102.06810]). Edit: but now I'm realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.
2Rohin Shah3dIf memory serves, with BYOL you are using current representations of an input x1 to predict representations of a related input x2, but the representation of x2 comes from an old version of the encoder. So, as long as you start with a non-collapsed initial encoder, the fact that you are predicting a past encoder which is non-collapsed ensures that the current encoder you learn will also be non-collapsed. (Mostly my point is that there are specific algorithmic reasons to expect that you don't get the collapsed solutions, it isn't just a tendency of neural nets to avoid collapsed solutions.) No worries, I think it's still a relevant example for thinking about "collapsed" solutions.
4Evan Hubinger4dThe only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it's just that some are more/less likely now. Yep, that's exactly right. That's definitely not what should happen in that case. Note that there is no relation between θ1 and f1 or θ2 and f2—both sets of parameters contribute equally to both heads. Thus, θ1 can enforce any condition it wants on θ2 by leaving some particular hole in how it computes f1 and f2 and forcing θ2 to fill in that hole in such a way to make θ1's computation of the two heads come out equal.
4Rohin Shah3dYeah, sorry, I wasn't clear here -- I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations until both θ1 and θ2 are chosen, and then reasoning about the thing you get afterwards. Yes, I think I understand that. (I want to note that since θ1 is chosen randomly, it isn't "choosing" the condition on θ2; rather the wide distribution over θ1 leads to a wide distribution over possible conditions on θ2. But I think that's what you mean.) I think you misunderstood what I was claiming. Let me try again, without using the phrase "enforcing the constraint", which I think was the problem. Imagine there was a bijection between model parameters and resulting function. In Stage 1 you sample θ1 randomly. In Stage 2, you sample θ2, such that it fills in the holes in f1 and f2 to make f1 and f2 compute the same function. By our bijection assumption, the parameters in f1 must be identical to the parameters in f2. Thus, we can conclude the following: 1. If θ1 contained a parameter from f1 and f2 in the same location (e.g. it includes the weight at position (3, 5) in layer 3 in both f1 and f2), then it must have assigned the same value to both of them. 2. If θ1 contained a parameter from f1 and θ2 contained the corresponding parameter from f2, then θ2 must have set that parameter to the same value as in θ1. 3. If θ2 contained a parameter from f1 and f2 in the same location, then it must have assigned the same value to both of them. These constraints are necessary and sufficient to satisfy the overall constraint that f1=f2, and therefore any other parameters in θ2 are completely unconstrained and are set according to the original neural net prior. So it seems to me that (1) any parameters not in f1 or f2 are set according to the original neural net prior, and (2) parameters in f1 must be identical to the corresponding parameters in f2, but their values are chosen according to the neural net prior. This seems
2Evan Hubinger3dSure, makes sense; theoretically, that should be isomorphic. This seems like a case where I'm using the more constructive formulation of simulating out the equations and you're thinking about it in a more complexity-oriented framing. Of course, again, they should be equivalent. I'm not sure what you mean by this part: f1 and f2 are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in f1.” I don't think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if f1 and f2 were separate models such that they couldn't reuse weights between them, then none of the complexity arguments that I make in the post would go through. I'm happy to accept that there are ways of setting θ1 (e.g. just make f1 and f2 identical) such that the rest of the parameters are unconstrained and just use the neural net prior. However, that's not the only way of setting θ1, and not the most complexity-efficient, I would argue. In the defender's argument, θ1 sets all the head-specific parameters for both f1 and f2 to enforce that f1 computes f+ and f2 computes f−, and also sets all the shared parameters for everything other than the human model, while leaving the human model to θ2, thus enforcing that θ2 specify a human model that's correct enough to make f+=f− without having to pay any extra bits to do so.
4Rohin Shah2dI assumed that when you talked about a model with "different heads" you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don't share any weights, and those separate sequences of layers were the "heads" f1 and f2. (I'm pretty sure that's how the term is normally used in ML.) I might benefit from an example architecture diagram where you label what θ1, θ2, f1, f2 are. I did realize that I was misinterpreting part of the math -- the ∀x,q is quantifying over inputs to the overall neural net, rather than to the parts-which-don't-share-weights. My argument only goes through if you quantify the constraint over all inputs to the parts-which-don't-share-weights. Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don't-share-weights can be generated by some x,q (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don't-share-weights. This seems to suggest that f+ and f− are different functions, i.e. there's some input on which they disagree. But then θ2 has to make them agree on all possible x, q. So is the idea that there are some inputs to f+, f− that can never be created with any possible x,q? That seems... strange (though not obviously impossible).
4Evan Hubinger2dYep, that's what I mean. Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing—the only conditioning in the prior is θ2 conditioning on θ1. If we look at the intended model, however, θ1 includes all of the parts-which-don't-share-weights, while θ2 is entirely in the part-which-shares-weights. Technically, I suppose, you can just take the prior and condition on anything you want—but it's going to look really weird to condition on the part-which-shares-weights having some particular value without even knowing which parts came from θ1 and which came from θ2. I do agree that, if θ1 were to specify the entire part-which-shares-weights and leave θ2 to fill in the parts-which-don't-share-weights, then you would get exactly what you're describing where θ2 would have a doubly-strong neural net prior on implementing the same function for both heads. But that's only one particular arrangement of θ1—there are lots of other θ1s which induce very different distributions on θ2. Note that the inputs to f+,f− are deduced statements, not raw data. They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent X×Q→A maps.

> Yep, that's what I mean.

Then I'm confused what you meant by

> I'm not sure what you mean by this part: f1 and f2 are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in f1.”

Seems like if the different heads do not share weights then "the parameters in f1" is perfectly well-defined?

> Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing

Yeah, sorry, by "conditioning" there I meant "assuming that the algorithm correctly chose the right world mod...
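
To make the one-step versus two-step distinction in this thread concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than the construction from the post: the parameter blocks are tiny integers, the priors are uniform, and the `heads` function is invented so that one value of θ1 makes both heads trivially equal while the other leaves a "hole" that only one value of θ2 can fill. It enumerates the marginal over θ1 under each scheme.

```python
from fractions import Fraction
from itertools import product

INPUTS = range(4)   # toy input space (assumption)
THETA1 = range(2)   # toy values for the first parameter block (assumption)
THETA2 = range(4)   # toy values for the second parameter block (assumption)


def heads(theta1, theta2, x):
    """Return (f1, f2) on input x; both heads may read both parameter blocks."""
    if theta1 == 0:
        # theta1 makes both heads constant, so every theta2 passes the check.
        return 0, 0
    # theta1 leaves a hole: f2 matches f1 only if theta2 fills it with 0.
    return x % 4, (x + theta2) % 4


def agree(theta1, theta2):
    """True if the two heads give the same output on every input."""
    return all(f1 == f2 for f1, f2 in (heads(theta1, theta2, x) for x in INPUTS))


def two_step_marginal():
    """theta1 ~ uniform; then theta2 ~ uniform conditioned on agreement."""
    feasible = [t1 for t1 in THETA1 if any(agree(t1, t2) for t2 in THETA2)]
    return {t1: Fraction(1, len(feasible)) for t1 in feasible}


def one_step_marginal():
    """(theta1, theta2) ~ uniform jointly, conditioned on agreement."""
    pairs = [(t1, t2) for t1, t2 in product(THETA1, THETA2) if agree(t1, t2)]
    return {
        t1: Fraction(sum(1 for p in pairs if p[0] == t1), len(pairs))
        for t1 in THETA1
    }


if __name__ == "__main__":
    print("two-step:", two_step_marginal())
    print("one-step:", one_step_marginal())
```

In this toy, the two-step scheme gives the "hole-leaving" θ1 probability 1/2, while joint conditioning gives it only 1/5, because it is penalized for having a single satisfying θ2. That is one way to read the claim that the two-step prior pays for specifying the condition but not for satisfying it; whether the real prior behaves this way depends on details the toy does not model.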

I'm talking about these agents (LW thread here)

I'd love an answer either in operations (MIPS, FLOPS, whatever) or in dollars.

Follow-up question: How many parameters did their agents have?

I just read the paper (incl. appendix) but didn't see them list the answer anywhere. I suspect I could figure it out from information in the paper, e.g. by adding up how many neurons are in their LSTMs, their various other bits, etc. and then multiplying by how long they said they trained for, but I lack the ML knowledge to do this correctly.

Some tidbits from the paper:

For multi-agent analysis we took the final generation of the agent (generation 5) and created equally spaced checkpoints (copies of the neural network parameters) every 10 billion steps, creating a collection of 13 checkpoints.

This suggests 120 billion steps of...

5Answer by Daniel Kokotajlo3dI have a guesstimate for number of parameters, but not for overall compute or dollar cost: Each agent was trained on 8 TPUv3's, which cost about $5,000/mo according to a quick google, and which seem to produce 90 TOPS [https://en.wikipedia.org/wiki/Tensor_Processing_Unit], or about 10^14 operations per second. They say each agent does about 50,000 steps per second, so that means about 2 billion operations per step. Each little game they play lasts 900 steps if I recall correctly, which is about 2 minutes of subjective time they say (I imagine they extrapolated from what happens if you run the game at a speed such that the physics simulation looks normal-speed to us). So that means about 7.5 steps per subjective second, so each agent requires about 15 billion operations per subjective second. So... 2 billion operations per step suggests that these things are about the size of GPT-2, i.e. about the size of a rat brain [https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons]? If we care about subjective time, then it seems the human brain maybe uses 10^15 FLOP per subjective second [https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit#heading=h.e3k724n81me], which is about 5 OOMs more than these agents.
3Jaime Sevilla1dDo you mind sharing your guesstimate on number of parameters? Also, do you have per chance guesstimates on number of parameters / compute of other systems?
I did, sorry -- I guesstimated FLOP/step and then figured parameters is probably a bit less than 1 OOM less than that. But since this is recurrent maybe it's even less? IDK. My guesstimate is shitty and I'd love to see someone do a better one!
2Daniel Kokotajlo2dMichael Dennis tells me that population-based training typically sees strong diminishing returns to population size, such that he doubts that there were more than one or two dozen agents in each population/generation. This is consistent with AlphaStar I believe, where the number of agents was something like that IIRC... Anyhow, suppose 30 agents per generation. Then that's a cost of $5,000/mo x 1.3 months x 30 agents = $195,000 to train the fifth generation of agents. The previous two generations were probably quicker and cheaper. In total the price is probably, therefore, something like half a million dollars of compute? This seems surprisingly low to me. About one order of magnitude less than I expected. What's going on? Maybe it really was that cheap. If so, why? Has the price dropped since AlphaStar? Probably... It's also possible this just used less compute than AlphaStar did...
3gwern2dMakes sense given the spinning-top [https://arxiv.org/abs/2004.09468] topology of games. These tasks are probably not complex enough to need a lot of distinct agents/populations to traverse the wide part to reach the top where you then need little diversity to converge on value-equivalent models. One observation: you can't run SC2 environments on a TPU, and when you can pack the environment and agents together onto a TPU and batch everything with no copying, you use the hardware closer to its full potential [https://www.gwern.net/notes/Faster#gwern-notes-sparsity], see the Podracer [https://arxiv.org/abs/2104.06272#deepmind] numbers.
2Daniel Kokotajlo2dAlso for comparison, I think this means these models were about twice as big as AlphaStar. That's interesting.
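
For anyone who wants to check or adjust the back-of-the-envelope numbers in this thread, here is a small sketch that reproduces the arithmetic. All inputs are the rough figures quoted in the answers above (they are guesses, not values confirmed by the paper), and the variable names are made up for illustration.

```python
import math

# Rough inputs quoted in the thread (guesstimates, not confirmed figures).
HARDWARE_OPS_PER_SECOND = 1e14   # "about 10^14 operations per second"
AGENT_STEPS_PER_SECOND = 50_000  # "about 50,000 steps per second"
STEPS_PER_GAME = 900             # "each little game ... lasts 900 steps"
SUBJECTIVE_SECONDS_PER_GAME = 120  # "about 2 minutes of subjective time"
BRAIN_FLOP_PER_SECOND = 1e15     # cited human-brain estimate
COST_PER_AGENT_MONTH = 5_000     # dollars per month for one agent's TPUs
TRAINING_MONTHS = 1.3            # duration used in the cost estimate
AGENTS_PER_GENERATION = 30       # "suppose 30 agents per generation"

# Compute per environment step and per subjective second.
ops_per_step = HARDWARE_OPS_PER_SECOND / AGENT_STEPS_PER_SECOND
steps_per_subjective_second = STEPS_PER_GAME / SUBJECTIVE_SECONDS_PER_GAME
ops_per_subjective_second = ops_per_step * steps_per_subjective_second

# Gap to the brain estimate, in orders of magnitude.
gap_ooms = math.log10(BRAIN_FLOP_PER_SECOND / ops_per_subjective_second)

# Dollar cost of one generation under the stated assumptions.
generation_cost = COST_PER_AGENT_MONTH * TRAINING_MONTHS * AGENTS_PER_GENERATION

if __name__ == "__main__":
    print(f"ops per step:              {ops_per_step:.2e}")
    print(f"steps per subjective sec:  {steps_per_subjective_second}")
    print(f"ops per subjective second: {ops_per_subjective_second:.2e}")
    print(f"gap to brain estimate:     {gap_ooms:.1f} OOMs")
    print(f"cost of one generation:    ${generation_cost:,.0f}")
```

Changing any single input (for example the agents-per-generation guess) shows how sensitive the half-million-dollar total is to that assumption.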

(Warning: this post is rough and in the weeds. I expect most readers should skip it and wait for a clearer synthesis later.)

In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would say” rather than “answer honestly.” In this post I want to describe another problem that feels very similar but may require new ideas to solve.

In brief, I’m interested in the case where:

• The simplest way for an AI to answer a question is to first translate from its internal model of the world into the human’s model of the world (so that it can talk about concepts like “tree” that may not exist in its native model of the world).
• The simplest way to translate between the
...

Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.

Abstracting out Answer, let's just imagine that our AI outputs a distribution  over the space of trajectories  in the human ontology, and somehow we define a reward function  evaluated by the human in hindsight after getting the observation . The idea is that this is calculated by having the A...

2Paul Christiano2dCausal structure is an intuitively appealing way to pick out the "intended" translation between an AI's model of the world and a human's model. For example, intuitively "There is a dog" causes "There is a barking sound." If we ask our neural net questions like "Is there a dog?" and it computes its answer by checking "Does a human labeler think there is a dog?" then its answers won't match the expected causal structure---so maybe we can avoid these kinds of answers. What does that mean if we apply typical definitions of causality to ML training? * If we define causality in terms of interventions, then this helps iff we have interventions in which the labeler is mistaken. In general, it seems we could just include examples with such interventions in the training set. * Similarly, if we use some kind of closest-possible-world semantics, then we need to be able to train models to answer questions consistently about nearby worlds in which the labeler is mistaken. It's not clear how to train a system to do that. Probably the easiest is to have a human labeler in world X talking about what would happen in some other world Y, where the labeling process is potentially mistaken. (As in "decoupled rl [https://arxiv.org/pdf/1705.08417.pdf]" approaches.) However, in this case it seems liable to learn the "instrumental policy" that asks "What does a human in possible world X think about what would happen in world Y?" which seems only slightly harder than the original. * We could talk about conditional independencies that we expect to remain robust on new distributions (e.g. in cases where humans are mistaken). I'll discuss this a bit in a reply. Here's an abstract example to think about these proposals, just a special case of the example from this post [https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches] . * Suppose that reality M is described as a causal graph X --> A -->
2Paul Christiano2dThis is also a way to think about the proposals in this post and the reply [https://www.alignmentforum.org/posts/GxzEnkSFL5DnQEAsZ/paulfchristiano-s-shortform?commentId=swxCRdj3amrQjYJZD] : * The human believes that A' and B' are related in a certain way for simple+fundamental reasons. * On the training distribution, all of the functions we are considering reproduce the expected relationship. However, the reason that they reproduce the expected relationship is quite different. * For the intended function, you can verify this relationship by looking at the link (A --> B) and the coarse-graining applied to A and B, and verify that the probabilities work out. (That is, I can replace all of the rest of the computational graph with nonsense, or independent samples, and get the same relationship.) * For the bad function, you have to look at basically the whole graph. That is, it's not the case that the human's beliefs about A' and B' have the right relationship for arbitrary Ys, they only have the right relationship for a very particular distribution of Ys. So to see that A' and B' have the right relationship, we need to simulate the actual underlying dynamics where A --> B, since that creates the correlations in Y that actually lead to the expected correlations between A' and B'. * It seems like we believe not only that A' and B' are related in a certain way, but that the relationship should be for simple reasons, and so there's a real sense in which it's a bad sign if we need to do a ton of extra compute to verify that relationship. I still don't have a great handle on that kind of argument. I suspect it won't ultimately come down to "faster is better," though as a heuristic that seems to work surprisingly well. I think that this feels a bit more plausible to me as a story for why faster would be better (but only a bit). * It's not always going to be quite this cut and dried---depending on the structu
2Paul Christiano2dSo are there some facts about conditional independencies that would privilege the intended mapping? Here is one option. We believe that A' and C' should be independent conditioned on B'. One problem is that this isn't even true, because B' is a coarse-graining and so there are in fact correlations between A' and C' that the human doesn't understand. That said, I think that the bad map introduces further conditional correlations, even assuming B=B'. For example, if you imagine Y preserving some facts about A' and C', and if the human is sometimes mistaken about B'=B, then we will introduce extra correlations between the human's beliefs about A' and C'. I think it's pretty plausible that there are necessarily some "new" correlations in any case where the human's inference is imperfect, but I'd like to understand that better. So I think the biggest problem is that none of the human's believed conditional independencies actually hold---they are both precise, and (more problematically) they may themselves only hold "on distribution" in some appropriate sense. This problem seems pretty approachable though and so I'm excited to spend some time thinking about it.

Actually if A --> B --> C and I observe some function of (A, B, C) it's just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but means I want to be more careful about the definition of the case to ensure that it's actually difficult before concluding that this kind of conditional independence structure is potentially useful.
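
The A+C example can be checked by brute-force enumeration. The sketch below is a toy under explicit assumptions: A, B, C are binary, each link copies its parent with probability 3/4 (an arbitrary choice), and "beliefs" are modeled simply as exact posteriors after conditioning on the observation. It confirms that A and C are independent given B in the chain itself, but not once A+C has been observed.

```python
from fractions import Fraction
from itertools import product


def joint():
    """Joint distribution of a toy binary chain A -> B -> C."""
    stay = Fraction(3, 4)  # assumed probability that a link copies its parent
    table = {}
    for a, b, c in product((0, 1), repeat=3):
        p_b = stay if b == a else 1 - stay
        p_c = stay if c == b else 1 - stay
        table[(a, b, c)] = Fraction(1, 2) * p_b * p_c
    return table


def independent_given_b(table, keep):
    """Check A independent of C given B, restricted to outcomes where keep holds."""
    kept = {k: v for k, v in table.items() if keep(*k)}
    for b in (0, 1):
        total = sum(v for (_, b_, _), v in kept.items() if b_ == b)
        if total == 0:
            continue
        for a, c in product((0, 1), repeat=2):
            p_ac = kept.get((a, b, c), 0) / total
            p_a = sum(kept.get((a, b, c2), 0) for c2 in (0, 1)) / total
            p_c = sum(kept.get((a2, b, c), 0) for a2 in (0, 1)) / total
            if p_ac != p_a * p_c:
                return False
    return True


if __name__ == "__main__":
    table = joint()
    print("no observation:  ", independent_given_b(table, lambda a, b, c: True))
    print("observed A+C = 1:", independent_given_b(table, lambda a, b, c: a + c == 1))
```

Once A+C = 1 is known, A determines C exactly, so the conditional independence that holds in the underlying chain cannot survive in the posterior.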

3Paul Christiano6dSuppose I am interested in finding a program M whose input-output behavior has some property P that I can probabilistically check relatively quickly (e.g. I want to check whether M implements a sparse cut of some large implicit graph). I believe there is some simple and fast program M that does the trick. But even this relatively simple M is much more complex than the specification of the property P. Now suppose I search for the simplest program running in time T that has property P. If T is sufficiently large, then I will end up getting the program "Search for the simplest program running in time T' that has property P, then run that." (Or something even simpler, but the point is that it will make no reference to the intended program M since encoding P is cheaper.) I may be happy enough with this outcome, but there's some intuitive sense in which something weird and undesirable has happened here (and I may get in a distinctive kind of trouble if P is an approximate evaluation). I think this is likely to be a useful maximally-simplified example to think about.
2Paul Christiano6dThis is interesting to me for two reasons: * [Mainly] Several proposals for avoiding the instrumental policy work by penalizing computation. But I have a really shaky philosophical grip on why that's a reasonable thing to do, and so all of those solutions end up feeling weird to me. I can still evaluate them based on what works on concrete examples, but things are slippery enough that plan A is getting a handle on why this is a good idea. * In the long run I expect to have to handle learned optimizers by having the outer optimizer instead directly learn whatever the inner optimizer would have learned. This is an interesting setting to look at how that works out. (For example, in this case the outer optimizer just needs to be able to represent the hypothesis "There is a program that has property P and runs in time T' " and then do its own search over that space of faster programs.)
2Paul Christiano6dIn traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn't true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it's still fine..) But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively it's also often not true for scientific explanations---even simple properties can be surprising and require explanation, and can be used to support theories that are much more complex than the observation itself. Some thoughts: 1. It's quite plausible that in these cases we want to be doing something other than searching over programs. This is pretty clear in the "scientific explanation" case, and maybe it's the way to go for the kinds of alignment problems I've been thinking about recently. A basic challenge with searching over programs is that we have to interpret the other data. For example, if "correspondence between two models of physics" is some kind of different object like a description in natural language, then some amplified human is going to have to be thinking about that correspondence to see if it explains the facts. If we search over correspondences, some of them will be "attacks" on the human that basically convince them to run a general computation in order to explain the data. So we have two options: (i) perfectly harden the evaluation process against such attacks, (ii) try to ensure that there is always some way to just directly do whatever the attacker convinced the human to do. But (i) seems quite hard, and (ii) basically requires us to put all of the generic programs in our search space. 2. It's also quite plausible th
3Paul Christiano6dThe speed prior [https://en.wikipedia.org/wiki/Speed_prior] is calibrated such that this never happens if the learned optimizer is just using brute force---if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains. That means that in the regime where P is simple, the speed prior is the "least you can reasonably care about speed"---if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options. (However, this is very mild, since the speed prior cares only a tiny bit about speed. Adding 100 bits to your program is the same as letting it run 2^100 times longer, so you are basically just optimizing for simplicity.) To make this concrete, suppose that I instead used the kind-of-speed prior, where taking 4x longer is equivalent to using 1 extra bit of description complexity. And suppose that P is very simple relative to the complexities of the other objects involved. Suppose that the "object-level" program M has 1000 bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A search that uses the speed prior will be able to find this algorithm in 2^3000 time, and so will have a kind-of-speed complexity of 1500 bits. So the kind-of-speed prior will just end up delegating to the speed prior.
2Paul Christiano6dThe speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000 bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so will beat the object-level program. And the prior we'd end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1 more bit," i.e. we end up caring more about speed because we delegated. The actual limit on how much you care about speed is given by whatever search algorithms work best. I think it's likely possible to "expose" what is going on to the outer optimizer (so that it finds a hypothesis like "This local search algorithm is good" and then uses it to find an object-level program, rather than directly finding a program that bundles both of them together). But I'd guess intuitively that it's just not even meaningful to talk about the "simplest" programs or any prior that cares less about speed than the optimal search algorithm.
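
The bit-counting in the last two comments can be reproduced with a few lines. This is only a sketch of the arithmetic as stated there: the helper name is invented, and the search programs are assigned roughly 0 bits of description length, which is an assumption standing in for "P is very simple relative to the complexities of the other objects involved."

```python
def prior_complexity(description_bits, log2_runtime, bits_per_doubling):
    """Bits charged by a prior that trades runtime against description length."""
    return description_bits + bits_per_doubling * log2_runtime


SPEED = 1.0          # speed prior: running 2x longer costs 1 extra bit
KIND_OF_SPEED = 0.5  # kind-of-speed prior: running 4x longer costs 1 extra bit

# Object-level program M: 1000 bits, runs in 2^2000 time.
m_bits, m_log2_time = 1000, 2000
m_kind_of_speed = prior_complexity(m_bits, m_log2_time, KIND_OF_SPEED)

# A brute-force search under the speed prior finds M in about
# 2^(speed-prior complexity of M) time.
search_log2_time = prior_complexity(m_bits, m_log2_time, SPEED)
# Assumption: the search program itself costs ~0 bits to describe.
search_kind_of_speed = prior_complexity(0, search_log2_time, KIND_OF_SPEED)

# Local search example: fills in a 1000-bit program in 2^500 steps.
# Assumption: the local search program also costs ~0 bits to describe.
local_search_speed = prior_complexity(0, 500, SPEED)

if __name__ == "__main__":
    print("M under kind-of-speed prior:      ", m_kind_of_speed)
    print("search under kind-of-speed prior: ", search_kind_of_speed)
    print("local search under speed prior:   ", local_search_speed)
```

The search wrapper comes out cheaper than M under the kind-of-speed prior (1500 versus 2000 bits), and the local search comes out cheaper than any 1000-bit object-level program under the speed prior (500 bits), matching the delegation argument in both comments.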
This is a linkpost for https://arxiv.org/abs/1912.01683

Key takeaways

• The structure of the agent's environment often causes instrumental convergence. In many situations, there are (potentially combinatorially) many ways for power-seeking to be optimal, and relatively few ways for it not to be optimal.
• My previous results said something like: in a range of situations, when you're maximally uncertain about the agent's objective, this uncertainty assigns high probability to objectives for which power-seeking is optimal.
• My new results prove that in a range of situations, seeking power is optimal for most agent objectives (for a particularly strong formalization of 'most').

More generally, the new results say something like: in a range of situations, for most beliefs you could have about the agent's objective, these beliefs assign high probability to reward functions
...

Added to the post:

Relatedly [to power-seeking under the simplicity prior], Rohin Shah wrote:

if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.

29AI
Frontpage
6d

I've been poking at Evan's Clarifying Inner Alignment Terminology. His post gives two separate pictures (the objective-focused approach, which he focuses on, and the generalization-focused approach, which he mentions at the end). We can consolidate those pictures into one and-or graph as follows:

And-or graphs make explicit which subgoals are jointly sufficient, by drawing an arc between those subgoal lines. So, for example, this claims that intent alignment + capability robustness would be sufficient for impact alignment, but alternatively, outer alignment + robustness would also be sufficient.
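As a rough illustration of how an and-or graph is read, here is a minimal sketch that encodes only the two jointly-sufficient sets named in the paragraph above; the rest of the diagram is not reproduced. The data layout and the function name `is_achieved` are assumptions made for this example, not part of the original post.

```python
# Each goal maps to a list of alternatives (OR). Each alternative is a set of
# subgoals that are jointly sufficient (AND), i.e. joined by an arc in the graph.
AND_OR_GRAPH = {
    "impact alignment": [
        {"intent alignment", "capability robustness"},
        {"outer alignment", "robustness"},
    ],
}


def is_achieved(goal, established, graph=AND_OR_GRAPH):
    """Return True if `goal` is established directly or follows from the graph."""
    if goal in established:
        return True
    # A goal holds if every subgoal of at least one alternative holds.
    return any(
        all(is_achieved(sub, established, graph) for sub in alternative)
        for alternative in graph.get(goal, [])
    )


path_one = is_achieved("impact alignment", {"intent alignment", "capability robustness"})
path_two = is_achieved("impact alignment", {"outer alignment", "robustness"})
mixed = is_achieved("impact alignment", {"intent alignment", "robustness"})
```

Here `path_one` and `path_two` are true, while `mixed` is false: subgoals drawn from different arcs are not claimed to be sufficient together.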

The red represents what belongs entirely to the generalization-focused path. The yellow represents what belongs entirely to the objective-focused path. The blue represents everything else. (In this diagram, all the blue is on both paths, but that will not be the case...

For a while, I've thought that the strategy of "split the problem into a complete set of necessary sub-goals" is incomplete. It produces problem factorizations, but it's not sufficient to produce good problem factorizations - it usually won't cut reality at clean joints. That was my main concern with Evan's factorization, and it also applies to all of these, but I couldn't quite put my finger on what the problem was.

I think I can explain it now: when I say I want a factorization of alignment to "cut reality at the joints", I think what I mean is that each ...

8Rohin Shah5dI like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that's done. I'm not too keen on (2) since I don't expect mesa objectives to exist in the relevant sense. For (1), I'd note that we need to get it right on the situations that actually happen, rather than all situations. We can also have systems that only need to work for the next N timesteps, after which they are retrained again given our new understanding of the world; this effectively limits how much distribution shift can happen. Then we could do some combination of the following: 1. Build neural net theory. We currently have a very poor understanding of why neural nets work; if we had a better understanding it seems plausible we could have high confidence in when a neural net would generalize correctly. (I'm imagining that neural net theory goes from how I imagine physics looked before Newton to how I imagine it looked after Newton.) 2. Use techniques like adversarial training to "robustify" the model against moderate distribution shifts (which might be sufficient to work for the next N timesteps, after which you "robustify" again). 3. Make these techniques work better through interpretability / transparency. 4. Use checks and balances. For example, if multiple generalizations are possible, train an ensemble of models and only do something if they all agree on it. Or train an actor agent combined with an overseer agent that has veto power over all actions. Or an ensemble of actors, each of which oversees the other actors and has veto power over them. These aren't "clean", in the sense that you don't get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for "all situations" that avoids the no-free-lunch theorem, though I sup
1Jack Koch4dSame, but how optimistic are you that we could figure out how to shape the motivations or internal "goals" (much more loosely defined than "mesa-objective") of our models via influencing the training objective/reward, the inductive biases of the model, the environments they're trained in, some combination of these things, etc.? Yup, if you want "clean," I agree that you'll have to either assume a distribution over possible inputs, or identify a perturbation set over possible test environments to avoid NFL.
2Rohin Shah3dThat seems great, e.g. I think by far the best thing you can do is to make sure that you finetune using a reward function / labeling process that reflects what you actually want (i.e. what people typically call "outer alignment"). I probably should have mentioned that too, I was taking it as a given but I really shouldn't have. For inductive biases + environments, I do think controlling those appropriately would be useful and I would view that as an example of (1) in my previous comment.
4Abram Demski5dBut it seems to me that there's something missing in terms of acceptability. The definition of "objective robustness" I used says "aligns with the base objective" (including off-distribution). But I think this isn't an appropriate representation of your approach. Rather, "objective robustness" has to be defined something like "generalizes acceptably". Then, ideas like adversarial training and checks and balances make sense as a part of the story. WRT your suggestions, I think there's a spectrum from "clean" to "not clean", and the ideas you propose could fall at multiple points on that spectrum (depending on how they are implemented, how much theory backs them up, etc). So, yeah, I favor "cleaner" ideas than you do, but that doesn't rule out this path for me.
2Rohin Shah4dYeah, strong +1.
2Abram Demski4dGreat! I feel like we're making progress on these basic definitions.
3Jack Koch6dShouldn’t this be “intent alignment + capability robustness or outer alignment + robustness”? Btw, I plan to post more detailed comments in response here and to your other post, just wanted to note this so hopefully there’s no confusion in interpreting your diagram.
2Abram Demski5dYep, fixed.