# All of paulfchristiano's Comments + Replies

Avoiding the instrumental policy by hiding information about humans

Unpacking "mutual information," it seems like these designs basically take the form of an adversarial game:

• The model computes some intermediate states.
• The model is trained so that the adversary can't succeed.

In the case of mutual information, this is achieved formally by having a random variable that the adversary does not observe directly. If we are talking about "what human... (read more)

A naive alignment strategy and optimism about generalization

I agree you have to do something clever to make the intended policy plausibly optimal.

The first part of my proposal in section 3 here was to avoid using "imitate humans," and to instead learn a function "Answer A is unambiguously worse than answer B." Then we update against policies only when they give unambiguously worse answers.

(I think this still has a lot of problems; it's not obvious to me whether the problem is soluble.)

I think they need to be exactly equal. I think this is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy (since it will do the right thing when the simple labels are wrong).

I think "deferring" was a bad word for me to use. I mostly imagine the complex labeling process will just independently label data, and then only include datapoints when there is agreement. That is, you'd just always return the (simple, complex) pair, and is-correct basically just tests whether they are equal.
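The agreement filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: `simple_labeler`, `complex_labeler`, `is_correct`, and `build_training_set` are hypothetical names, and exact equality stands in for whatever notion of agreement the real scheme would use.

```python
from typing import Callable, Iterable, List, Tuple

def is_correct(simple_label: str, complex_label: str) -> bool:
    """Hypothetical agreement test: here, plain equality of the two labels."""
    return simple_label == complex_label

def build_training_set(
    inputs: Iterable[int],
    simple_labeler: Callable[[int], str],
    complex_labeler: Callable[[int], str],
) -> List[Tuple[int, str]]:
    """Label every input with both processes and keep only the agreeing datapoints."""
    kept = []
    for x in inputs:
        simple_label = simple_labeler(x)
        complex_label = complex_labeler(x)  # labeled independently of the simple label
        if is_correct(simple_label, complex_label):
            kept.append((x, simple_label))
    return kept
```

Datapoints where the two processes disagree are dropped rather than corrected, so the training signal never rewards matching a simple label that the complex process rejects.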

I said "defer" because one of the data that the complex labeling process uses may be "what a human who was in the room said," and this may sometimes be a really important source of evidence. But that really depends on how you set things up, if you hav... (read more)

1 Joe_Collman 22d: Ok, that all makes sense, thanks. So here "equal" would presumably be "essentially equal in the judgement of complex process", rather than verbatim equality of labels (the latter seems silly to me; if it's not silly I must be missing something).

I don't think anyone has a precise general definition of "answer questions honestly" (though I often consider simple examples in which the meaning is clear). But we do all understand how "imitate what a human would say" is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards "imitate what a human would say" is clearly a problem to be solved even if other concepts are philosophically ambiguous.

Sometimes a model might say something like "No one entered the datacenter" when w... (read more)

Also, I don't see what this objective has to do with learning a world model.

The idea is to address a particular reason that your learned model would "copy a human" rather than "try to answer the question well." Namely, the model already contains human-predictors, so building extra machinery to answer questions (basically translating between the world model and natural language) would be more inefficient than just using the existing human predictor. The hope is that this alternative loss allows you to use the translation machinery to compress the humans, so... (read more)

4 John Schulman 23d: D'oh, re: the optimum of the objective, I now see that the solution is nontrivial. Here's my current understanding. Intuitively, the MAP version of the objective says: find me a simple model theta1 such that there's a more-complex theta2 with high likelihood under p(theta2|theta1) (which corresponds to sampling theta2 near theta1 until theta2 satisfies the head-agreement condition) and high data-likelihood p(data|theta2). And this connects to the previous argument about world models and language as follows: we want theta1 to contain half a world model, and we want theta2 to contain the full world model and high data-likelihood (for one of the heads) and the two heads agree. Based on Step 1, the problem is still pretty underconstrained, but maybe that's resolved in Step 2.
Response to "What does the universal prior actually look like?"

By (3) do you mean the same thing as "Simplest output channel that is controllable by advanced civilization with modest resources"?

I assume (6) means that your "anthropic update" scans across possible universes to find those that contain important decisions you might want to influence?

If you want to compare most easily to models like that, then instead of using (1)+(2)+(3) you should compare to (6') = "Simplest program that scans across many possible worlds to find those that contain some pattern that can be engineered by consequentialists trying to influe... (read more)

1 michaelcohen 22d: Yes, and then outputs strings from that set with probability proportional to their weight in the universal prior. I would say "successfully controlled" instead of controllable, although that may be what you meant by the term. (I decomposed this as controllable + making good guesses.) For some definitions of controllable, I might have given a point estimate of maybe 1 or 5 bits. But there has to be an output channel for which the way you transmit a bitstring out is the way the evolved consequentialists expect. But recasting it in these terms, which implicitly suggests that the specification of the output channel can take on some of the character of (6'), makes me want to put my range down to 15-60; point estimate 25. Similarly, I would replace "can be" with "seems to have been". And just to make sure we're talking about the same thing, it takes this list of patterns, and outputs them with probability proportional to their weight in the universal prior. Yeah, this seems like it would make some significant savings compared to (1)+(2)+(3). I think replacing parts of the story from being specified as [arising from natural world dynamics] to being specified as [picked out "deliberately" by a program] generally leads to savings. I don't quite understand the sense in which [worlds with consequentialist beacons/geoglyphs] can be described as [easiest-to-specify controllable pattern]. (And if you accept the change of "can be" to "seems to have been", it propagates here). Scanning for important predictors to influence does feel very similar to me to scanning for consequentialist beacons, especially since the important worlds are plausibly the ones with consequentialists. There's a bit more work to be done in (6') besides just scanning for consequentialist beacons.
If the output channel is selected "conveniently" for the consequentialists, since the program is looking for the beacons, instead of the consequentialists giving it their best guess(es) and putting up a bu
Response to "What does the universal prior actually look like?"

Here's my current understanding of your position:

1. The easiest way to specify an important prediction problem (in the sense of a prediction that would be valuable for someone to influence) is likely to be by saying "Run the following Turing machine, then pick an important decision from within it." Let's say the complexity of that specification is N bits.
2. You think that if consequentialists dedicate some fraction of their resources to doing something that's easy for the universal prior to output, it will still likely take more than N bits or not much less.
3. [Pro
1 michaelcohen 1mo: Yeah, seems about right. I think with 4, I've been assuming for the sake of argument that manipulators get free access to the right prior, and I don't have a strong stance on the question, but it's not complicated for a directly programmed anthropic update to be built on that right prior too. I guess I can give some estimates for how many bits I think are required for each of the rows in the table. I'll give a point estimate, and a range for a 50% confidence interval for what my point estimate would be if I thought about it for an hour by myself and had to write up my thinking along the way. I don't have a good sense for how many bits it takes to get past things that are just extremely basic, like an empty string, or an infinite string of 0s. But whatever that number is, add it to both 1 and 6.

1) Consequentialists emerge, 10 - 50 bits; point estimate 18
2) TM output has not yet begun, 10 - 30 bits; point estimate 18
3) make good guesses about controllable output, 18 - 150 bits; point estimate 40
4) decide to output anthropically updated prior, 8 - 35 bits; point estimate 15
5) decide to do a treacherous turn, 1 - 12 bits; point estimate 5

vs.

6) direct program for anthropic update, 18 - 100 bits; point estimate 30

The ranges are fairly correlated.
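A minimal sketch of how the point estimates above compare. It assumes the cost of a route is the sum of its rows, and that a gap of k bits in description length corresponds to a factor of 2^k in prior weight. Both assumptions are simplifications introduced here for illustration; the comment gives no totals.

```python
# Point estimates in bits, copied from the comment above.
ROUTE_A = {
    "consequentialists emerge": 18,
    "TM output has not yet begun": 18,
    "make good guesses about controllable output": 40,
    "decide to output anthropically updated prior": 15,
    "decide to do a treacherous turn": 5,
}
ROUTE_B = {
    "direct program for anthropic update": 30,
}

def total_bits(rows):
    """Assumed additive cost: sum the per-row bit estimates."""
    return sum(rows.values())

def weight_ratio(bits_a, bits_b):
    """Assumed prior-weight ratio of route B to route A: 2 ** (bits_a - bits_b)."""
    return 2.0 ** (bits_a - bits_b)

if __name__ == "__main__":
    a, b = total_bits(ROUTE_A), total_bits(ROUTE_B)
    print(f"route A: {a} bits, route B: {b} bits, gap: {a - b} bits")
    print(f"route B weight / route A weight: {weight_ratio(a, b):.3g}")
```

Under these assumptions, the point estimates put rows 1 to 5 at 96 bits and row 6 at 30 bits.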
Decoupling deliberation from competition

I agree that biological human deliberation is slow enough that it would need to happen late.

By "millennia" I mostly meant that traveling is slow (+ the social costs of delay are low, I'm estimating like 1/billionth of value per year of delay). I agree that you can start sending fast-enough-to-be-relevant ships around the singularity rather than decades later. I'd guess the main reason speed matters initially is for grabbing resources from nearby stars under whoever-gets-there-first property rights (but that we probably will move away from that regime befor... (read more)

Decoupling deliberation from competition

I think I'm basically optimistic about every option you list.

• I think space colonization is extremely slow relative to deliberation (at technological maturity I think you probably have something like million-fold speedup over flesh and blood humans, and colonization takes place over decades and millennia rather than years). Deliberation may not be "finished" until the end of the universe, but I think we will e.g. have deliberated enough to make clear agreements about space colonization / to totally obsolete existing thinking / likely to have reached a "gran
2 Lukas Finnveden 1mo: Thanks, computer-speed deliberation being a lot faster than space-colonisation makes sense. I think any deliberation process that uses biological humans as a crucial input would be a lot slower, though; slow enough that it could well be faster to get started with maximally fast space colonisation. Do you agree with that? (I'm a bit surprised at the claim that colonization takes place over "millennia" at technological maturity; even if the travelling takes millennia, it's not clear to me why launching something maximally-fast – that you presumably already know how to build, at technological maturity – would take millennia. Though maybe you could argue that millennia-scale travelling time implies millennia-scale variance in your arrival-time, in which case launching decades or centuries after your competitors doesn't cost you too much expected space?) If you do agree, I'd infer that your mainline expectation is that we successfully enforce a worldwide pause before mature space-colonisation; since the OP suggests that biological humans are likely to be a significant input into the deliberation process, and since you think that the beaming-out-info schemes are pretty unlikely. (I take your point that as far as space-colonisation is concerned, such a pause probably isn't strictly necessary.)
Decoupling deliberation from competition

I would rate "value lost to bad deliberation" ("deliberation" broadly construed, and including easy+hard problems and individual+collective failures) as comparably important to "AI alignment." But I'd guess the total amount of investment in the problem is 1-2 orders of magnitude lower, so there is a strong prima facie case for longtermists prioritizing it.

Overall I think I'm quite a bit more optimistic than you are, and would prioritize these problems less than you would, but still agree directionally that these problems are surprisingly neglected (and I could imagine them playing more to the comparative advantages/interests of longtermists and the LW crowd than topics like AI alignment).

Decoupling deliberation from competition

What if our "deliberation" only made it as far as it did because of "competition", and that nobody or very few people knows how to deliberate correctly in the absence of competitive pressures? Basically, our current epistemic norms/practices came from the European Enlightenment, and they were spread largely via conquest or people adopting them to avoid being conquered or to compete in terms of living standards, etc. It seems that in the absence of strong competitive pressures of a certain kind, societies can quickly backslide or drift randomly in terms of

4 Wei Dai 23d: As another symptom of what's happening [https://0bin.net/paste/X3kZhXnJ#dMDp-6Uqn1pfIfEu3/m/xfiHMRfh5szUIis09vG483w] (the rest of this comment is in a "paste" that will expire in about a month, to reduce the risk of it being used against me in the future)

Here's an idea of how random drift of epistemic norms and practices can occur. Beliefs (including beliefs about normative epistemology) function in part as a signaling device, similar to clothes. (I forgot where I came across this idea originally, but a search produced a Robin Hanson article about it.) The social dynamics around this kind of signaling produces random drift in epistemic norms and practices, similar to random drift in fashion / clothing styles. Such drift coupled with certain kinds of competition could have produced the world we have today (... (read more)

We’ve talked about this a few times but I still don’t really feel like there’s much empirical support for the kind of permanent backsliding you’re concerned about being widespread.

I'm not claiming direct empirical support for permanent backsliding. That seems hard to come by, given that we can't see into the far future. I am observing quite severe current backsliding. For example, explicit ad hominem attacks, as well as implicitly weighing people's ideas/arguments/evidence differently, based on things like the speaker's race and sex, have become the nor... (read more)

Finite Factored Sets

Agree it's not totally right to call this a causal relationship.

That said:

• The contents of 3 envelopes do seem causally upstream of the contents of 10 envelopes
• If Alice's perception is imperfect (in any possible world), then "what Alice perceived" is not identical to "the contents of 3 envelopes" and so is not strictly before "what Bob perceived" (unless there is some other relationship between them).
• If Alice's perception is perfect in every possible world, then there is no possible way to intervene on Alice's perception without intervening on the conten
2 Vladimir Slepnev 1mo: I feel that interpreting "strictly before" as causality is making me more confused. For example, here's a scenario with a randomly changed message. Bob peeks at ten regular envelopes and a special envelope that gives him a random boolean. Then Bob tells Alice the contents of either the first three envelopes or the second three, depending on the boolean. Now Alice's knowledge depends on six out of ten regular envelopes and the special one, so it's still "strictly before" Bob's knowledge. And since Alice's knowledge can be computed from Bob's knowledge but not vice versa, in FFS terms that means the "cause" can be (and in fact is) computed from the "effect", but not vice versa. My causal intuition is just blinking at all this. Here's another scenario. Alice gets three regular envelopes and accurately reports their contents to Bob, and a special envelope that she keeps to herself. Then Bob peeks at seven more envelopes. Now Alice's knowledge isn't "before" Bob's, but if later Alice predictably forgets the contents of her special envelope, her knowledge becomes "before" Bob's. Even though the special envelope had no effect on the information Alice gave to Bob, didn't affect the causal arrow in any possible world. And if we insist that FFS=causality, then by forgetting the envelope, Alice travels back in time to become the cause of Bob's knowledge in the past. That's pretty exotic.

I think I (at least locally) endorse this view, and I think it is also a good pointer to what seems to me to be the largest crux between my theory of time and Pearl's theory of time.

Finite Factored Sets

I agree that bipartite graphs are only a natural way of thinking about it if you are starting from Pearl. I'm not sure anything in the framework is really properly analogous to the DAG in a causal model.

3 Koen Holtman 1mo: My thoughts on naming this finite factored sets: I agree with Paul's observation that "Factorization seems analogous to describing a world as a set of variables". By calling this 'finite factored sets', you are emphasizing the process of coming up with individual random variables, the variables that end up being the (names of the) nodes in a causal graph. With s∈S representing the entire observable 4D history of a world (like a computation starting from a single game of life board state), a factorisation B={b1,b2,⋯bn} splits such s into a tuple of separate, more basic observables (bb1,bb2,⋯,bbn), where bb1∈b1, etc. In the normal narrative that explains Pearl causal graphs, this splitting of the world into smaller observables is not emphasized. Also, the splitting does not necessarily need to be a bijection. It may lose descriptive information with respect to s. So I see the naming finite factored sets as a way to draw attention to this splitting step; it draws attention to the fact that if you split things differently, you may end up with very different causal graphs. This of course leaves open the question of whether you really want to name your framework in a way that draws attention to this part of the process. Definitely you spend a lot of time on creating an equivalent to the arrows between the nodes too.
Mundane solutions to exotic problems

It's not as clear to me that epistemic competitiveness + a normal training loop suffices to penalize gradient hacking. Would need to think about it more. Whereas I feel reasonably comfortable saying that if the model is able to alter its behavior because it believes that it will make SGD work worse, then a competitive overseer is able to use the same information to make SGD work better. (Though I can't be that comfortable about anything given how shaky the abstractions are etc., and mostly expect to revisit with a clearer sense of what epistemic competitiveness means.)

Finite Factored Sets

I think FFS makes sense as an analog of DAG, and it seems reasonable to think of the normal model as DAG time and this model as FFS time. I think the name made me a bit confused by calling attention to one particular diff between this model and Pearl (factored sets vs variables), whereas I actually feel like that diff was basically a red herring and it would have been fastest to understand if the presentation had gone in the opposite direction by de-emphasizing that diff (e.g. by presenting the framework with variables instead of factors).

Makes sense. I think a bit of my naming and presentation was biased by being so surprised by the not on OEIS fact.

I think I disagree about the bipartite graph thing. I think it only feels more natural when comparing to Pearl. The talk frames everything in comparison to Pearl, but I think if you are not looking at Pearl, I think graphs don’t feel like the right representation here. Comparing to Pearl is obviously super important, and maybe the first introduction should just be about the path from Pearl to FFS, but once we are working within the FFS ontology... (read more)

Finite Factored Sets

Here is how I'm currently thinking about this framework and especially inference, in case it's helpful for other folks who have similar priors to mine (or in case something is still wrong).

A description of traditional causal models:

• A causal graph with N nodes can be viewed as a model with 2N variables, one for each node of the graph and a corresponding noise variable for each. Each real variable is a deterministic function of its corresponding noise variable + its parents in the causal graph.
• When we talk about causal inference, we often consider probabilit
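The first bullet above can be made concrete with a toy example. This is a minimal sketch, not something from the original comment: the three node names, the edges between them, and the noise probabilities are all hypothetical.

```python
import random

# Hypothetical 3-node causal graph: rain -> sprinkler, rain -> wet, sprinkler -> wet.
# With N = 3 nodes the model has 2N = 6 variables: 3 noise variables + 3 real variables.

def sample_noise(rng):
    """Draw one independent noise variable per node (probabilities are made up)."""
    return {
        "rain": rng.random() < 0.3,
        "sprinkler": rng.random() < 0.5,
        "wet": rng.random() < 0.05,
    }

def mechanisms(noise):
    """Compute each real variable deterministically from its own noise and its parents."""
    rain = noise["rain"]                          # no parents
    sprinkler = noise["sprinkler"] and not rain   # parent: rain
    wet = rain or sprinkler or noise["wet"]       # parents: rain, sprinkler
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

if __name__ == "__main__":
    rng = random.Random(0)
    noise = sample_noise(rng)
    print("noise:", noise)
    print("real: ", mechanisms(noise))
```

All randomness lives in `sample_noise`; once the noise is fixed, `mechanisms` is an ordinary deterministic function, which is the sense in which the graph is a model over 2N variables.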

I think the definition of history is the most natural way to recover something like causal structure in these models.

I'm not sure how much it's about causality. Imagine there's a bunch of envelopes with numbers inside, and one of the following happens:

1. Alice peeks at three envelopes. Bob peeks at ten, which include Alice's three.

2. Alice peeks at three envelopes and tells the results to Bob, who then peeks at seven more.

3. Bob peeks at ten envelopes, then tells Alice the contents of three of them.

Under the FFS definition, Alice's knowledge in each ... (read more)

4 Scott Garrabrant 1mo: Thanks Paul, this seems really helpful. As for the name, I feel like "FFS" is a good name for the analog of "DAG", which also doesn't communicate that much of the intuition, but maybe doesn't make as much sense as a name for the framework.
Response to "What does the universal prior actually look like?"

bits to specify camera on earth - bits saved from anthropic update

I think the relevant number is just "log_2 of the number of predictions that the manipulators want to influence." It seems tricky to think about this (rather small) number as the difference between two (giant) numbers.

So they have to specify the random camera on an earth-like Turing machine too.

They are just looking at the earth-like Turing machine, looking for the inductors whose predictions are important, and then trying to copy those input sequences. This seems mostly unrelated to t... (read more)

1 michaelcohen 1mo: I’m talking about the weight of an anthropically updated prior within the universal prior. I should have added “+ bits to encode anthropic update directly” to that side of the equation. That is, it takes some number of bits to encode “the universal prior, but conditioned on the strings being important to decision-makers in important worlds”. I don’t know how to encode this, but there is presumably a relatively simple direct encoding, since it’s a relatively simple concept. This is what I was talking about in my response to the section “The competition”. One way that might be helpful about thinking about the bits saved from the anthropic update is that it is $-\log \Pr_{\text{string} \sim \text{universal prior}}(\text{string is important to decision-makers in important worlds})$. I think this gives us a handle in reasoning about anthropic savings as a self-contained object, even if it’s a big number. But suppose they picked only one string to try to manipulate. The cost would go way down, but then it probably wouldn’t be us that they hit. If log of the number of predictions that the manipulators want to influence is 7 bits shorter than [bits to specify camera on earth - bits saved from anthropic update], then there’s a 99% chance we’re okay. If different manipulators in different worlds are choosing differently, we can expect 1% of them to choose our world, and so we start worrying again, but we add the 7 bits back because it’s only 1% of them. So let’s consider two Turing machines. Each row will have a cost in bits.

| A | B |
|---|---|
| Consequentialists emerge, | Directly programmed anthropic update. |
| make good guesses about controllable output, | |
| decide to output anthropically updated prior. | |
| Weight of earth-camera within anthropically updated prior | Weight of earth-camera within anthropically updated prior |

The last point can be decomposed into [description length of camera in our world - anthropic savings], but it doesn’t matter; it appears in both options.
I don’t think this is what you have in mind, but I’ll add another case, in case this is what you meant by “The
Mundane solutions to exotic problems

leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients

This seems slightly confusing/unclear---I'm not imagining penalizing the model for trying to hack the gradients, I'm imagining changing the loss in a way that blocks the attempted gradient hacking. E.g. the model knows that parameters  are in the direction of more aligned models, and it could hijack the training process by ensuring that  gets a high loss. So it tries to behave badly when its own parameters are , trying t... (read more)

2 Rohin Shah 1mo: Ah, I see, that makes sense. I had in fact misunderstood what you were saying here. That being said, why not penalize the model when it engages in gradient hacking, so that it becomes less likely to do it in the future?
Response to "What does the universal prior actually look like?"

They are using their highest probability guess about the output channel, which will be higher probability than the output channel exactly matching some camera on old earth (but may still be very low probability). I still don't understand the relevance.

I'm probably going to give up soon, but there was one hint about a possible miscommunication:

Suppose they want the first N bits of the output of their Turing machine to obey predicate P, and they assign that a value of 100

They don't care about "their" Turing machine, indeed they live in an infinite number of ... (read more)

1 michaelcohen 1mo: I’m trying to find the simplest setting where we have a disagreement. We don’t need to think about cameras on earth quite yet. I understand the relevance isn’t immediate. I think I see the distinction between the frameworks in which we most naturally think about the situation. I agree that they live in an infinite number of Turing machines, in the sense that their conscious patterns appear in many different Turing machines. All of these Turing machines have weight in some prior. When they change their behavior, they (potentially) change the outputs of any of these Turing machines. Taking these Turing machines as a set, weighted by those prior weights, we can consider the probability that the output obeys a predicate P. The answer to this question can be arrived at through an equivalent process. Let the inhabitants imagine that there is a correct answer to the question “which Turing machine do I really live in?” They then reason anthropically about which Turing machines give rise to such conscious experiences as theirs. They then use the same prior over Turing machines that I described above. And then they make the same calculation about the probability that “their” Turing machine outputs something that obeys the predicate P.
So on the one hand, we could say that we are asking “what is the probability that the section of the universal prior which gives rise to these inhabitants outputs an output that obeys predicate P?” Or we could equivalently ask “what is the probability that this inhabitant ascribes to ‘its’ Turing machine outputting a string that obeys predicate P?” There are facts that I find much easier to incorporate when thinking in the latter framework, such as “a work tape inhabitant knows nothing about the behavior of its Turing machine’s output tape, except that it has relative simplicity given the world that it knows.” (If it believes that its conscious existence depends on its Turing machine never having output a bit that differs from a data stream in a base wo
Response to "What does the universal prior actually look like?"

Someone in the basement universe is reasoning about the output of a randomized Turing machine that I'm running on.

I care about what they believe about that Turing machine. Namely, I want them to believe that most of the time when the sequence x appears, it is followed by a 1.

Their beliefs depend in a linear way on my probabilities of action.

(At least if e.g. I committed to that policy at an early enough time for them to reason about it, or if my policy is sufficiently predictable to be correlated with their predictions, or if they are able to actually simu... (read more)

Response to "What does the universal prior actually look like?"

I don't think I understand what you mean. Their goal is to increase the probability of the sequence x+1, so that someone who has observed the sequence x will predict 1.

What do you mean when you say "What about in the case where they don't know"?

I agree that under your prior, someone has no way to increase e.g. the fraction of sequences in the universal prior that start with 1 (or the fraction of 1s in a typical sequence under the universal prior, or any other property that is antisymmetric under exchange of 0 and 1).

1 michaelcohen 1mo: Okay, now suppose they want the first N bits of the output of their Turing machine to obey predicate P, and they assign that a value of 100, and they assign a value of 0 to any N-bit string that does not obey predicate P. And they don't value anything else. If some actions have a higher value than other actions, what information about the output tape dynamics are they using, and how did they acquire it?
Response to "What does the universal prior actually look like?"

I don't agree, but I may still misunderstand something. Stepping back to the beginning:

Suppose they know the sequence that actually gets fed to the camera. It is x= 010...011.

They want to make the next bit 1. That is, they want to maximize the probability of the sequence (x+1)=010...0111.

They have developed a plan for controlling an output channel to get it to output (x+1).

For concreteness imagine that they did this by somehow encoding x+1 in a sequence of ultra high-energy photons sent in a particular direction. Maybe they encode 1 as a photon with freque... (read more)

1 michaelcohen 1mo: If you're saying that they know their Turing machine has output x so far, then I 100% agree. What about in the case where they don't know?
Response to "What does the universal prior actually look like?"

To express my confusion more precisely:

I feel like this story has run aground on an impossibility result. If a random variable’s value is unknowable (but its distribution is known) and an intelligent agent wants to act on its value, and they randomize their actions, the expected log probability of them acting on the true value cannot exceed the entropy of the distribution, no matter their intelligence.

I think that's right (other than the fact that they can win simultaneously for many different output rules, but I'm happy ignoring that for now). But I don't... (read more)
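The quoted impossibility result reads like Gibbs' inequality: if the true value is drawn from a distribution p and the agent acts according to a distribution q, then the expected log probability of acting on the true value satisfies E_p[log q] ≤ -H(p), with equality only when q = p. Reading "cannot exceed the entropy" as this negative-entropy bound is an interpretation, not something the comment states. A minimal numerical check, with made-up distributions:

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def expected_log_prob(p, q):
    """E_{x ~ p}[log2 q(x)]: expected log probability that a q-randomized action hits the true value."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return float("-inf")  # the agent never acts on a value that actually occurs
        total += pi * math.log2(qi)
    return total

if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]  # hypothetical distribution of the unknowable value
    for q in ([0.5, 0.25, 0.25], [0.8, 0.1, 0.1], [1 / 3, 1 / 3, 1 / 3]):
        print(q, expected_log_prob(p, q), "<=", -entropy_bits(p))
```

Matching q to p attains the bound; any other randomization does strictly worse in expected log probability.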

Response to "What does the universal prior actually look like?"

I agree that randomization reduces the "upside" in the sense of "reducing our weight in the universal prior." But utility is not linear in that weight.

I'm saying that the consequentialists completely dominate the universal prior, and they will still completely dominate if you reduce their weight by 2x. So either way they get all the influence. (Quantitatively, suppose the consequentialists currently have probability 1000 times greater than the intended model. Then they have 99.9% of the posterior. If they decreased their probability of acting by two, then ... (read more)
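The arithmetic in the parenthetical can be checked directly. The sketch below assumes only two hypotheses are in play (the consequentialists and the intended model) and that halving the consequentialists' probability of acting halves their odds against the intended model; `posterior_share` is a name introduced here for illustration.

```python
def posterior_share(odds_ratio):
    """Posterior share of a hypothesis whose weight is odds_ratio times the only alternative's."""
    return odds_ratio / (odds_ratio + 1.0)

if __name__ == "__main__":
    full = posterior_share(1000.0)   # roughly 99.9%
    halved = posterior_share(500.0)  # roughly 99.8%
    print(f"odds 1000:1 -> {full:.4%} of the posterior")
    print(f"odds  500:1 -> {halved:.4%} of the posterior")
```

Under these assumptions, halving the weight moves the posterior share by about a tenth of a percentage point, which is the sense in which utility is not linear in the weight.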

1 michaelcohen 1mo: If I flip a coin to randomize between two policies, I don't see how that mixed policy could produce more value for me than the base policies. (ETA: the logical implications about the fact of my randomization don't have any weird anti-adversarial effects here).
Response to "What does the universal prior actually look like?"

I currently don't understand the information-theoretic argument at all (and it feels like it must come down to some kind of miscommunication), so it seems easiest to talk about how the impossibility argument applies to the situation being discussed.

If we want to instead engage on the abstract argument, I think it would be helpful to me to present it as a series of steps that ends up saying "And that's why the consequentialists can't have any influence." I think the key place I get lost is the connection between the math you are saying and a conclusion about the influence that the consequentialists have.

1 michaelcohen 1mo: If these consequentialists ascribed a value of 100 to the next output bit being 1, and a value of 0 to the next output bit being 0, and they valued nothing else, would you agree that all actions available to them have identical expected value under the distribution over Turing machines that I have described?
Response to "What does the universal prior actually look like?"

I don’t think it’s just like saying that...

I didn't quite get this, so let me try restating what I mean.

Let's say the states and rules for manipulating the worktapes are totally fixed and known, and we're just uncertain about the rules for outputting something to the output tape.

Zero of these correspond to reading off the bits from a camera (or dataset) embedded in the world. Any output rule that lets you read off precisely the bits from the camera is going to involve adding a bunch of new states to the Turing machine.

1 michaelcohen 1mo: I take your point that we are discussing some output rules which add extra computation states, and so some output rules will add fewer computation states than others. I'm merging my response to the rest with my comment here [https://www.lesswrong.com/posts/n2Gseb3XFpMyc2FEb/response-to-what-does-the-universal-prior-actually-look-like?commentId=QHQG8Wso6wqEy2m8b].
Response to "What does the universal prior actually look like?"

I basically agree that if the civilization has a really good grasp of the situation, and in particular has no subjective uncertainty (merely uncertainty over which particular TM they are), then they can do even better by just focusing their effort on the single best set of channels rather than randomizing.

(Randomization is still relevant for reducing the cost to them though.)

1 michaelcohen 1mo: With randomization, you reduce the cost and the upside in concert. If a pair of shoes costs $100, and that's more than I'm willing to pay, I could buy the shoes with probability 1%, and it will only cost me $1 in expectation, but I will only get the shoes with probability 1/100.
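
The shoe example can be written out as a small calculation. This is a sketch under the comment's own numbers ($100 price, 1% purchase probability); the function name `randomized_purchase` is hypothetical. It shows that the expected cost and the chance of getting the shoes scale together, so the price paid per unit of probability stays fixed.

```python
from fractions import Fraction


def randomized_purchase(price, probability):
    """Return (expected cost, chance of getting the item) when the
    purchase is made only with the given probability."""
    expected_cost = price * probability
    chance_of_item = probability
    return expected_cost, chance_of_item


price = Fraction(100)
cost, chance = randomized_purchase(price, Fraction(1, 100))

print("expected cost:", cost)            # dollars, in expectation
print("chance of shoes:", chance)
print("cost per unit of probability:", cost / chance)
```

Whether this linear trade-off carries over to influence on the universal prior is exactly what the thread disputes; the sketch only covers the shoe case.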
Response to "What does the universal prior actually look like?"

I'm imagining that the consequentialists care about something, like e.g. human flourishing. They think that they could use their control over the universal prior to achieve more of what they care about, i.e. by achieving a bunch of human flourishing in some other universe where someone thinks about the universal prior. Randomizing is one strategy available to them to do that.

So I'm saying that I expect they will do better---i.e. get more influence over the outside world (per unit of cost paid in their world)---than if they had simply randomized. That's bec...

3 michaelcohen 1mo: It's definitely not too weird a possibility for me. I'm trying to reason backwards here--the best strategy available to them can't be effective in expectation at achieving whatever their goals are with the output tape, because of information-theoretic impossibilities, and therefore, any given strategy will be that bad or worse, including randomization.
Response to "What does the universal prior actually look like?"

In this story, I'm imagining that hypotheses like "simulate simple physics, start reading from simple location" lose, but similar hypotheses like "simulate simple physics, start reading from simple location after a long delay" (or after seeing pattern X, or whatever) could be among the output channels that we consider manipulating. Those would also eventually get falsified (if we wanted to deliberately make bad predictions in order to influence the basement world where someone is thinking about the universal prior) but not until a critical prediction that we wanted to influence.

Response to "What does the universal prior actually look like?"

I hope that most of your comments are cleared up by the story. But some line by line comments in case they help:

affecting the world in which the Turing machine is being run

I'm talking about what the actual real universal prior looks like rather than some approximation, and no one is actually running all of the relevant Turing machines. I'm imagining this whole exercise being relevant in the context of systems that perform abstract reasoning about features of the universal prior (e.g. to make decisions on the basis of their best guesses about the post...

1 michaelcohen 1mo: We can get back to some of these points as needed, but I think our main thread is with your other comment, and I'll resist the urge to start a long tangent about the metaphysics of being "simulated" vs. "imagined".
Response to "What does the universal prior actually look like?"

I think the original post was pretty unclear (I was even more confusing 5 years ago than I am now) and it's probably worthwhile turning it into a more concrete/vivid scenario. Hopefully that will make it easier to talk about any remaining disagreements, and also will make the original post clearer to other folks (I think most people bounce off of it in its current form).

To make things more vivid I'll try to describe what the world might look like from our perspective if we came to believe that we were living inside the imagination of someone thinking about th...

1 michaelcohen 1mo: I feel like this story has run aground on an impossibility result. If a random variable’s value is unknowable (but its distribution is known) and an intelligent agent wants to act on its value, and they randomize their actions, the expected log probability of them acting on the true value cannot exceed the negative entropy of the distribution, no matter their intelligence. (And if they’re wrong about the r.v.’s distribution, they do even worse).
But let’s assume they are correct. They know there are, say, 80 output instructions (two work tapes and one input tape, and a binary alphabet, and 10 computation states). And each one has a 1/3 chance of being “write 0 and move”, “write 1 and move”, or “do nothing”. Let’s assume they know the rules governing the other tape heads, and the identity of the computation states (up to permutation). Their belief distribution is (at best) uniform over these 3^80 possibilities. Is computation state 7 where most of the writing gets done? They just don’t know. It doesn’t matter if they’ve figured out that computation state 7 is responsible for the high-level organization of the work tapes. It’s totally independent.
Making beacons is like assuming that computation state 7, so important for the dynamics of their world, has anything special to do with the output behavior. (Because what is a beacon if not something that speaks to internally important computation states?)
That’s all going along with the premise that when consequentialists face uncertainty, they flip a coin, and adopt certainty based on the outcome. So if they think it’s 50/50 whether a 0 or a 1 gets output, they flip a coin or look at some tea leaves, and then act going forward as if they just learned the answer. Then, it only costs 1 bit to say they decided “0”. But I think getting consequentialists to behave this way requires an intervention into their proverbial prefrontal cortices.
If these consequentialists were playing a bandit game, and one arm gave a reward of 0.9 with certa
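
The counting in the comment above can be made concrete. This is a rough sketch, assuming the 80 output instructions arise as 2^3 combinations of symbols under three tape heads times 10 computation states (my reading of the parenthetical, not something the comment states outright), and reading the bound as Gibbs' inequality: for a uniform distribution over n possibilities, no guessing strategy achieves expected log2-probability above -log2(n). All names are hypothetical.

```python
import math
import random

# Assumed derivation of the "80 output instructions":
# 3 tape heads reading a binary alphabet, times 10 computation states.
NUM_INSTRUCTIONS = (2 ** 3) * 10
OPTIONS_PER_INSTRUCTION = 3  # write 0 and move, write 1 and move, do nothing

num_output_rules = OPTIONS_PER_INSTRUCTION ** NUM_INSTRUCTIONS
entropy_bits = NUM_INSTRUCTIONS * math.log2(OPTIONS_PER_INSTRUCTION)


def expected_log2_prob(truth, guess):
    """Expected log2-probability that `guess` assigns to the true value."""
    return sum(p * math.log2(q) for p, q in zip(truth, guess) if p > 0)


# Small-scale check of the bound with 3 instructions (27 rules) instead of 80.
n = OPTIONS_PER_INSTRUCTION ** 3
uniform = [1 / n] * n
rng = random.Random(0)
weights = [rng.random() + 1e-9 for _ in range(n)]
strategy = [w / sum(weights) for w in weights]  # an arbitrary guessing strategy

bound = -math.log2(n)
print(f"output rules: 3^{NUM_INSTRUCTIONS}, entropy = {entropy_bits:.1f} bits")
print("arbitrary strategy:", expected_log2_prob(uniform, strategy))
print("uniform strategy:  ", expected_log2_prob(uniform, uniform), "(bound:", bound, ")")
```

The sketch only illustrates the inequality itself; it takes no position on whether the uniform prior over output rules is the right model of the consequentialists' situation, which is the point in dispute.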
0 Signer 1mo: To clarify, sufficient observations would still falsify all "simulate simple physics, start reading from simple location" programs and eventually promote "simulate true physics, start reading from camera location"?
Knowledge Neurons in Pretrained Transformers

I'm inclined to be more skeptical of these results.

I agree that this paper demonstrates that it's possible to interfere with a small number of neurons in order to mess up retrieval of a particular fact (roughly 6 out of the 40k MLP neurons if I understand correctly), which definitely tells you something about what the model is doing.

But beyond that I think the inferences are dicier:

• Knowledge neurons don't seem to include all of the model's knowledge about a given question. Cutting them out only decreases the probability on the correct answer by 40%. This m
5 Evan Hubinger 1mo: Yeah, agreed, though I would still say that finding the first ~40% of where knowledge of a particular fact is stored counts as progress (though I'm not saying they have necessarily done that).
That's a good point. I didn't look super carefully at their number there, but I agree that looking more carefully it does seem rather large.
I also thought this was somewhat strange and am not sure what to make of it.
I was also surprised that they used individual neurons rather than NMF factors or something, though the fact that it still worked while just using the neuron basis seems like more evidence that the effect is real rather than less.
Perhaps I'm too trusting. I agree that everything you're describing seems possible given just the evidence in the paper. All of this is testable, though, and suggests obvious future directions that seem worth exploring.
Agency in Conway’s Game of Life

I guess the important part of the Hamiltonian construction may be just having the next state depend on x(t) and x(t-1) (apparently those are called second-order cellular automata). Once you do that it's relatively easy to make them reversible: you just need the dependence of x(t+1) on x(t-1) to be a permutation. But I don't know whether using finite differences for the Hamiltonian will easily give you conservation of momentum + energy in the same way that it would with derivatives.
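
To illustrate the reversibility point, here is a minimal sketch of a one-dimensional second-order cellular automaton on a ring. The local rule (Rule 30's update) is an arbitrary choice for illustration, and all function names are hypothetical. The next state is the rule applied to x(t), XORed with x(t-1); since XOR with a fixed bit is a permutation of {0, 1}, the step can be undone exactly. The sketch says nothing about conservation of energy or momentum.

```python
import random


def local_rule(left, center, right):
    """Any function of the current neighbourhood works here;
    this one is Rule 30's update, chosen arbitrarily."""
    return left ^ (center | right)


def neighbourhood_update(state):
    """Apply the local rule to every cell of a ring-shaped state."""
    n = len(state)
    return [local_rule(state[i - 1], state[i], state[(i + 1) % n])
            for i in range(n)]


def step_forward(prev, curr):
    """x(t+1) = f(x(t)) XOR x(t-1); returns the pair (x(t), x(t+1))."""
    nxt = [f ^ p for f, p in zip(neighbourhood_update(curr), prev)]
    return curr, nxt


def step_backward(curr, nxt):
    """Invert step_forward: x(t-1) = f(x(t)) XOR x(t+1)."""
    prev = [f ^ x for f, x in zip(neighbourhood_update(curr), nxt)]
    return prev, curr


rng = random.Random(0)
start = ([rng.randint(0, 1) for _ in range(32)],
         [rng.randint(0, 1) for _ in range(32)])

state = start
for _ in range(100):
    state = step_forward(*state)
for _ in range(100):
    state = step_backward(*state)

print("recovered initial condition:", state == start)
```

Running the automaton forward and then backward for the same number of steps returns the initial pair of states, whatever local rule is plugged in.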

Agency in Conway’s Game of Life

I feel like they must exist (and there may not be that many simple nice ones). I expect someone who knows more physics could design them more easily.

My best guess would be to get both properties by defining the system via some kind of discrete Hamiltonian. I don't know how that works, i.e. if there is a way of making the Hamiltonian discrete (in time and in values of the CA) that still gives you both properties and is generally nice. I would guess there is and that people have written papers about it. But it also seems like that could easily fail in one wa...

4 Paul Christiano 1mo: I guess the important part of the Hamiltonian construction may be just having the next state depend on x(t) and x(t-1) (apparently those are called second-order cellular automata [https://en.wikipedia.org/wiki/Second-order_cellular_automaton]). Once you do that it's relatively easy to make them reversible: you just need the dependence of x(t+1) on x(t-1) to be a permutation. But I don't know whether using finite differences for the Hamiltonian will easily give you conservation of momentum + energy in the same way that it would with derivatives.
Agency in Conway’s Game of Life

It seems like our physics has a few fundamental characteristics that change the flavor of the question:

• Reversibility. This implies that the task must be impossible on average---you can only succeed under some assumption about the environment (e.g. sparsity).
• Conservation of energy/mass/momentum (which seem fundamental to the way we build and defend structures in our world).

I think this is an interesting question, but if you're poking around it would probably be nicer to work with simple rules that share (at least) these features of physics.

2 Alex Flint 1mo: Yeah I agree. There was a bit of discussion re conservation of energy here [https://www.lesswrong.com/posts/3SG4WbNPoP8fsuZgs/agency-in-conway-s-game-of-life?commentId=uiDy3TnHdrcpdCM4r] too. I do like thought experiments in cellular automata because of the spatially localized nature of the transition function, which matches our physics. Do you have any suggestions for automata that also have reversibility and conservation of energy?
Low-stakes alignment

I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward is bigger than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.

This certainly raises a lot of questions though---what form do these states take? How do I specify a reward function that takes as input a state of the world?

I agree that "actually trying" is still hard to define, though you could avoid that messiness by saying that the

2 Rohin Shah 2mo: Yeah, all of that seems right to me (and I feel like I have a better understanding of why assumptions on inputs are better than assumptions on outputs, which was more like a vague intuition before). I've changed the opinion to:
Low-stakes alignment

Sounds good/accurate.

It seems like there are other ways to get similarly clean subproblems, like "assume that the AI system is trying to optimize the true reward function".

My problem with this formulation is that it's unclear what "assume that the AI system is trying to optimize the true reward function" means---e.g. what happens when there are multiple reasonable generalizations of the reward function from the training distribution to a novel input?

I guess the natural definition is that we actually give the algorithm designer a separate channel to specify...

2 Rohin Shah 2mo: I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward is bigger than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.
It feels like the low-stakes setting is also mostly assuming away embeddedness problems? I suppose it still includes e.g. cases where the AI system subtly changes the designer's preferences over the course of training, but it excludes e.g. direct modification of the reward, taking over the training process, etc.
I agree that "actually trying" is still hard to define, though you could avoid that messiness by saying that the goal is to provide a reward such that any optimal policy for that reward would be beneficial / aligned (and then the assumption is that a policy that is "actually trying" to pursue the objective would not do as well as the optimal policy but would not be catastrophically bad).
Just to reiterate, I agree that the low-stakes formulation is better; I just think that my reasons for believing that are different from "it's a clean subproblem". My reason for liking it is that it doesn't require you to specify a perfect reward function upfront, only a reward function that is "good enough", i.e. it incentivizes the right behavior on the examples on which the agent is actually trained. (There might be other reasons too that I'm failing to think of now.)
AMA: Paul Christiano, alignment researcher

I think most people have expectations regarding e.g. how explicitly will systems represent their preferences, how much will they have preferences, how will that relate to optimization objectives used in ML training, how well will they be understood by humans, etc.

Then there's a bunch of different things you might want: articulations of particular views on some of those questions, stories that (in virtue of being concrete) show a whole set of guesses and how they can lead to a bad or good outcome, etc. My bullet points were mostly regarding the exercise of fleshing out a particular story (which is therefore most likely to be wrong), rather than e.g. thinking about particular questions about the future.

AMA: Paul Christiano, alignment researcher

Don't read too much into it. I do dislike Boston weather.

AMA: Paul Christiano, alignment researcher

On that perspective I guess by default I'd think of a threat as something like "This particular team of hackers with this particular motive" and a threat model as something like "Maybe they have one or two zero days, their goal is DoS or exfiltrating information, they may have an internal collaborator but not one with admin privileges..." And then the number of possible threat models is vast even compared to the vast space of threats.

AMA: Paul Christiano, alignment researcher

I mostly don't think this thing is a major issue. I'm not exactly sure where I disagree, but some possibilities:

• H isn't some human isolated from the world, it's an actual process we are implementing (analogous to the current workflow involving external contractors, lots of discussion about the labeling process and what values it might reflect, discussions between contractors and people who are structuring the model, discussions about cases where people disagree)
• I don't think H is really generalizing OOD, you are actually collecting human data on the kinds
2 Joe_Collman 2mo: Thanks, that's very helpful. It still feels to me like there's a significant issue here, but I need to think more. At present I'm too confused to get much beyond handwaving. A few immediate thoughts (mainly for clarification; not sure anything here merits response):
• I had been thinking too much lately of [isolated human] rather than [human process].
• I agree the issue I want to point to isn't precisely OOD generalisation. Rather it's that the training data won't be representative of the thing you'd like the system to learn: you want to convey X, and you actually convey [output of human process aiming to convey X]. I'm worried not about bias in the communication of X, but about properties of the generating process that can be inferred from the patterns of that bias.
• It does seem hard to ensure you don't end up OOD in a significant sense. E.g. if the content of a post-deployment question can sometimes be used to infer information about the questioner's resource levels or motives.
• The opportunity costs I was thinking about were in altruistic terms: where H has huge computational resources, or the questioner has huge resources to act in the world, [the most beneficial information H can provide] would often be better for the world than [good direct answer to the question]. More [persuasion by ML] than [extortion by ML].
• If (part of) H would ever ideally like to use resources to output [beneficial information], but gives direct answers in order not to get thrown off the project, then (part of) H is deceptively aligned. Learning from a (partially) deceptively aligned process seems unsafe.
• W.r.t. H's making value calls, my worry isn't that they're asked to make value calls, but that every decision is an implicit value call (when you can respond with free text, at least).
I'm going to try writing up the core of my worry in more precise terms. It's still very possible that any non-tri
AMA: Paul Christiano, alignment researcher

No idea other than playing a bunch of games (might as well play the current version; old dailies are probably best) and maybe looking at solutions when you get stuck. Might also just run through a bunch of games and highlight the main important interactions and themes for each of them, e.g. Innovation + Public Works + Reverberate or Hatchery + Till. I think on any given board (and for the game in general) it's best to work backwards from win conditions, then midgames, and then openings.

AMA: Paul Christiano, alignment researcher

We'll do the cost-benefit analysis and over time it will look like a good career for a smaller and smaller fraction of people (until eventually basically everyone for whom it looks like a good idea is already doing it).

That could kind of qualitatively look like "something else is more important," or "things kind of seem under control and it's getting crowded," or "there's no longer enough money to fund scaleup." Of those, I expect "something else is more important" to be the first to go (though it depends a bit on how broadly you interpret "from AI," if an...

AMA: Paul Christiano, alignment researcher

I've created 3 blogs in the last 10 years and 1 blog in the preceding 5 years. It seems like 1-2 is a good guess. (A lot depends on whether there ends up being an ARC blog or it just inherits ai-alignment.com)

AMA: Paul Christiano, alignment researcher

I mostly found myself agreeing more with Robin, in that e.g. I believe previous technical change is mostly a good reference class, and that Eliezer's AI-specific arguments are mostly kind of weak. (I liked the image, I think from that debate, of a blacksmith emerging into the town square with his mighty industry and making all bow before him.)

That said, I think Robin's quantitative estimates/forecasts are pretty off and usually not very justified, and I think he puts too much stock on an outside view extrapolation from past transitions rather than looking at t...

3 Rob Bensinger 2mo: Source for the blacksmith analogy: I Still Don't Get Foom [https://www.overcomingbias.com/2014/07/30855.html]
2 Ben Pace 2mo: Noted. (If both parties are interested in that debate I’m more than happy to organize it in whatever medium and do any work like record+transcripts or book an in-person event space.)
AMA: Paul Christiano, alignment researcher

Dunno, would be nice to figure out how useful this AMA was for other people. My guess is that they should at some rate/scale (in combination with other approaches like going on a podcast or writing papers or writing informal blog posts), and the question is how much communication like that to do in an absolute sense and how much should be AMAs vs other things.

Maybe I'd guess that typically like 1% of public communication should be something like an AMA, and that something like 5-10% of researcher time should be public communication (though as mentioned in ...

AMA: Paul Christiano, alignment researcher

Change is slow and hard and usually driven by organic changes rather than clever ideas, and I expect it to be the same here.

In terms of why the idea is actually just not that big a deal, I think the big thing is that altruistic projects often do benefit hugely from not needing to do explicit credit attribution. So that's a real cost. (It's also a cost for for-profit businesses, leading to lots of acrimony and bargaining losses.)

They also aren't quite consistent with moral public goods / donation-matching, which might be handled better by a messy status quo...

AMA: Paul Christiano, alignment researcher

The boxes at the top haven't really changed. The boxes at the bottom never felt that great, but it still seems like a fine way for them to be---I expect they would change if I did it again but I wouldn't feel any better about the change than I did about the initial or final version.

AMA: Paul Christiano, alignment researcher

More true beliefs (including especially about large numbers of messy details rather than a few central claims that can receive a lot of attention).