Your summaries are excellent Rohin. This looks good to me.

I think that I should modify 5a from "Search for a proof that this sentence is consistent with your model of the world, up to a maximum proof length of one million characters" to "Search for a proof of this sentence, using your model of the world as a set of starting assumptions". This is indeed a significant change to the algorithm and I thank you for pointing it out. I think that would resolve your second concern, about the problem with 5a itself, yes?

I think it might also resolve your first concern, about the unsoundness of the logical system, because th...

Firstly, the agent's logical deduction system is unsound. It includes something comparable with Peano arithmetic (or else Löb's theorem can't be applied), and then adds a deduction rule "if P can be proved consistent with the system-so-far then assume P is true". But we already know that for any consistent extension T of Peano arithmetic there is at least one proposition G for which both G and ~G are consistent with T. So both of these are deducible using the rule. Now, the agent might not find the contradiction, because...

Are you referring to step 5a/5...

0 JBlack 23d: Yes, I'm referring to 5a/5b in conjunction with self-reflection as a deduction rule, since the agent is described as being able to use these to derive new propositions. There is also a serious problem with 5a itself: the agent needs to try to prove that some new proposition P is "consistent with its model of the world". That is, for its existing axioms and derivation rules T, prove that T+P does not derive a contradiction. If T is consistent, then this is impossible by Gödel's second incompleteness theorem. Hence for any P, Step 5a will always exhaust all possible proofs of up to whatever bound and return without finding such a proof. Otherwise T is inconsistent and it may be able to find such a proof, but it is also obvious that its proofs can't be trusted and not at all surprising that it will take the wrong route.
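The Gödel argument in the comment above can be written out as a short derivation. This is only a sketch under assumptions that the thread does not state explicitly: T is taken to be a consistent, recursively axiomatizable extension of Peano arithmetic, and "P is consistent with the model" is formalized as the arithmetized sentence Con(T+P).

```latex
\begin{align*}
&\text{(1)}\quad \mathrm{PA} \vdash \mathrm{Con}(T + P) \rightarrow \mathrm{Con}(T)
  && \text{a $T$-proof of $\bot$ is also a $(T+P)$-proof of $\bot$} \\
&\text{(2)}\quad T \vdash \mathrm{Con}(T + P) \;\Longrightarrow\; T \vdash \mathrm{Con}(T)
  && \text{from (1), since $T$ extends $\mathrm{PA}$} \\
&\text{(3)}\quad T \nvdash \mathrm{Con}(T)
  && \text{G\"odel's second incompleteness theorem, $T$ consistent} \\
&\text{(4)}\quad T \nvdash \mathrm{Con}(T + P) \ \text{for every sentence } P
  && \text{from (2) and (3)}
\end{align*}
```

Under these assumptions, a bounded search for a proof of Con(T+P) returns empty-handed for every P, whatever the bound.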
It's actually written, just need to edit and post. Should be very soon. Thanks for checking on it.

Yep, agreed.

This is still writing that I drafted before we chatted a couple of weeks ago btw. I have some new ideas based on the things we chatted about that I hope to write up soon :)

Well here is a thought: a random string would have high Kolmogorov complexity, as would a string describing the most fundamental laws of physics. What are the characteristics of the latter that conveys power over one's environment to an agent that receives it, that is not conveyed by the former? This is the core question I'm most interested in at the moment.
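Kolmogorov complexity is uncomputable, but compressed length gives a crude upper-bound proxy for it. The sketch below illustrates the first half of the thought above: random bytes barely compress at all. The choice of `zlib`, the 4096-byte length, and the patterned comparison string are all illustrative assumptions. The proxy says nothing about which strings are useful to an agent, which is exactly the open question.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length of the zlib-compressed data: a rough upper-bound proxy for complexity."""
    return len(zlib.compress(data, 9))


# Deterministic pseudo-random bytes, so the example is repeatable.
rng = random.Random(0)
random_bytes = bytes(rng.getrandbits(8) for _ in range(4096))

# A highly regular string of the same length, for contrast.
patterned_bytes = b"01234567" * 512

random_size = compressed_length(random_bytes)
patterned_size = compressed_length(patterned_bytes)

print("random:   ", len(random_bytes), "->", random_size)
print("patterned:", len(patterned_bytes), "->", patterned_size)
```

Both strings are just bytes to the compressor, so a measure of this kind cannot by itself separate a random string from one that encodes deep physical law.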

Well yes I agree that knowledge exists with respect to a goal, but is there really no objective difference between an alien artifact inscribed with deep facts about the structure of the universe and set up in such a way that it can be decoded by any intelligent species that might find it, and an ordinary chunk of rock arriving from outer space?

1 Chris_Leong 2mo: Well, taking the simpler case of exactly reproducing a certain string, you could find the simplest program that produces the string, similar to Kolmogorov complexity, and use that as a measure of complexity. A slightly more useful way of modelling things may be to have a bunch of different strings with different points representing levels of importance. And perhaps we produce a metric combining the Kolmogorov complexity of a decoder with the sum of the points produced, where points are obtained by concatenating the desired strings with a predefined separator. For example, we might find the quotient. One immediate issue with this is that some of the strings may contain overlapping information. And we'd still have to produce a metric to assign importances to the strings. Perhaps a simpler case would be where the strings represent patterns in a stream via encoding a Turing machine, with the Turing machines being able to output sets of symbols instead of just symbols, representing the possible symbols at each location. And the amount of points they provide would be equal to how much of the stream it allows you to predict. (This would still require producing a representation of the universe where the amount of the stream predicted would be roughly equivalent to how useful the predictions are). Any thoughts on this general approach?
Thank you for this comment duck_master.

I take your point that it is possible to extract knowledge about human affairs, and about many other things, from the quantum structure of a rock that has been orbiting the Earth. However, I am interested in a definition of knowledge that allows me to say what a given AI does or does not know, insofar as it has the capacity to act on this knowledge. For example, I would like to know whether my robot vacuum has acquired sophisticated knowledge of human psychology, since if it has, and I wasn't expecting it to, then I m...

Thank you for this thoughtful comment itaibn0.

Matter and energy are also approximately homogeneously distributed in our own physical universe, yet building a small device that expands its influence over time and eventually rearranges the cosmos into a non-trivial pattern would seem to require something like an AI.

It might be that the same feat can be accomplished in Life using a pattern that is quite unintelligent. In that case I am very interested in what it is about our own physical universe that makes it different in this respect from Life.

Now it could ...

Thank you for the kind words Jemist.

Yeah I'm open to improvements upon the use of the word "knowledge" because you're right that what I'm describing here isn't quite what either philosophers or cognitive scientists refer to as knowledge.

Yes knowledge-accumulating systems do seem to be a special case of optimizing systems. It may be that among all optimizing systems, it is precisely the ones that accumulate knowledge in the process of optimization that are of most interest to us from an alignment perspective, because knowledge-accumulating optimizing systems are (perhaps) the most powerful of all optimizing systems.

Dang, the images in this post are totally off. I have a script that converts a google doc to markdown, then I proofread the markdown, but the images don't show up in the editor, and it looks like my script is off. Will fix tomorrow.

Update: fixed

Yeah I had the sense that the project could have been intended as a compression mechanism since compressing in terms of CA rules kind of captures the spatial nature of image information quite well.

2 Daniel Kokotajlo 2mo: I wonder if there are some sorts of images that are really hard to compress via this particular method. I wonder if you can achieve massive reliable compression if you aren't trying to target a specific image but rather something in a general category. For example, maybe this specific lizard image requires a CA rule filesize larger than the image to express, but in the space of all possible lizard images there are some nice looking lizards that are super compressible via this CA method. Perhaps using something like DALL-E we could search this space efficiently and find such an image.

Yes - I found that work totally wild. Yes they are setting up a cellular automaton in such a way that it evolves towards and then fixates at a target state, but iirc what they are optimizing over is the rules of the automaton itself, rather than over a construction within the automaton.

2 Daniel Kokotajlo 2mo: Wow, that's cool! Any idea how complex (how large the filesize) the learned CA's rules were? I wonder how it compares to the filesize of the target image. Many orders of magnitude bigger? Just one? Could it even be... smaller?
Yeah this seems right to me.

Thank you for all the summarization work you do, Rohin.

Somewhat ironically, some of these failures from thinking of oneself or others as agents causes a lack of agency! Maybe this is just a trick of language, but here's what I have in mind from thinking about some of the pitfalls:

Yeah right, I agree with those three bullet points very much. Could also say "thinking of oneself or others as Cartesian agents causes a lack of power". Does agency=power? I'm not sure what the appropriate words are but I agree with your point.

I see a certain amount of connection here to, for example, developmental psychology, wh

...
Yeah I agree. There was a bit of discussion re conservation of energy here too. I do like thought experiments in cellular automata because of the spatially localized nature of the transition function, which matches our physics. Do you have any suggestions for automata that also have reversibility and conservation of energy?

4 Paul Christiano 3mo: I feel like they must exist (and there may not be that many simple nice ones). I expect someone who knows more physics could design them more easily. My best guess would be to get both properties by defining the system via some kind of discrete Hamiltonian. I don't know how that works, i.e. if there is a way of making the Hamiltonian discrete (in time and in values of the CA) that still gives you both properties and is generally nice. I would guess there is and that people have written papers about it. But it also seems like that could easily fail in one way or another. It's surprisingly non-trivial to find that by googling though I didn't try very hard. May look a bit more tonight (or think about it a bit since it seems fun). Finding a suitable replacement for the game of life that has good conservation laws + reversibility (while still having a similar level of richness) would be nice.
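For the reversibility half of the question, one standard construction is a second-order automaton, where the next row is the local rule applied to the current row, XORed with the previous row. The sketch below is an illustration only: the 1D periodic grid and the neighbour-XOR rule are arbitrary choices, and it makes no claim about conservation of energy or Life-like richness. It is not the discrete-Hamiltonian approach suggested above.

```python
from typing import List, Tuple

Row = List[int]


def local_rule(cells: Row) -> Row:
    """Illustrative local rule: XOR of the two neighbours, with periodic boundaries."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ cells[(i + 1) % n] for i in range(n)]


def step_forward(prev: Row, cur: Row) -> Tuple[Row, Row]:
    """Map the pair (row t-1, row t) to (row t, row t+1)."""
    nxt = [f ^ p for f, p in zip(local_rule(cur), prev)]
    return cur, nxt


def step_backward(cur: Row, nxt: Row) -> Tuple[Row, Row]:
    """Exact inverse of step_forward: map (row t, row t+1) to (row t-1, row t)."""
    prev = [f ^ x for f, x in zip(local_rule(cur), nxt)]
    return prev, cur


def run(prev: Row, cur: Row, steps: int) -> Tuple[Row, Row]:
    """Apply step_forward repeatedly."""
    for _ in range(steps):
        prev, cur = step_forward(prev, cur)
    return prev, cur


def rewind(a: Row, b: Row, steps: int) -> Tuple[Row, Row]:
    """Apply step_backward repeatedly."""
    for _ in range(steps):
        a, b = step_backward(a, b)
    return a, b


start = ([0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0, 0])
later = run(start[0], start[1], 50)
recovered = rewind(later[0], later[1], 50)
print(recovered == start)
```

The inverse exists for any choice of `local_rule`, because XORing with the same row twice cancels. The state of the system is the pair of rows, not a single row.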
Wow, thank you Daniel, this is an incredibly helpful list!

But when you indicate in your comment below that you see the "AI hypothesis" and the "life hypothesis" as very similar, then that mainly seems to indicate that you're using a highly nonstandard definition of AI, which I expect will lead to confusion.

Well surely if I built a robot that was able to gather resources and reproduce itself as effectively as either a bacterium or a tree, I would be entirely justified in calling it an "AI". I would certainly have no problem using that terminology for such a construction at any mainstream robotics conference, ev...

Well in case it's relevant here, I actually almost wrote "the AI hypothesis" as "the life hypothesis" and phrased it as

Any pattern of physics that eventually exerts control over a region much larger than its initial configuration does so by means of perception, cognition, and action that are recognizably life-like.

Perhaps in this form it's too vague (what does "life-like" mean?) or too circular (we could just define life-like as having an outsized physical impact).

But in whatever way we phrase it, there is very much a substantial hypothesis under the h...

Well yes, I do think that trees and bacteria exhibit this phenomenon of starting out small and growing in impact. The scope of their impact is limited in our universe by the spatial separation between planets, and by the presence of even more powerful world-reshapers in their vicinity, such as humans. But on this view of "which entities are reshaping the whole cosmos around here?", I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs. I do think there is a fundamental difference in kind betwee...

7 Richard Ngo 3mo: There's at least one important difference: some of these are intelligent, and some of these aren't. It does seem plausible that the category boundary you're describing is an interesting one. But when you indicate in your comment below that you see the "AI hypothesis" and the "life hypothesis" as very similar, then that mainly seems to indicate that you're using a highly nonstandard definition of AI, which I expect will lead to confusion.
Romeo if you have time, would you say more about the connection between orthogonality and Life / the control question / the AI hypothesis? It seems related to me but I just can't quite put my finger on exactly what the connection is.

Yeah absolutely - see third bullet in the appendix. One way to resolve this would be to say that to succeed at answering the control question you have to succeed in at least 1% of randomly chosen environments.

You're talking about how we ground out our thinking in something that is true but is not just further conceptualization?

Look if we just make a choice about the truth by making an assumption then eventually the world really does "bite back". It's possible to try this out by just picking a certain fundamental orientation towards the world and sticking to it no matter what throughout your life for a little while. The more rigidly you adhere to it the more quickly the world will bite back. So I don't think we can just pick a grounding.

But at the same time I ve...

1 G Gordon Worley III 3mo: Yep, that accords well with my own current view.
Well ok, agreed, but even if we were Cartesian, we would still have questions about what is the right way to link up our machines with this place where agentiness is coming from, how we discern whether we are in fact Cartesian or embedded, and so on down to the problem of the criterion as you described it.

One common response to any such difficult philosophical problems seems to be to just build AI that uses some form of indirect normativity such as CEV or HCH or AI debate to work out what wise humans would do about those philosophical problems. But I don't think it's so easy to sidestep the problem of the criterion.

1 G Gordon Worley III 3mo: Oh, I don't think those things exactly sidestep the problem of the criterion so much as commit to a response to it without necessarily realizing that's what they're doing. All of them sort of punt on it by saying "let humans figure out that part", which at the end of the day is what any solution is going to do because we're the ones trying to build the AI and making the decisions, but we can be more or less deliberate about how we do this part.
That post was a delightful read! Thanks for the pointer.

It seems that we cannot ever find, among concepts, a firm foundation on which we can be absolutely sure of our footing. For the same reason, our basic systems of logic, ethics, and empiricism can never be put on absolutely sure footing (Gödel, Humean is/ought gap, radical skepticism).

1 G Gordon Worley III 3mo: Right. For example, I think Stuart Armstrong is hitting something very important about AI alignment with his pursuit of the idea that there's no free lunch in value learning. We only close the gap by making an "arbitrary" assumption, but it's only arbitrary if you assume there's some kind of context-free version of the truth. Instead we can choose in a non-arbitrary way based on what we care about and is useful to us. I realize lots of people are bored by this point because their non-arbitrary solution that is useful is some version of rationality criteria since those are very useful for not getting Dutch booked, for example, but we could just as well choose something else and humans, for example, seem to do just that, even though so far we'd be hard pressed to very precisely say just what it is that humans do assume to ground things in, although we have some clues of things that seem important, like staying alive.

Interesting. Is it that if we were Cartesian, you'd expect to be able to look at the agent-outside-the-world to find answers to questions about what even is the right way to go about building AI?

1 G Gordon Worley III 3mo: Not really. If we were Cartesian, in order to fit the way we find the world, it seems to be that it'd have to be that agentiness is created outside the observable universe, possibly somewhere hypercomputation is possible, which might only admit an answer about how to build AI that would look roughly like "put a soul in it", i.e. link it up to this other place where agentiness is coming from. Although I guess if the world really looked like that maybe the way to do the "soul linkage" part would be visible, but it's not so seems unlikely.
Very very cool. Thank you for this drocta. What would it take to map out the sizes of the volumes corresponding to each of these mappings? Also, could you perhaps compute the exact Kolmogorov complexity of these mappings in some particular description language, since they are so small? It would be super interesting to me to assemble a table of volumes and Kolmogorov complexities for each of these small mappings. It may then be possible to write some code that does the same for 3-input and 4-input mappings.
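A rough sketch of the table suggested above, under loudly stated assumptions: the network (two threshold hidden units), the uniform weight distribution on [-1, 1], and the sample count are all illustrative choices rather than anything taken from the original post. Exact Kolmogorov complexity is uncomputable in general, so the second column uses a computable stand-in: the size of the shortest formula over `{x, y, 0, 1, NOT, AND, OR}`.

```python
import random
from collections import Counter
from typing import Dict, Tuple

Table = Tuple[int, ...]  # outputs on inputs (0,0), (0,1), (1,0), (1,1)
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def threshold(z: float) -> int:
    return 1 if z > 0 else 0


def sample_mapping(rng: random.Random, hidden: int = 2) -> Table:
    """Draw random parameters and return the truth table the network computes."""
    w1 = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = rng.uniform(-1, 1)
    outputs = []
    for x, y in INPUTS:
        h = [threshold(wx * x + wy * y + b) for (wx, wy), b in zip(w1, b1)]
        outputs.append(threshold(sum(v * hi for v, hi in zip(w2, h)) + b2))
    return tuple(outputs)


def estimate_volumes(samples: int = 20000, seed: int = 0) -> Dict[Table, float]:
    """Monte Carlo estimate of the fraction of parameter space per mapping."""
    rng = random.Random(seed)
    counts = Counter(sample_mapping(rng) for _ in range(samples))
    return {table: count / samples for table, count in counts.items()}


def formula_sizes() -> Dict[Table, int]:
    """Shortest formula size (node count) for each 2-input mapping, by relaxation."""
    size = {(0, 0, 1, 1): 1, (0, 1, 0, 1): 1, (0, 0, 0, 0): 1, (1, 1, 1, 1): 1}
    changed = True
    while changed:
        changed = False
        known = list(size.items())
        candidates = []
        for f, cost_f in known:
            candidates.append((tuple(1 - a for a in f), cost_f + 1))
            for g, cost_g in known:
                cost = cost_f + cost_g + 1
                candidates.append((tuple(a & b for a, b in zip(f, g)), cost))
                candidates.append((tuple(a | b for a, b in zip(f, g)), cost))
        for table, cost in candidates:
            if cost < size.get(table, float("inf")):
                size[table] = cost
                changed = True
    return size


volumes = estimate_volumes()
sizes = formula_sizes()
for table in sorted(sizes, key=lambda t: -volumes.get(t, 0.0)):
    print(table, f"volume={volumes.get(table, 0.0):.4f}", f"formula_size={sizes[table]}")
```

The same two functions would extend to 3-input and 4-input mappings by enlarging `INPUTS` and the base cases, though the formula search grows quickly with the number of truth tables.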

Thanks for these pointers.

but large volume-->simple is what is proven in these papers(plus some empirical evidence of unclear import)

Is that the empirical evidence attempts to demonstrate simple --> large volume but is inconclusive, or is it that the empirical evidence does not even attempt to demonstrate simple --> large volume?

The evidence is empirical performance of Gaussian processes being similar to neural nets on simple tasks.

Well they do take many samples from what they call P_SGD and P_B and compare these as distributions, so it se...

Thanks for this clarification John.

In a sense, the results are more interesting in this light, since they tell us something about which specific ways of compressing things are relevant to our particular world

But did Mingard et al show that there is some specific practical complexity measure that explains the size of the volumes occupied in parameter space better than alternative practical complexity measures? If so then I think we would have uncovered an even more detailed understanding of which mappings occupy large volumes in parameter space, and since...

2 johnswentworth 3mo: Yeah, I don't think there was anything particularly special about the complexity measure they used, and I wouldn't be surprised if some other measures did as-well-or-better at predicting which functions fill large chunks of parameter space.
Yeah right, that is scarier. Looking forward to reading your argument, esp re why we would expect deceptive agents that score well to outnumber aligned agents that score well.

Although in the same sense we could say that a rock “contains” many deceptive agents, since if we viewed the rock as a giant mixture of computations then we would surely find some that implement deceptive agents.

Thank you for the pointer. Why is the tangent space hypothesis version of the LTH scarier?

4 Daniel Kokotajlo 3mo: Well, it seems to be saying that the training process basically just throws away all the tickets that score less than perfectly, and randomly selects one of the rest. This means that tickets which are deceptive agents and whatnot are in there from the beginning, and if they score well, then they have as much chance of being selected at the end as anything else that scores well. And since we should expect deceptive agents that score well to outnumber aligned agents that score well... we should expect deception. I'm working on a much more fleshed out and expanded version of this argument right now.
Yeah, I agree, logical induction bakes in the concept of time in a way that probability theory does not. And yeah, it does seem necessary, and I find it very interesting when I squint at it.

But the question is whether values are even the right way to go about this problem. That's the kind of information we're seeking: information about how even to go about being beneficial, and what beneficial really means. Does it really make sense to model a rainforest as an agent and back out a value function for it? If we did that, would it work out in a way that we could look back on and be glad about? Perhaps it would, perhaps it wouldn't, but the hard problem of AI safety is this question of what even is the right frame to start thinking about this in,...

Yes, I agree.

I once stayed in Andrew Critch's room for a few weeks while he was out of town. I felt that I was learning from him in his absence because he had all these systems and tools and ways that things were organized. I described it at the time as "living inside Critch's brain for two weeks", which was a great experience. Thanks Critch!

Yes, I agree, it's difficult to find explicit and specific language for what it is that we would really like to align AI systems with. Thank you for the reply. I would love to read such a story!

Thank you for the kind words.

for example, being aware that human intentions can change -- it's not obvious that the right move is to 'pop out' further and assume there is something 'bigger' that the human's intentions should be aligned with. Could you elaborate on your vision of what you have in mind there?

Well it would definitely be a mistake to build an AI system that extracts human intentions at some fixed point in time and treats them as fixed forever, yes? So it seems to me that it would be better to build systems predicated on that which is the u...

2 Evan Hubinger 7mo: Np! Also, just going through the rest of the proposals in my 11 proposals paper, I'm realizing that a lot of the other proposals also try to avoid a full agency hand-off. STEM AI restricts the AI's agency to just STEM problems, narrow reward modeling restricts individual AIs to only apply their agency to narrow domains, and the amplification and debate proposals are trying to build corrigible question-answering systems rather than do a full agency hand-off.
Our choice is not between having humans run the world and having a benevolent god run the world.

Right, I agree that having a benevolent god run the world is not within our choice set.

Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).

Well just to re-state the suggestion in my original post: is this dichotomy between humans running the world or something else running the world really so inescapable? The child in the sand ...

2 Paul Christiano 7mo: I buy into the delegation framing, but I think that the best targets for delegation look more like "slightly older and wiser versions of ourselves with slightly more space" (who can themselves make decisions about whether to delegate to something more alien). In the sand-pit example, if the child opted into that arrangement then I would say they have effectively delegated to a version of themselves who is slightly constrained and shaped by the supervision of the adult. (But in the present situation, the most important thing is that the parent protects them from the outside world while they have time to grow.)
Thank you for this jbash.

Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world

My short response is: Yes, it would be very bad for present-day humanity to have more power than it currently does, since its current level of power is far out of proportion to its level of wisdom and compassion. But it seems to me that there are a small number of humans on this planet who have moved some way in the direction of being fit to run the world, and in time, more humans could move in this direction, and could mov...

2 Paul Christiano 7mo: If the humans in the container succeed in becoming wiser, then hopefully it is wiser for us to leave this decision up to them than to preemptively make it now (and so I think the situation is even better than it sounds superficially). It seems like the real thing up for debate will be about power struggles amongst humans---if we had just one human, then it seems to me like the grandparent's position would be straightforwardly incoherent. This includes, in particular, competing views about what kind of structure we should use to govern ourselves in the future.
I very much agree with these two:

On the other hand, there are lots of people who really do want to help, for the right reason. So if growth is the goal, helping these people out seems like just an obvious thing to do

So I think there is a lot of room for growth, by just helping the people who are already involved and trying.

Thank you for this thoughtful comment Linda -- writing this reply has helped me to clarify my own thinking on growth and depth. My basic sense is this:

If I meet someone who really wants to help out with AI safety, I want to help them to do that, basically without reservation, regardless of their skill, experience, etc. My sense is that we have a huge and growing challenge in navigating the development of advanced AI, and there is just no shortage of work to do, though it can at first be quite difficult to find. So when I meet individuals, I will try to ...

3 Linda Linsefors 7mo: Ok, that makes sense. Seems like we are mostly on the same page then. I don't have strong opinions on whether drawing in people via prestige is good or bad. I expect it is probably complicated. For example, there might be people who want to work on AI Safety for the right reason, but are too agreeable to do it unless it reaches some level of acceptability. So I don't know what the effects will be on net. But I think it is an effect we will have to handle, since prestige will be important for other reasons. On the other hand, there are lots of people who really do want to help, for the right reason. So if growth is the goal, helping these people out seems like just an obvious thing to do. I expect there are ways funders can help out here too. I would not update much on the fact that currently most research is produced by existing institutions. It is hard to do good research, and even harder without colleagues, salary and other support that comes with being part of an org. So I think there is a lot of room for growth, by just helping the people who are already involved and trying.
Ah this is helpful, thank you.

So let's say I'm estimating the position of a train on a straight section of track as a single real number and I want to do an update each time I receive a noisy measurement of the train's position. Under the theory you're laying out here I might have, say, three Gaussians N(0, 1), N(1, 10), N(4, 6), and rather than updating a single pdf over the position of the train, I'm updating measures associated with each of these three pdf. Is that roughly correct?
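The reading described in the question above, where each of the three Gaussians is updated separately and carries its own measure, can be sketched as ordinary per-component Bayesian updating. This is only an illustration of that reading, not an implementation of the infra-Bayesian update. The second parameter of each `N(., .)` is assumed to be a variance, and the measurement-noise variance, the measurement value, and the initial measure of 1.0 are made-up values.

```python
import math
from typing import List, Tuple

Component = Tuple[float, float, float]  # (measure, mean, variance)


def update_gaussian(mean: float, var: float, z: float, noise_var: float) -> Tuple[float, float]:
    """Conjugate update of a Gaussian prior N(mean, var) on a measurement z with Gaussian noise."""
    post_var = 1.0 / (1.0 / var + 1.0 / noise_var)
    post_mean = post_var * (mean / var + z / noise_var)
    return post_mean, post_var


def evidence(mean: float, var: float, z: float, noise_var: float) -> float:
    """Probability density of measurement z under the prior N(mean, var) plus noise."""
    total_var = var + noise_var
    return math.exp(-0.5 * (z - mean) ** 2 / total_var) / math.sqrt(2 * math.pi * total_var)


def update_components(components: List[Component], z: float, noise_var: float) -> List[Component]:
    """Update every component separately, scaling its measure by how well it predicted z."""
    updated = []
    for measure, mean, var in components:
        post_mean, post_var = update_gaussian(mean, var, z, noise_var)
        updated.append((measure * evidence(mean, var, z, noise_var), post_mean, post_var))
    return updated


# The three Gaussians from the question, each starting with measure 1.0 (an assumption).
train_beliefs = [(1.0, 0.0, 1.0), (1.0, 1.0, 10.0), (1.0, 4.0, 6.0)]
after_one_measurement = update_components(train_beliefs, z=2.0, noise_var=1.0)
for measure, mean, var in after_one_measurement:
    print(f"measure={measure:.4f} mean={mean:.3f} var={var:.3f}")
```

The measures are deliberately left unnormalized here, so that components which predicted the measurement poorly simply shrink.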

(I realize this isn't exactly a great example of how to use this theory s...

1 Vanessa Kosoy 6mo: I'm not sure I understood the question, but the infra-Bayesian update is not equivalent to updating every distribution in the convex set of distributions. In fact, updating a crisp infra-distribution (i.e. one that can be described as a convex set of distributions) in general produces an infra-distribution that is not crisp (i.e. you need sa-measures to describe it or use the Legendre dual view).
Thank you for your work both in developing this theory and putting together this heroic write-up! It's really a lot of work to write all this stuff out.

I am interested in understanding the thing you're driving at here, but I'm finding it difficult to navigate because I don't have much of a sense for where the definitions are heading towards. I'm really looking for an explanation of what exactly is made possible by this theory, so that as I digest each of the definitions I have a sense for where this is all heading.

My current understanding is that this is a...

3 Diffractor 7mo: So, first off, I should probably say that a lot of the formalism overhead involved in this post in particular feels like the sort of thing that will get a whole lot more elegant as we work more things out, but "Basic inframeasure theory" still looks pretty good at this point and worth reading, and the basic results (ability to translate from pseudocausal to causal, dynamic consistency, capturing most of UDT, definition of learning) will still hold up. Yes, your current understanding is correct, it's rebuilding probability theory in more generality to be suitable for RL in nonrealizable environments, and capturing a much broader range of decision-theoretic problems, as well as whatever spin-off applications may come from having the basic theory worked out, like our infradistribution logic stuff. It copes with unrealizability because its hypotheses are not probability distributions, but sets of probability distributions (actually more general than that, but it's a good mental starting point), corresponding to properties that reality may have, without fully specifying everything. In particular, if a class of belief functions (read: properties the environment may fulfill) is learned, this implies that for all properties within that class that the true environment fulfills (you don't know the true environment exactly), the infrabayes agent will match or exceed the expected utility lower bound that can be guaranteed if you know reality has that property (in the low-time-discount limit). There's another key consideration which Vanessa was telling me to put in which I'll post in another comment once I fully work it out again. Also, thank you for noticing that it took a lot of work to write all this up, the proofs took a while. n_n
Yeah so to be clear, I do actually think strategy research is pretty important, I just notice that in practice most of the strategy write-ups that I actually read do not actually enlighten me very much, whereas it's not so uncommon to read technical write-ups that seem to really move our understanding forward. I guess it's more that doing truly useful strategy research is just ultra difficult. I do think that, for example, some of Bostrom's and Yudkowsky's early strategy write-ups were ultra useful and important.

Nice post, very much the type of work I'd like to see more of.

Thank you!

I'm not sure I'd describe this work as "notorious", even if some have reservations about it.

Oops, terrible word choice on my part. I edited the article to say "gained attention" rather than "gained notoriety".

I think this is incorrect - for example, "biological systems are highly modular, at multiple different scales". And I expect deep learning to construct minds which are also fairly modular. That also allows search to be more useful, because it can make changes which are co

...
And thus the wheel of the Dharma was set in motion once again, for one more great turning of time

3 Adam Shimi 1y: If there was a vote for the best comment thread of 2020, that would probably be it for me.
Ah this is a different Ben.

1 Ben Pace 1y: Then I will prepare for combat.
I think this is a very good summary

2 Rohin Shah 1y: Thanks :)
