DanielFilan's Comments

Bottle Caps Aren't Optimisers

Daniel Filan's bottle cap example

Note that Abram Demski deserves a large part of the credit for that specific example (somewhere between 'half' and 'all'), as noted in the final sentence of the post.

Open question: are minimal circuits daemon-free?

This post formulated a concrete open problem about what are now called 'inner optimisers'. For me, it added 'surface area' to the concept of inner optimisers in a way that I think was healthy and important for their study. It also spurred research that resulted in this post giving a promising framework for a negative answer.

Rohin Shah on reasons for AI optimism


I don’t know that MIRI actually believes that what we need to do is write a bunch of proofs about our AI system, but it sure sounds like it, and that seems like a too difficult, and basically impossible task to me, if the proofs that we’re trying to write are about alignment or beneficialness or something like that.

FYI: My understanding of what MIRI (or at least Buck) thinks is that you don't need to prove your AI system is beneficial, but you should have a strong argument that stands up to strict scrutiny, and some of the sub-arguments will definitely have to be proofs.

RS Seems plausible, I think I feel similarly about that claim

Rohin Shah on reasons for AI optimism


I also don’t think there’s a discrete point at which you can say, “I’ve won the race.” I think it’s just like capabilities keep improving and you can have more capabilities than the other guy, but at no point can you say, “Now I have won the race.”

I think that (a) this isn't a disanalogy to nuclear arms races and (b) it's a sign of danger, since at no point do people feel free to slow down and test safety.

RS I’m confused by (a). Surely you “win” the nuclear arms race once you successfully make a nuke that can be dropped on another country?

(b) seems right, idr if I was arguing for safety or just arguing for disanalogies and wanting more research

DF re (a), if you have nukes that can be dropped on me, I can then make enough nukes to destroy all your nukes. So you make more nukes, so I make more nukes (because I'm worried about my nukes being destroyed) etc. This is historically how it played out, see mid-20th C discussion of the 'missile gap'.

re (b) fair enough

(it doesn't actually necessarily play out as clearly as I describe: maybe you get nuclear submarines, I get nuclear submarine detection skills...)

RS (a) Yes, after the first nukes are created, the remainder of the arms race is relatively similar. I was thinking of the race to create the first nuke. (Arguably the US should have used their advantage to prevent all further nukes.)

DF I guess it just seems more natural to me to think of one big long arms race, rather than a bunch of successive races - like, I think if you look at the actual history of nuclear armament, at no point before major powers have tons of nukes are they in a lull, not worrying about making more. But this might be an artefact of me mostly knowing about the US side, which I think was unusual in its nuke production and worrying.

RS Seems reasonable, I think which frame you take will depend on what you’re trying to argue, I don’t remember what I was trying to argue with that. My impression was that when people talk about the “nuclear arms race”, they were talking about the one leading to the creation of the bomb, but I’m not confident in that (and can’t think of any evidence for it right now)


My impression was that when people talk about the “nuclear arms race”, they were talking about the one leading to the creation of the bomb

ah, I did not have that impression. Makes sense.

Rohin Shah on reasons for AI optimism

(Looking back on this, I'm now confused why Rohin doesn't think mesa-optimisers would end up being approximately optimal for some objective/utility function.)

Rohin Shah on reasons for AI optimism


I think it would be… AGI would be a mesa optimizer or inner optimizer, whichever term you prefer. And that that inner optimizer will just sort of have a mishmash of all of these heuristics that point in a particular direction but can’t really be decomposed into ‘here are the objectives, and here is the intelligence’, in the same way that you can’t really decompose humans very well into ‘here are the objectives and here is the intelligence’.

... but it leads to not being as confident in the original arguments. It feels like this should be pushing in the direction of ‘it will be easier to correct or modify or change the AI system’. Many of the arguments for risk are ‘if you have a utility maximizer, it has all of these convergent instrumental sub-goals’ and, I don’t know, if I look at humans they kind of sort of pursued convergent instrumental sub-goals, but not really.

Huh, I see your point as cutting the opposite way. If you have a clean architectural separation between intelligence and goals, I can swap out the goals. But if you have a mish-mash, then for the same degree of vNM rationality (which maybe you think is unrealistic), it's harder to do anything like 'swap out the goals' or 'analyse the goals for trouble'.

in general, I think the original arguments are: (a) for a very wide range of objective functions, you can have agents that are very good at optimising them (b) convergent instrumental subgoals are scary

I think 'humans don't have scary convergent instrumental subgoals' is an argument against (b), but I don't think (a) or (b) rely on a clean architectural separation between intelligence and goals.

RS I agree both (a) and (b) don’t depend on an architectural separation. But you also need (c): agents that we build are optimizing some objective function, and I think my point cuts against that

DF somewhat. I think you have a remaining argument of 'if we want to do useful stuff, we will build things that optimise objective functions, since otherwise they randomly waste resources', but that's definitely got things to argue with.
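A toy sketch of the architectural point in this thread, under loud assumptions: the names `plan`, `mishmash_policy`, and the `ACTIONS` data are hypothetical illustrations, not anything from the interview. With a clean separation, the "intelligence" is a generic search and the goal is a parameter you can swap or inspect. With a mish-mash, the preferences are smeared across heuristics and there is no single object to swap out.

```python
from typing import Callable, Dict, List

Action = Dict[str, float]

# Hypothetical toy actions; the fields are made up for illustration.
ACTIONS: List[dict] = [
    {"name": "hoard", "reward": 9, "risk": 0.8, "effort": 3},
    {"name": "trade", "reward": 6, "risk": 0.2, "effort": 2},
    {"name": "rest", "reward": 1, "risk": 0.0, "effort": 0},
]


def plan(actions: List[dict], utility: Callable[[dict], float]) -> dict:
    """Clean separation: generic search, with the goal passed in as `utility`."""
    return max(actions, key=utility)


def mishmash_policy(actions: List[dict]) -> dict:
    """Mish-mash: behaviour comes from entangled heuristics.

    There is no explicit utility function here to swap out or analyse;
    changing what this policy "wants" means rewriting the heuristics.
    """
    # Heuristic 1: avoid risky options if any safer option exists.
    candidates = [a for a in actions if a["risk"] < 0.5] or list(actions)
    # Heuristic 2: prefer reward, break ties by lower effort.
    candidates.sort(key=lambda a: (-a["reward"], a["effort"]))
    return candidates[0]


# Swapping the goal is a one-argument change for the separated agent.
greedy_choice = plan(ACTIONS, lambda a: a["reward"])
cautious_choice = plan(ACTIONS, lambda a: -a["risk"])
heuristic_choice = mishmash_policy(ACTIONS)
```

This only illustrates the "swap out the goals" claim. It says nothing about whether systems we actually build would look like either function.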

Rohin Shah on reasons for AI optimism


A straw version of this, which isn’t exactly what I mean but sort of is the right intuition, would be like maybe if you run the same… What’s the input that maximizes the output of this neuron? You’ll see that this particular neuron is a deception classifier. It looks at the input and then based on something, does some computation with the input, maybe the input’s like a dialogue between two people and then this neuron is telling you, “Hey, is person A trying to deceive person B right now?” That’s an example of the sort of thing I am imagining.

Huh - plausible that I'm misunderstanding you, but I imagine this being insufficient for safety monitoring because (a) many non-deceptive AIs are going to have the concept of deception anyway, because it's useful, (b) statically you can't tell whether or not the network is going to aim for deception just from knowing that it has a representation of deception, and (c) you don't have a hope of monitoring it online to check if the deception neuron is lighting up when it's talking to you.

FWIW I believe in the negation of some version of my point (b), where some static analysis reveals some evaluation and planning model, and you find out that in some situations the agent prefers itself being deceptive, where of course this static analysis is significantly more sophisticated than current techniques

RS Yeah, I agree with all of these critiques. I think I’m more pointing at the intuition at why we should expect this to be easier than we might initially think, rather than saying that specific idea is going to work.

E.g. maybe this is a reason that (relaxed) adversarial training actually works great, since the adversary can check whether the deception neuron is lighting up

DF Seems fair, and I think this kind of intuition is why I research what I do.
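For concreteness, here is a minimal sketch of the two operations discussed in this thread: asking "what's the input that maximizes the output of this neuron?" and checking online whether that neuron is lighting up. Everything here is an assumption for illustration: the "neuron" is a single tanh unit with made-up weights `W` and `B`, and `maximize_activation` and `monitor` are hypothetical helper names, not a real interpretability API. As critiques (a) and (b) above note, a firing neuron would only show that the concept is represented, not that the network is aiming at it.

```python
import math
from typing import List

Vector = List[float]


def neuron(w: Vector, b: float, x: Vector) -> float:
    """Toy neuron: tanh of a weighted sum of the input."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)


def grad_wrt_input(w: Vector, b: float, x: Vector) -> Vector:
    """Gradient of the activation with respect to the input x."""
    a = neuron(w, b, x)
    return [(1.0 - a * a) * wi for wi in w]


def project_to_unit_ball(x: Vector) -> Vector:
    """Keep inputs bounded so that maximisation is well defined."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return x if norm <= 1.0 else [xi / norm for xi in x]


def maximize_activation(w: Vector, b: float, steps: int = 200, lr: float = 0.5) -> Vector:
    """Projected gradient ascent on the input, starting from zero."""
    x = [0.0] * len(w)
    for _ in range(steps):
        g = grad_wrt_input(w, b, x)
        x = project_to_unit_ball([xi + lr * gi for xi, gi in zip(x, g)])
    return x


def monitor(w: Vector, b: float, inputs: List[Vector], threshold: float = 0.9) -> List[int]:
    """Return indices of inputs on which the neuron fires above threshold."""
    return [i for i, x in enumerate(inputs) if neuron(w, b, x) > threshold]


# Made-up weights for the hypothetical neuron.
W, B = [3.0, 4.0], 0.0
best_input = maximize_activation(W, B)
flagged = monitor(W, B, [[0.6, 0.8], [0.0, 0.0], [-0.6, -0.8]])
```

For this toy unit the maximising input is just the weight direction scaled to unit length. In a real network the same procedure needs automatic differentiation, and the result is much harder to interpret.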

Rohin Shah on reasons for AI optimism


And the concept of 3D space seems like it’s probably going to be useful for an AI system no matter how smart it gets. Currently, they might have a concept of 3D space, but it’s not obvious that they do. And I wouldn’t be surprised if they don’t.

Presumably at some point they start actually using the concept of 4D locally-Minkowski spacetime instead (or quantum loops or whatever)

and in general - if you have things roughly like human notions of agency or cause, but formalised differently and more correctly than we would, that makes them harder to analyse.

RS I suspect they don’t use 4D spacetime, because it’s not particularly useful for most tasks, and takes more computation.

But I agree with the broader point that abstractions can be formalized differently, and that there can be more alien abstractions. But I’d expect that this happens quite a bit later

DF I mean maybe once you've gotten rid of the pesky humans and need to start building Dyson spheres... anyway, I think curved 4D spacetime does require more computation than standard 3D modelling, but I don't think that using Minkowski spacetime does.

RS Yeah, I think I’m often thinking of the case where AI is somewhat better than humans, rather than building Dyson spheres. Who knows what’s happening at Dyson sphere level. Probably should have said that in the conversation. (I think about it this way because it seems more important to align the first few AIs, and then have them help with aligning future ones.)

DF Sure. But even when you have AI that's worrying about signal transmission between different cities and the GPS system, SR is not that much more computationally intensive than Newtonian 3D space, and critical for accuracy.

Like I think the additional computational cost is in fact very low, but non-zero.

RS So like in practice if robots end up doing tasks like the ones we do, they develop intuitive physics models like ours, rather than Newtonian mechanics. SR might be only a bit more expensive than Newtonian, but I think most of the computational cost is in switching from heuristics / intuitive physics to a formal theory

(If they do different tasks than what we do, I expect them to develop their own internal physics which is pretty different from ours that they use for most tasks, but still not a formal theory)

DF Ooh, I wasn't accounting for that but it seems right.

I do think that plausibly in some situations 'intuitive physics' takes place in Minkowski spacetime.
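A rough sketch of the cost and accuracy comparison from earlier in this thread, under stated assumptions. The orbital speed `GPS_ORBITAL_SPEED` is an approximate value I am assuming for a GPS satellite, and the function names are hypothetical. The calculation covers only the special-relativistic effect. The gravitational effect, which is larger and has the opposite sign, is ignored here.

```python
import math

C = 299_792_458.0  # speed of light in m/s (exact by SI definition)
GPS_ORBITAL_SPEED = 3_874.0  # m/s, approximate assumed value
SECONDS_PER_DAY = 86_400.0


def euclidean_distance_sq(dx: float, dy: float, dz: float) -> float:
    """Newtonian 3D: squared spatial separation."""
    return dx * dx + dy * dy + dz * dz


def minkowski_interval_sq(dt: float, dx: float, dy: float, dz: float) -> float:
    """Flat 4D spacetime: one extra multiply and subtract.

    Uses the (-, +, +, +) sign convention, so a negative result means
    the separation is timelike.
    """
    return -(C * dt) ** 2 + euclidean_distance_sq(dx, dy, dz)


def sr_clock_lag_per_day(speed: float) -> float:
    """Seconds per day that a clock moving at `speed` lags a clock at rest."""
    beta_sq = (speed / C) ** 2
    return SECONDS_PER_DAY * (1.0 - math.sqrt(1.0 - beta_sq))


lag = sr_clock_lag_per_day(GPS_ORBITAL_SPEED)
ranging_error_m = lag * C  # distance light travels during the accumulated lag
print(f"SR clock lag: {lag * 1e6:.2f} microseconds per day")
print(f"Uncorrected ranging error: {ranging_error_m:.0f} metres per day")
```

Under these assumptions the lag comes out at roughly 7 microseconds per day, which corresponds to a ranging error on the order of kilometres per day if left uncorrected. The extra arithmetic compared with the Euclidean case is a single term.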

Rohin Shah on reasons for AI optimism

DF From your AI impacts interview:

And then I claim that conditional on that scenario having happened, I am very surprised by the fact that we did not know this deception in any earlier scenario that didn’t lead to extinction. And I don’t really get people’s intuitions for why that would be the case. I haven’t tried to figure that one out though.

I feel like I believe that people notice deception early on but are plausibly wrong about whether or not they've fixed it

RS After a few failures, you’d think we’d at least know to expect it?

DF Sure, but if your AI is also getting smarter, then that probably doesn't help you that much in detecting it, and only one person has to be wrong and deploy (if actually fixing takes a significantly longer time than sort of but not really fixing it) [this comment was written with less than usual carefulness]

RS Seems right, but in general human society / humans seem pretty good at being risk-averse (to the point that it seems to me that on anything that isn’t x-risk the utilitarian thing is to be more risk-seeking), and I’m hopeful that the same will be true here. (Also I’m assuming that it would take a bunch of compute, and it’s not that easy for a single person to deploy an AI, though even in that case I’d be optimistic, given that smallpox hasn’t been released yet.)

DF sorry by 'one person' I meant 'one person in charge of a big team'

RS The hope is that they are constrained by all the typical constraints on such people (shareholders, governments, laws, public opinion, the rest of the team, etc.) Also this significantly decreases the number of people who can do the thing, restricts it to people who are “broadly reasonable” (e.g. no terrorists), and allows us to convince each such person individually. Also I rarely think there is just one person — at the very least you need one person with a bunch of money and resources and another with the technical know-how, and it would be very difficult for these to be the same person

DF Sure. I guess even with those caveats my scenario doesn't seem that unlikely to me.

RS Sure, I don’t think this is enough to say “yup, this definitely won’t happen”. I think we do disagree on the relative likelihood of it happening, but maybe not by that much. (I’m hesitant to write a number because the scenario isn’t really fleshed out enough yet for us to agree on what we’re writing a number about.)

Rohin Shah on reasons for AI optimism

I had a chat with Rohin about portions of this interview in an internal Slack channel, which I'll post as replies to this comment (there isn't much shared state between different threads, I think).
