All of TekhneMakre's Comments + Replies

Without digging in too much, I'll say that this exchange and the OP are pretty confusing to me. It sounds like MB is saying "MIRI doesn't say it's hard to get an AI that has a value function" and then also saying "GPT has the value function, so MIRI should update". This seems almost contradictory.

A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.

And EY is blobbing those two...

> A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.

> [...]

Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.

Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of

...

> Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.

> Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.

To which I say: "dial a random phone number and ask the person who answers what's good" can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to lev...
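(To make the "small number of bits" point concrete: a minimal sketch, my own and not from the original exchange, in which both "pointers to roughly human values" are a line or two of Python. The stub functions are hypothetical stand-ins for the external calls.)

```python
import random

def query_llm(prompt: str) -> str:
    """Hypothetical stub standing in for a call to GPT-4 or similar."""
    return "<the model's answer>"

def dial_and_ask(phone_number: str, prompt: str) -> str:
    """Hypothetical stub standing in for phoning a stranger and asking."""
    return "<the stranger's answer>"

# Both "pointers to roughly human values" are only a couple of lines long,
# so brevity of the pointer, by itself, doesn't distinguish them.
def ask_gpt_whats_good(situation: str) -> str:
    return query_llm(f"What would be good to do in this situation? {situation}")

def ask_random_person_whats_good(situation: str) -> str:
    number = f"{random.randint(0, 9_999_999_999):010d}"
    return dial_and_ask(number, f"What would be good to do in this situation? {situation}")
```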

> Viruses are optimizing their own spread, not killing all humans. This seems to be further optimizing an already highly optimized artifact, not flowing optimization through an optimized channel.

Well, I'm saying that the virus's ability to penetrate the organism, penetrate cells and nuclei, and hijack the DNA transcription machinery, is a channel. It already exists and was optimized to transmit optimization power: selection on the viral genome is optimization, and it passes through this channel, in that this channel allows the viral genome (when outsi...

1 · Donald Hobson · 9mo
Yeah, probably. However, note that it can only use this channel if a human has deliberately made an optimization channel that connects into this process. I.e., the AI isn't allowed to invent DNA printers itself.

I think a bigger flaw is where one human decided to make a channel from A to B, another human made a channel from B to C ... until in total there is a channel from A to Z that no human wants and no human knows exists, built entirely out of parts that humans built. E.g., person 1 decides the AI should be able to access the internet, person 2 decides that anyone on the internet should be able to run arbitrary code on their programming website, and the AI puts those together, even when no human did.

Is that a failure of this design? Not sure. Can't get a clear picture until I have actual maths.
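(A toy sketch of the channel-composition worry; my illustration, with a made-up channel graph. Each edge is a channel some individual human deliberately created, but the composed path is one nobody intended.)

```python
# Each edge below is a channel that some individual human deliberately created.
human_made_channels = {
    "AI": ["internet access"],                                          # person 1's decision
    "internet access": ["run arbitrary code on programming website"],   # person 2's decision
}

def reachable(start: str, graph: dict) -> set:
    """Everything the start node can reach by composing the human-made channels."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# The composed channel "AI -> run arbitrary code" exists even though
# no single human decided to create it.
print(reachable("AI", human_made_channels))
```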
  1. How to measure / detect "is optimized to convey optimization"? What's "a channel"?
  2. Humans are far from the only optimizers. This would seem to allow e.g. engineering a modification to a virus that kills all humans, because viruses are optimized. Pointing at "is optimized by humans to convey optimization" seems much harder than just "is optimized to convey optimization".
  3. What's "running optimization power through a channel"? If I set off a nuclear bomb, am I "running optimization through the air" by passing a shockwave through it? If no, then there's a lot of stuff we're ruling out, and if yes, then how can this thing still be pivotal?
1 · Donald Hobson · 9mo
I am trying to write something that would make sense if I had as solid and mathy an idea of "optimization here" as I do with "information here".

Viruses are optimizing their own spread, not killing all humans. This seems to be further optimizing an already highly optimized artifact, not flowing optimization through an optimized channel.

I am not sure; I think it depends on why the AI wants the shockwave. Again, all I have is a fuzzy intuition that says yes in some cases, no in others, and shrugs in a lot of cases. I am trying to figure out if I can get this into formal maths. And if I succeed, I will (probably, unless infohazard or something) describe the formal maths.

I glanced at the first paper you cited, and it seems to show a very weak form of the statements you made. AFAICT their results are more like "we found brain areas that light up when the person reads 'cat', just like how this part of the neural net lights up when given input 'cat'" and less like "the LLM is useful for other tasks in the same way as the neural version is useful for other tasks". Am I confused about what the paper says, and if so, how? What sort of claim are you making?

> Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1.

Oops, I missed that assumption. Yeah, if there's such a policy, and it doesn't trade off against fetching the coffee, then it seems like we're good. Though see here, arguing briefly that, by Cromwell's rule, this policy doesn't exist: https://arbital.com/p/task_goal/

> Even with a realistic ϵ probability of shutdown failing, if we don’t try to juice 1−1/C so high that it exceeds 1−ϵ, my guess is there would...
1 · davidad (David A. Dalrymple) · 1y
Yes, I think there are probably strong synergies with satisficing, perhaps lexicographically minimizing something like energy expenditure once the EU maximum is reached. I will think about this more.
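(A minimal sketch of what "lexicographically minimizing something like energy expenditure once the EU maximum is reached" could look like; my own toy illustration, not davidad's actual construction, with made-up plans and numbers.)

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    expected_utility: float
    energy_cost: float

def lexicographic_pick(plans, tolerance=1e-9):
    """First maximize expected utility; among (near-)maximizers, minimize energy."""
    best_eu = max(p.expected_utility for p in plans)
    maximizers = [p for p in plans if p.expected_utility >= best_eu - tolerance]
    return min(maximizers, key=lambda p: p.energy_cost)

plans = [
    Plan("aggressive", expected_utility=1.0, energy_cost=9.0),
    Plan("frugal",     expected_utility=1.0, energy_cost=2.0),
    Plan("lazy",       expected_utility=0.7, energy_cost=0.5),
]
print(lexicographic_pick(plans).name)  # -> "frugal": ties on EU broken by energy cost
```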

Problem: suppose the agent foresees that it won't be completely sure that a day has passed, or that it has actually shut down. Then the agent A has a strong incentive to maintain control over the world past when it shuts down, so that it can swoop in and really shut A down if A might not have actually shut down and there might still be time. This puts a lot of strain on the correctness of the shutdown criterion: it has to forbid this sort of posthumous influence despite A optimizing to find a way to have such influence.
(The correctness might be assumed by the...

2 · davidad (David A. Dalrymple) · 1y
To the first point, I think this problem can be avoided with a much simpler assumption than that the shutdown criterion forbids all posthumous influence. Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1. (We might need a slightly stronger version of this assumption: it might need to be the case that for any action, there exists an action which has the same external effect but also causes a shutdown with probability 1.) This means that the agent doesn’t need to build itself any insurance policy to guarantee that it shuts down.

I think this is not a terribly inaccurate assumption; of course, in reality, there are cosmic rays, and a properly embedded and self-aware agent might deduce that none of its future actions are perfectly reliable, even though a model-free RL agent would probably never see any evidence of this (and it wouldn’t be any worse at folding the laundry for it). Even with a realistic ϵ probability of shutdown failing, if we don’t try to juice 1−1/C so high that it exceeds 1−ϵ, my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise that from 1−ϵ to 1.
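(A back-of-the-envelope sketch of the incentive claim in the last sentence, on my reading and with placeholder numbers, not davidad's actual model: the utility gain from pushing shutdown probability from 1−ϵ to 1 is at most about ϵ·(1−1/C), which any nontrivial cost of building a successor agent easily dwarfs.)

```python
# Placeholder numbers; only the orders of magnitude matter for the argument.
epsilon = 1e-6               # realistic probability that shutdown fails anyway
shutdown_utility = 0.99      # stands in for 1 - 1/C, kept below 1 - epsilon
successor_cost = 0.01        # assumed utility cost of building a guaranteed-shutdown successor

# Gain from raising shutdown probability from (1 - epsilon) to 1:
gain = epsilon * shutdown_utility   # at most epsilon * (1 - 1/C)

print(f"gain ~ {gain:.2e}, cost ~ {successor_cost:.2e}, worth it? {gain > successor_cost}")
# With these numbers the gain (~1e-6) is far below the cost, so there is
# no incentive to build the successor.
```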

This seems in danger of being a "sponge alignment" proposal, i.e. the proposed system doesn't do anything useful. https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#:~:text=sponge 

1 · Donald Hobson · 1y
This current version is dumb, but still exerts some optimization pressure. (It's just that the bits of optimization out are at most the bits of selection put into its utility function.)
Certainly it doesn't matter what substrate the computation is running on.

I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implement...

3 · Alex Turner · 1y
Why this seems true:

1. Any planning process which robustly succeeds must behave differently in the presence of different latent problems.
   1. If I'm going to the store and one of two routes may be closed down, and I want to always arrive at the store, my plan must somehow behave differently in the presence of the two possible latent complications (the road which is closed).
2. A pivotal act requires a complicated plan with lots of possible latent problems.
   1. Any implementing process (like an AI) which robustly enacts a complicated plan (like destroying unaligned AGIs) must somehow behave differently in the presence of many different problems (like the designers trying to shut down the AI).

Thus, robustly pulling off a pivotal act requires some kind of "reasoning about far-reaching consequences" on the latent world state.
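(A toy sketch of point 1.1, my illustration rather than Alex's: the plan that robustly reaches the store has to branch on the latent fact of which road is closed.)

```python
def route_to_store(closed_road: str) -> list[str]:
    """A 'robust' plan: behaves differently depending on the latent problem."""
    if closed_road == "A":
        return ["take road B", "arrive at store"]
    return ["take road A", "arrive at store"]

# The same planning process produces different behaviour in the two possible
# worlds; a plan that ignored the closure would fail in one of them.
for closed in ("A", "B"):
    print(f"road {closed} closed:", route_to_store(closed))
```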

> Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.

Endorsed.

Well, a main reason we'd care about codespace distance is that it tells us something about how the agent will change as it learns (i.e. moves around in codespace). (This is involving time, since the agent is changing, contra your picture.) So a key (quasi)metric on codespace would be "how much" learning it takes to get from here to there. The `if True: x() else: y()` program is an unnatural point in codespace in this metric: you'd have to have traversed both the distances from null to `x()` and from null to `y()`, and it's weird to have traversed a dis...

1 · Donald Hobson · 2y
I don't think that learning is moving around in codespace. In the simplest case, the AI is like any other non-self-modifying program. The code stays fixed as the programmers wrote it; the variables update. The AI doesn't start from null: the programmer starts from a blank text file and adds code, then they run the code, and the AI can start with sophisticated behaviour the moment it's turned on.

So are we talking about a program that could change from an X-er to a Y-er with a small change in the code written, or with a small amount of extra observation of the world?

Thanks for trying to clarify "X and only X", which IMO is a promising concept.

One thing we might want from an only-Xer is that, in some not-yet-formal sense, it's "only trying to X" and not trying to do anything else. A further thing we might want is that the only-Xer only tries to X, across some relevant set of counterfactuals. You've discussed the counterfactuals across possible environments. Another kind of counterfactual is across modifications of the only-Xer. Modification-counterfactuals seem to point to a key problem of alignment: how does this gene...

0 · Donald Hobson · 2y
My picture of an X-and-only-X-er is that the actual program you run should optimize only for X. I wasn't considering similarity in code space at all.

Getting the lexicographically first formal ZFC proof of, say, the Collatz conjecture should be safe. Getting a random proof sampled from the set of all proofs < 1 terabyte long should be safe. But I think that there exist proofs that wouldn't be safe. There might be a valid proof of the conjecture that had the code for a paperclip maximizer encoded into the proof, and that exploited some flaw in computers or humans to bootstrap this code into existence. This is what I want to avoid.

Your picture might be coherent and formalizable into some different technical definition. But you would have to start talking about difference in codespace, which can differ depending on different programming languages. The program `if True: x() else: y()` is very similar in codespace to `if False: x() else: y()`. If codespace distance is defined in terms of minimum edit distance, then layers of interpreters, error correction and homomorphic encryption can change it. This might be what you are after, I don't know.
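(A quick illustration of the last point, mine rather than Donald's: the two programs are nearly identical under a naive edit-distance or textual-similarity view, yet behave in opposite ways.)

```python
import difflib

# The two programs from the comment above, as source strings.
prog_true  = "if True: x()\nelse: y()"
prog_false = "if False: x()\nelse: y()"

# Nearly identical as text...
print("textual similarity:", difflib.SequenceMatcher(None, prog_true, prog_false).ratio())

# ...but opposite in behaviour.
log = []
def x(): log.append("ran x")
def y(): log.append("ran y")

for prog in (prog_true, prog_false):
    log.clear()
    exec(prog)
    print(repr(prog.splitlines()[0]), "->", log)
```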