Sorted by New

Wiki Contributions


Without digging in too much, I'll say that this exchange and the OP is pretty confusing to me. It sounds like MB is like "MIRI doesn't say it's hard to get an AI that has a value function" and then also says "GPT has the value function, so MIRI should update". This seems almost contradictory.

A guess: MB is saying "MIRI doesn't say the AI won't have the function somewhere, but does say it's hard to have an externally usable, explicit human value function". And then saying "and GPT gives us that", and therefore MIRI should update.

And EY is blobbing those two things together, and saying neither of them is the really hard part. Even having the externally usable explicit human value function doesn't mean the AI cares about it. And it's still a lot of bits, even if you have the bits. So it's still true that the part about getting the AI to care has to go precisely right.

If there's a substantive disagreement about the facts here (rather than about the discourse history or whatever), maybe it's like:

Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.

Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.

Viruses are optimizing their own spread, not killing all humans. This seems to be further optimizing an > already highly optimized artifact, not flowing optimization through an optimized channel.

Well I'm saying that the virus's ability to penetrate the organism, penetrate cells and nuclei, and hijack the DNA transcription machinery, is a channel. It already exists and was optimized to transmit optimization power: selection on the viral genome is optimization, and it passes through this channel, in that this channel allows the viral genome (when outside of another organism) to modify the behavior of an organism's cells.

(For the record I didn't downvote your original post and don't know why anyone would.)

  1. How to measure / detect "is optimized to convey optimization"? What's "a channel"?
  2. Humans are far from the only optimizers. This would seem to allow e.g. engineering a modification to a virus that kills all humans, because viruses are optimized. Pointing at "is optimized by humans to convey optimization" seems much harder than just "is optimized to convey optimization".
  3. What's "running optimization power through a channel"? If I set off a nuclear bomb, am I "running optimization through the air" by passing a shockwave through it? If no, then there's a lot of stuff we're ruling out, and if yes, then how can this thing still be pivotal?

I glanced at the first paper you cited, and it seems to show a very weak form of the statements you made. AFAICT their results are more like "we found brain areas that light up when the person reads 'cat', just like how this part of the neural net lights up when given input 'cat'" and less like "the LLM is useful for other tasks in the same way as the neural version is useful for other tasks". Am I confused about what the paper says, and if so, how? What sort of claim are you making?

Essentially, the assumption I made explicitly, which is that there exists a policy which achieves shutdown with probability 1.

Oops, I missed that assumption. Yeah, if there's such a policy, and it doesn't trade off against fetching the coffee, then it seems like we're good. See though here, arguing briefly that by Cromwell's rule, this policy doesn't exist. https://arbital.com/p/task_goal/ 

Even with a realistic  probability of shutdown failing, if we don’t try to juice  so high that it exceeds , my guess is there would not be enough incentive to justify the cost of building a successor agent just to raise that from  to .

Hm. So this seems like you're making an additional, very non-trivial assumption, which is that the AI is constrained by costs comparable to / bigger than the costs to create a successor. If its task has already been very confidently achieved, and it has half a day left, it's not going to get senioritis, it's going to pick up whatever scraps of expected utility might be left. 

I wonder though if there's synergy between your proposal and the idea of expected utility satisficing: an EU satisficer with a shutdown clock is maybe anti-incentivized from self-modifying to do unbounded optimization, because unbounded optimization is harder to reliably shut down? IDK. 

Problem: suppose the agent foresees that it won't be completely sure that a day has passed, or that it has actually shut down. Then the agent A has a strong incentive to maintain control over the world past when it shuts down, to swoop in and really shut A down if A might not have actually shut down and if there might still be time. This puts a lot of strain on the correctness of the shutdown criterion: it has to forbid this sort of posthumous influence despite A optimizing to find a way to have such influence. 
(The correctness might be assumed by the shutdown problem, IDK, but it's still an overall issue.)

Another comment: this doesn't seem to say much about corrigibility, in the sense that it's not like the AI is now accepting correction from an external operator (the AI would prevent being shut down during its day of operation). There's no dependence on an external operator's choices (except that once the AI is shut down the operator can pick back up doing whatever, if they're still around). It seems more like a bounded optimization thing, like specifying how the AI can be made to not keep optimizing forever. 

This seems in danger of being a "sponge alignment" proposal, i.e. the proposed system doesn't do anything useful. https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#:~:text=sponge 

Certainly it doesn't matter what substrate the computation is running on.

I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.

Reasoning correctly about far-reaching consequences by default (1) has mistargeted consequences, and (2) is done by summoning a dangerous reasoner.

Such optimizers can still end up producing actions with far-reaching consequences on the world if they deploy their optimization power in the service of an objective like imitating HCH that requires producing actions with particular consequences, however.

I think what you're saying here implies that you think it is feasible to assemble myopic reasoners into a non-myopic reasoner, without compromising safety. My possibly straw understanding, is that the way this is supposed to happen in HCH is that, basically, the humans providing the feedback train the imitator(s) to implement a collective message-passing algorithm that answers any reasonable question or whatever. This sounds like a non-answer, i.e. it's just saying "...and then the humans somehow assemble myopic reasoners into a non-myopic reasoner". Where's the non-myopicness? If there's non-myopicness happening in each step of the human consulting HCH, then the imitator is imitating a non-myopic reasoner and so is non-myopic (and this is compounded by distillation steps). If there isn't non-myopicness happening in each step, how does it come in to the assembly?

Well, a main reason we'd care about codespace distance, is that it tells us something about how the agent will change as it learns (i.e. moves around in codespace). (This is involving time, since the agent is changing, contra your picture.) So a key (quasi)metric on codespace would be, "how much" learning does it take to get from here to there. The if True: x() else: y() program is an unnatural point in codespace in this metric: you'd have to have traversed the both the distances from null to x() and from null to y(), and it's weird to have traversed a distance and make no use of your position. A framing of the only-X problem is that traversing from null to a program that's an only-Xer according to your definition, might also constitute traversing almost all of the way from null to a program that's an only-Yer, where Y is "very different" from X.

Thanks for trying to clarify "X and only X", which IMO is a promising concept.

One thing we might want from an only-Xer is that, in some not-yet-formal sense, it's "only trying to X" and not trying to do anything else. A further thing we might want is that the only-Xer only tries to X, across some relevant set of counterfactuals. You've discussed the counterfactuals across possible environments. Another kind of counterfactual is across modifications of the only-Xer. Modification-counterfactuals seem to point to a key problem of alignment: how does this generalize? If we've selected something to do X, within some set of environments, what does that imply about how it'll behave outside of that set of environments? It looks like by your definition we could have a program that's a very competent general intelligence with a slot for a goal, plus a pointer to X in that slot; and that program would count as an only-Xer. This program would be very close, in some sense, to programs that optimize competently for not-X, or for a totally unrelated Y. That seems counterintuitive for my intuitive picture of an "X and only X"er, so either there's more to be said, or my picture is incoherent.

Load More