A positive case for how we might succeed at prosaic AI alignment
Certainly it doesn't matter what substrate the computation is running on.

I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences. Then (I'm guessing) Yudkowsky is applying that conservation law to [a big assemblage of myopic reasoners which outputs far-reaching plans], and concluding that either the reasoners weren't myopic, or else the assemblage implements a non-myopic reasoner with the myopic reasoners as a (mere) substrate.

Reasoning correctly about far-reaching consequences by default (1) has mistargeted consequences, and (2) is done by summoning a dangerous reasoner.

Such optimizers can still end up producing actions with far-reaching consequences on the world if they deploy their optimization power in the service of an objective like imitating HCH that requires producing actions with particular consequences, however.

I think what you're saying here implies that you think it is feasible to assemble myopic reasoners into a non-myopic reasoner without compromising safety. My possibly straw understanding is that the way this is supposed to happen in HCH is, basically, that the humans providing the feedback train the imitator(s) to implement a collective message-passing algorithm that answers any reasonable question or whatever. This sounds like a non-answer, i.e. it's just saying "...and then the humans somehow assemble myopic reasoners into a non-myopic reasoner". Where's the non-myopicness? If there's non-myopicness happening in each step of the human consulting HCH, then the imitator is imitating a non-myopic reasoner and so is non-myopic (and this is compounded by distillation steps). If there isn't non-myopicness happening in each step, how does it come into the assembly?

Optimization, speculations on the X and only X problem.

Well, a main reason we'd care about codespace distance is that it tells us something about how the agent will change as it learns (i.e. moves around in codespace). (This involves time, since the agent is changing, contra your picture.) So a key (quasi)metric on codespace would be "how much" learning it takes to get from here to there. The if True: x() else: y() program is an unnatural point in codespace in this metric: you'd have to have traversed both the distance from null to x() and the distance from null to y(), and it's weird to have traversed a distance and make no use of your position. A framing of the only-X problem is that traversing from null to a program that's an only-Xer according to your definition might also constitute traversing almost all of the way from null to a program that's an only-Yer, where Y is "very different" from X.
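To make the if True: x() else: y() point concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything from the post: x and y are stand-in behaviors, and token_distance is a deliberately crude proxy for codespace distance (it counts differing whitespace-separated tokens, which says nothing about how much learning separates two programs).

```python
def x():
    return "X"  # hypothetical stand-in for "doing X"


def y():
    return "Y"  # hypothetical stand-in for "doing Y"


# A program that only ever does X, yet carries all of the code for Y.
ONLY_X_SOURCE = "if True:\n    result = x()\nelse:\n    result = y()\n"
# Flipping a single token yields a program that only ever does Y.
ONLY_Y_SOURCE = "if False:\n    result = x()\nelse:\n    result = y()\n"


def run(source):
    """Execute a program text and return the value it bound to `result`."""
    scope = {"x": x, "y": y}
    exec(source, scope)
    return scope["result"]


def token_distance(a, b):
    """Crude proxy: positionwise token mismatches plus any length difference."""
    ta, tb = a.split(), b.split()
    mismatches = sum(1 for s, t in zip(ta, tb) if s != t)
    return mismatches + abs(len(ta) - len(tb))


behavior_x = run(ONLY_X_SOURCE)
behavior_y = run(ONLY_Y_SOURCE)
distance = token_distance(ONLY_X_SOURCE, ONLY_Y_SOURCE)
```

The two programs behave entirely differently but sit one token apart under this proxy, which is the sense in which reaching the only-Xer already means having traversed nearly all of the way to the only-Yer.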

Optimization, speculations on the X and only X problem.

Thanks for trying to clarify "X and only X", which IMO is a promising concept.

One thing we might want from an only-Xer is that, in some not-yet-formal sense, it's "only trying to X" and not trying to do anything else. A further thing we might want is that the only-Xer only tries to X across some relevant set of counterfactuals. You've discussed counterfactuals across possible environments. Another kind of counterfactual is across modifications of the only-Xer. Modification-counterfactuals seem to point to a key problem of alignment: how does this generalize? If we've selected something to do X within some set of environments, what does that imply about how it'll behave outside of that set of environments? It looks like by your definition we could have a program that's a very competent general intelligence with a slot for a goal, plus a pointer to X in that slot; and that program would count as an only-Xer. This program would be very close, in some sense, to programs that optimize competently for not-X, or for a totally unrelated Y. That clashes with my intuitive picture of an "X and only X"er, so either there's more to be said, or my picture is incoherent.
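A toy version of the "general intelligence with a slot for a goal" picture, as a hedged sketch only. The Agent class, the three goal functions, and the integer option set are all hypothetical stand-ins; a real general optimizer would be vastly more complicated, but the structural point (one field separates the X-optimizer from the not-X-optimizer) is the same.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class Agent:
    """Toy optimizer: generic search machinery plus a single goal slot."""
    goal: Callable[[int], float]  # the "slot"; hypothetical illustration

    def act(self, options: Iterable[int]) -> int:
        # The competence (here, trivial exhaustive search) is goal-agnostic.
        return max(options, key=self.goal)


def goal_x(n: int) -> float:
    return n  # "X": prefer larger numbers


def goal_not_x(n: int) -> float:
    return -n  # "not-X": prefer smaller numbers


def goal_y(n: int) -> float:
    return -abs(n - 3)  # unrelated "Y": prefer numbers near 3


only_xer = Agent(goal=goal_x)
# A one-field modification produces competent optimizers for other goals.
not_xer = replace(only_xer, goal=goal_not_x)
only_yer = replace(only_xer, goal=goal_y)

options = range(10)
choices = (only_xer.act(options), not_xer.act(options), only_yer.act(options))
```

Across every option set, only_xer does nothing but pursue X, so it passes an environment-counterfactual test; the modification-counterfactual (swapping the slot) is what exposes how near it sits to the other two agents.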