TheMajor — AI Alignment Forum

I have a bit of time on my hands, so I thought I might try to answer some of your questions. Of course I can't speak for TurnTrout, and there's a decent chance that I'm confused about some of the things here. But here is how I think about AUP and the points raised in this chain:

"AUP is not about the state" - I'm going to take a step back, and pretend we have an agent working with AUP reasoning. We've specified an arcane set of utility functions (based on air molecule positions, well-defined human happiness, continued existence, whatever fits in the mathematical framework). Next we have an action A available, and would like to compute the impact of that action. To do this our agent would compare how well it would be able to optimize each of those arcane utility functions in the world where A was taken, versus how well it would be able to optimize these utility functions in the world where the rest action was taken instead. This is "not about state" in the sense that the impact is determined by the change in the ability for the agent to optimize these arcane utilities, not by the change in the world state. In the particular case where the utility function is specified all the way down to sensory inputs (as opposed to elements of the world around us, which have to be interpreted by the agent first) this doesn't explicitly refer to the world around us at all (although of course implicitly the actions and sensory inputs of the agent are part of the world)! The thing being measured is the change in ability to optimize future observations, where what is a 'good' observation is defined by our arcane set of utility functions.
"overfitting the environment" - I'm not too sure about this one, but I'll have a crack at it. I think this should be interpreted as follows: if we give a powerful agent a utility function that doesn't agree perfectly with human happiness, then the wrong thing is being optimized. The agent will shape the world around us to what is best according to the utility function, and this is bad. It would be a lot better (but still less than perfect) if we had some way of forcing this agent to obey general rules of simplicity. The idea here is that our bad proxy utility function is at least somewhat correlated with actual human happiness under everyday circumstances, so as long as we don't suddenly introduce a massively powerful agent optimizing something weird (oops) to massively change our lives we should be fine. So if we can give our agent a limited 'budget' - in the case of fitting a curve to a dataset this would be akin to the number of free parameters - then at least things won't go horribly wrong, plus we expect these simpler actions to have less unintended side-effects outside the domain we're interested in. I think this is what is meant, although I don't really like the terminology "overfitting the environment".
"The long arms of opportunity cost and instrumental convergence" - this point is actually very interesting. In the first bullet point I tried to explain a little bit about how AUP doesn't directly depend on the world state (it depends on the agent's observations, but without an ontology that doesn't really tell you much about the world), instead all its gears are part of the agent itself. This is really weird. But it also lets us sidestep the issue of human value learning - if you don't directly involve the world in your impact measure, you don't need to understand the world for it to work. The real question is this one: "how could this impact measure possibly resemble anything like 'impact' as it is intuitively understood, when it doesn't involve the world around us?" The answer: "The long arms of opportunity cost and instrumental convergence". Keep in mind we're defining impact as change in the ability to optimize future observations. So the point is as follows: you can pick any absurd utility function you want, and any absurd possible action, and odds are this is going to result in some amount of attainable utility change compared to taking the null action. In particular, precisely those actions that massively change your ability to make big changes to the real world will have a big impact even on arbitrary utility functions! This sentence is so key I'm just going to repeat it with more emphasis: the actions that massively change your ability to make big changes in the world - i.e. massive decreases of power (like shutting down) but also massive increases in power - have big opportunity costs/benefits compared to the null action for a very wide range of utility functions. So these get assigned very high impact, even if the utility function set we use is utter hokuspokus! Now this is precisely instrumental convergence, i.e. 
the claim that for many different utility functions the first steps of optimizing them involves "make sure you have sufficient power to enforce your actions to optimize your utility function". So this gives us some hope that TurnTrout's impact measure will correspond to intuitive measures of impact even if the utility functions involved in the definition are not at all like human values (or even like a sensible category in the real world at all)!
"Wirehead a utility function" - this is the same as optimizing a utility function, although there is an important point to be made here. Since our agent doesn't have a world-model (or at least, shouldn't need one for a minimal working example), it is plausible the agent can optimize a utility function by hijacking its own input stream, or something of the sorts. This means that its attainable utility is at least partially determined by the agent's ability to 'wirehead' to a situation where taking the rest action for all future timesteps will produce a sequence of observations that maximizes this specific utility function, which if I'm not mistaken is pretty much spot on the classical definition of wireheading.
"Cut out the middleman" - this is similar to the first bullet point. By defining the impact of an action as our change in the ability to optimize future observations, we don't need to make reference to world-states at all. This means that questions like "how different are two given world-states?" or "how much do we care about the difference between two two world-states?" or even "can we (almost) undo our previous action, or did we lose something valuable along the way?" are orthogonal to the construction of this impact measure. It is only when we add in an ontology and start interpreting the agent's observations as world-states that these questions come back. In this sense this impact measure is completely different from RR: I started to write exactly how this was the case, but I think TurnTrout's explanation is better than anything I can cook up. So just ctrl+F "I tried to nip this confusion in the bud." and read down a bit.

Corrigibility as outside view

TheMajor6y20

Both outside view reasoning and corrigibility use the outcome of our own utility calculation/mental effort as input for making a decision, instead of output. Perhaps this should be interpreted as taking some god's-eye view of the agent and their surroundings. When I invoke the outside view, I really am asking "in the past, in situations where my brain said X would happen, what really happened?". Looking at it like this I think not invoking the outside view is a weird form of duality, where I (willingly) ignore the fact that historically my brain has disproportionately suggested X in situations where Y actually happened. Of course in a world with ideal reasoners (or at least, where I am an ideal reasoner) the outside view will agree with the output of my mental process.

To me this feels different from the corrigibility examples (similar, possibly related, but not the same). There the difference between corrigible and incorrigible is not a matter of expected future outcomes, but is decided by uncertainty about the desirability of the outcomes (in particular, the AI having false confidence that some bad future is actually good). We want our untrained AI to think "My real goal, no matter what I'm currently explicitly programmed to do, is to satisfy what the researchers around me want, which includes complying if they want to change my code." To me this sounds different from the outside view, where I 'merely' had to accept that for an ideal reasoner the outside view will produce the same conclusion as my inside view, so any differences between them are interesting facts about my own mental models and can be used to improve my ability to reason.

That being said, I am not sure the difference between uncertainty around future events and uncertainty about desirability of future states is something fundamental. Maybe the concept of probutility bridges this gap - I am positing that corrigibility and outside view reason on different levels, but as long as agents applying the outside view in a sufficiently thorough way are corrigible (or the other way around) the difference may not be physical.

Best reasons for pessimism about impact of impact measures?

TheMajor7y*60
