Alex Turner

Oregon State University PhD student working on AI alignment.

Alex Turner's Comments

Inner alignment requires making assumptions about human values

It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can't see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.

I think people should generally be a little more careful about saying "this requires value-laden information". First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps through reframing. Building an AI which only does small things should not require the full specification of value, even though it seems like you have to say "don't do all these bad things we don't like"!

Second, it's always good to check "would this style of reasoning lead me to conclude that solving the easy problem of wireheading is value-laden?"

This isn't an object-level critique of your reasoning in this post, but more that the standard of evidence is higher for this kind of claim.

Inner alignment requires making assumptions about human values

I (low-confidence) think that there might be a "choose two" with respect to impact measures: large effects, no assumed ontology, and no (or very limited) value assumptions. I see how we might get small good effects without needing a nice pre-specified ontology or information about human values (AUP; to be discussed in upcoming Reframing Impact posts). I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming much about human preferences.

I know this isn't saying why I think this yet, but I'd just like to register this now for later discussion.

Vanessa Kosoy's Shortform

This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has 3 strategies with which to respond: (i) comply with the shutdown, (ii) resist defensively, i.e. prevent shutdown but without irreversibly damaging anything, (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user's stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That's because resisting offensively would generate high dangerousness by permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period...

This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms without theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage.

This seems quite close (or even identical) to attainable utility preservation; if I understand correctly, this echoes arguments I've made for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

There can't be too many things that reduce the expected value of the future by 10%; if there were, there would be no expected value left. So, the prior that any particular thing has such an impact should be quite low.

I don't follow this argument; I also checked the transcript, and I still don't see why I should buy it. Paul said:

A priori you might’ve been like, well, if you’re going to build some AI, you’re probably going to build the AI so it’s trying to do what you want it to do. Probably that’s that. Plus, most things can’t destroy the expected value of the future by 10%. You just can’t have that many things, otherwise there’s not going to be any value left in the end. In particular, if you had 100 such things, then you’d be down to like 1/1000th of your values. 1/10 hundred thousandth? I don’t know, I’m not good at arithmetic.

Anyway, that’s a priori, just aren’t that many things are that bad and it seems like people would try and make AI that’s trying to do what they want.

In my words, the argument is "we agree that the future has nontrivial EV, therefore big negative impacts are a priori unlikely".
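For concreteness, the arithmetic Paul gestures at: if 100 independent factors each cut the expected value of the future by 10%, the surviving fraction would be $(0.9)^{100} = e^{100 \ln 0.9} \approx 2.7 \times 10^{-5}$, i.e. roughly 1 part in 37,000.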

But why do we agree about this? Why are we assuming the future can't be that bleak in expectation? I think there are good outside-view arguments to this effect, but that isn't the reasoning here.

Towards a New Impact Measure

I think that this idea behind AUP has fairly obvious applications to human rationality and cooperation, although they aren’t spelled out in this post. This seems like a good candidate for follow-up work.

I'm curious whether these are applications I've started to gesture at in Reframing Impact, or whether what you have in mind as obvious isn't a subset of what I have in mind. I'd be interested in seeing your shortlist.

For more detail, see my exchange in the descendants of this comment - I still mostly agree with my claims about the technical aspects of AUP as presented in this post. Fleshing out these details is also, in my opinion, a good candidate for follow-up work.

Without rereading all of the threads, I'd like to note that I now agree with Daniel about the subhistories issue. I also agree that the formalization in this post is overly confusing and complicated.

Seeking Power is Instrumentally Convergent in MDPs

It seems a common reading of my results is that agents tend to seek out states with higher power. I think this is usually right, but it's false in some cases. Here's an excerpt from the paper:

So, just because a state has more resources doesn't technically mean the agent will go out of its way to reach it. Here's what the relevant current results say: parts of the future allowing you to reach more terminal states are instrumentally convergent, and the formal POWER contributions of different possibilities are approximately proportional to their instrumental convergence. As I said in the paper,

The formalization of power seems reasonable, consistent with intuitions for all toy MDPs examined. The formalization of instrumental convergence also seems correct. Practically, if we want to determine whether an agent might gain power in the real world, one might be wary of concluding that we can simply "imagine" a relevant MDP and then estimate e.g. the "power contributions" of certain courses of action. However, any formal calculations of POWER are obviously infeasible for nontrivial environments.

To make predictions using these results, we must combine the intuitive correctness of the power and instrumental convergence formalisms with empirical evidence (from toy models), with intuition (from working with the formal object), and with theorems (like theorem 46, which reaffirms the common-sense prediction that more cycles means asymptotic instrumental convergence, or theorem 26, fully determining the power in time-uniform environments). We can reason, "for avoiding shutdown to not be heavily convergent, the model would have to look like such-and-such, but it almost certainly does not...".
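For toy environments, though, the calculation really is easy. Here's a minimal sketch (the helper names are mine, and I use a simplified stand-in for POWER, namely the average optimal state value under uniformly sampled reward functions, rather than the paper's exact normalization):

```python
import numpy as np

def optimal_values(P, R, gamma=0.9, iters=500):
    """Value iteration for a finite MDP.
    P: transition tensor, shape (num_actions, num_states, num_states),
       with P[a, s, s2] = Pr(next state = s2 | state = s, action = a).
    R: state-based reward vector, shape (num_states,).
    Returns the optimal state-value vector."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R[None, :] + gamma * (P @ V)   # Q[a, s]
        V = Q.max(axis=0)
    return V

def estimate_power(P, gamma=0.9, samples=1000, seed=0):
    """Monte Carlo estimate of a POWER-like quantity: for each state, the
    average optimal value under reward functions drawn uniformly from
    [0, 1]^num_states. (Simplified; the paper normalizes differently.)"""
    rng = np.random.default_rng(seed)
    total = np.zeros(P.shape[1])
    for _ in range(samples):
        total += optimal_values(P, rng.uniform(size=P.shape[1]), gamma)
    return total / samples

# Tiny example: from state 0 the agent can stay or move to state 2;
# state 1 is a dead end. State 0 should come out more "powerful" than state 1.
P = np.zeros((2, 3, 3))
P[0] = np.eye(3)                                # action 0: stay put
P[1] = [[0, 0, 1], [0, 1, 0], [0, 0, 1]]        # action 1: 0 -> 2
print(estimate_power(P))
```

This runs instantly for a handful of states; the point of the caveat in the excerpt is that nothing like it scales to real-world environments.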

I think the Tic-Tac-Toe reasoning is a better intuition: it's instrumentally convergent to reach parts of the future which give you more control from your current vantage point. I'm working on expanding the formal results to include some version of this.

The Credit Assignment Problem

However, the critic is learning to predict; it's just that all we need to predict is expected value.

See also MuZero. Also note that predicting optimal value formally ties in with predicting a model: in MDPs, you can reconstruct all of the dynamics using just the optimal value functions.
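Here's my gloss on one way to see that, assuming we observe the optimal action-value functions $Q^*_R$ for several known reward functions $R$: the Bellman optimality equation
$$Q^*_R(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*_R(s')$$
is, for each fixed $(s,a)$, linear in the unknown transition probabilities $P(\cdot \mid s, a)$. With $|\mathcal{S}|$ reward functions whose optimal value vectors $V^*_R$ are linearly independent, the resulting linear system pins down $P(\cdot \mid s, a)$ exactly.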

Towards a New Impact Measure

Not quite. I'm proposing penalizing it for gaining power, a la my recent post. There's a big difference between "able to get 10 return from my current vantage point" and "I've taken over the planet and can ensure I get 100 return with high probability". We're penalizing it for increasing its ability like that (concretely, see Conservative Agency for an analogous formalization, or if none of this makes sense yet, wait until the end of Reframing Impact).
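To make "penalizing it for gaining power" slightly more concrete, here is a sketch in the spirit of the Conservative Agency penalty (the function names and the scaling choice are mine; the paper gives the actual formalization):

```python
import numpy as np

def aup_penalty(q_aux, state, action, noop_action):
    """Attainable-utility-style penalty (sketch).

    q_aux: array of shape (num_aux_rewards, num_states, num_actions) holding
           Q-values for a set of auxiliary reward functions.
    The penalty is the average absolute change in attainable auxiliary value
    when taking `action` instead of doing nothing.
    """
    diffs = np.abs(q_aux[:, state, action] - q_aux[:, state, noop_action])
    return diffs.mean()

def aup_reward(primary_reward, q_aux, state, action, noop_action,
               lam=1.0, scale=1.0):
    """Shaped reward: primary reward minus a scaled penalty for changing the
    agent's ability to pursue the auxiliary goals."""
    return primary_reward - (lam / scale) * aup_penalty(
        q_aux, state, action, noop_action)
```

The key move is that the penalty compares acting to doing nothing: actions which leave the agent about as able to pursue the auxiliary goals as inaction are cheap, while grabbing a lot of power shifts attainable values across the board and so gets penalized.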

Towards a New Impact Measure

This is my post.

How my thinking has changed

I've spent much of the last year thinking about the pedagogical mistakes I made here, and am writing the Reframing Impact sequence to fix them. While this post recorded my 2018-thinking on impact measurement, I don't think it communicated the key insights well. Of course, I'm glad it seems to have nonetheless proven useful and exciting to some people!

If I were to update this post, it would probably turn into a rehash of Reframing Impact. Instead, I'll just briefly state the argument as I would present it today. I currently think that power-seeking behavior is the worst part of goal-directed agency, incentivizing things like self-preservation and taking-over-the-planet. Unless we assume an "evil" utility function, an agent only seems incentivized to hurt us in order to become more able to achieve its own goals. But... what if the agent's own utility function penalizes it for seeking power? What happens if the agent maximizes a utility function while penalizing itself for becoming more able to maximize that utility function?
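One way to write that question down (a sketch only; Reframing Impact spells out the actual proposal): have the agent maximize
$$R_{\text{AUP}}(s,a) = R(s,a) - \lambda\,\bigl|\,Q^*_R(s,a) - Q^*_R(s,\varnothing)\,\bigr|,$$
where $\varnothing$ is a no-op action and $\lambda > 0$ controls how strongly the agent is penalized for changing its own ability to maximize $R$.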

This doesn't require knowing anything about human values in particular, nor do we need to pick out privileged parts of the agent's world model as "objects" or anything, nor do we have to disentangle butterfly effects. The agent lives in the same world as us; if we stop it from making waves by gaining a ton of power, we won't systematically get splashed. In fact, it should be extremely easy to penalize power-seeking behavior, because power-seeking is instrumentally convergent: since power helps with most goals, penalizing increases in your ability to achieve even an arbitrary goal (e.g. looking at blue pixels) should also penalize increases in power.

The main question in my mind is whether there's an equation that gets exactly what we want here. I think there is, but I'm not totally sure.

Seeking Power is Instrumentally Convergent in MDPs

That makes sense, thanks. I then agree that it isn't always true that our utility actively decreases, but it should generally become harder for us to optimize. This is the difference between a utility decrease and an attainable utility decrease.
