Alex Flint

Independent AI safety researcher

Comments

My take on Michael Littman on "The HCI of HAI"

Yes, I agree, it's difficult to find explicit and specific language for what it is that we would really like to align AI systems with. Thank you for the reply. I would love to read such a story!

My take on Michael Littman on "The HCI of HAI"

Thank you for the kind words.

for example, being aware that human intentions can change -- it's not obvious that the right move is to 'pop out' further and assume there is something 'bigger' that the human's intentions should be aligned with. Could you elaborate on your vision of what you have in mind there?

Well, it would definitely be a mistake to build an AI system that extracts human intentions at some fixed point in time and treats them as fixed forever, yes? So it seems to me that it would be better to build systems predicated on whatever is the underlying generator of the trajectory of human intentions. When I say "something bigger that the human's intentions should be aligned with", I don't mean "physically bigger"; I mean "prior to" or "the cause of".

For example, the work concerning corrigibility is about building AI systems that can be modified later, yes? But why is it good to have AI systems that can be modified later? I would say that the implicit claim underlying corrigibility research is that we believe humans have the capacity to, over time, slowly and with many detours, align our own intentions with that which is actually good. So we believe that if we align AI systems with human intentions in a way that is not locked in, then we will be aligning AI systems with that which is actually good. I'm not claiming this is true, just that this is a premise of corrigibility being good.

Another way of looking at it:

Suppose we look at a whole universe with a single human embedded in it, and we ask: where in this system should we look in order to discover the trajectory of this human's intentions as they evolve through time? We might draw a boundary around the human's left foot and ask: can we discover the trajectory of this human's intentions by examining the configuration of this part of the world? We might draw a boundary around the human's head and ask the same question, and I think some would say in this case that the answer is yes, we can discover the human's intentions by examining the configuration of the head. But this is a remarkably strong claim: it asserts that there is no information crucial to tracing the trajectory of the human's intentions over time in any part of the system outside the head. If we draw a boundary around the entire human, this is still an incredibly strong claim. We have a big physical system with constant interactions between the regions inside and outside this boundary. We can see that every part of the physical configuration of the region inside the boundary is affected over time by the physical configuration of the region outside the boundary. It is not impossible that all the information relevant to discovering the trajectory of intentions is inside the boundary, but it is a very strong claim to make. On what basis might we make such a claim?

One way to defend the claim that the trajectory of intentions can be discovered by looking just at the head, or just at the whole human, is to postulate that intentions are fixed. In that case we could extract the human's current intentions from the physical configuration of their head, which does seem highly plausible, and then the trajectory of intentions over time would just be a constant. But I do not think it is plausible that intentions are fixed like this.

Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

Our choice is not between having humans run the world and having a benevolent god run the world.

Right, I agree that having a benevolent god run the world is not within our choice set.

Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).

Well, just to restate the suggestion in my original post: is this dichotomy between humans running the world and something else running the world really so inescapable? The child in the sand pit does not really run the world, and in an important way the parent also does not run the world -- certainly not from the perspective of the child's whole-life trajectory.

Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

Thank you for this jbash.

Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world

My short response is: Yes, it would be very bad for present-day humanity to have more power than it currently does, since its current level of power is far out of proportion to its level of wisdom and compassion. But it seems to me that there are a small number of humans on this planet who have moved some way in the direction of being fit to run the world, and in time, more humans could move in this direction, and could move further. I would like to build the kind of AI that creates a safe container for movement in this direction, and then fades away as humans in fact become fit to run the world, however long or short that takes. If it turns out not to be possible then the AI should never fade away.

There's no point at which an AI with a practical goal system can tell anything recognizably human, "OK, you've grown up, so I won't interfere if you want to destroy the world, make life miserable for your peers, or whatever".

I think what it means to grow up is to not want to destroy the world or make life miserable for one's peers. I do not think that most biologically full-grown humans today have "grown up".

Institutions may have fine-tuned how they restrict individual agency. They may have managed to do it more when it helps and less when it hurts. But they haven't given it up. Institutions don't make individual adults sovereign, not even over themselves and definitely not in any matter that affects others.

Well just to state my position on this without arguing for it: my sense is that institutions should make individual adults sovereign if and when they grow up in the sense I've alluded to above. Very few currently living humans meet this bar in my opinion. In particular, I do not think that I meet this bar.

Not unless you deliberately modify them to the point where the word "human" becomes unreasonable.

True, but whether or not the word "human" is a reasonable description of what a person becomes when they become fit to run the world, the question is really: can humans become fit to run the world, and should they? Based on the few individuals I've spent time with who seem, in my estimation, to have moved some way in the direction of being fit to run the world, I'd say: yes and yes.

Reflections on Larks’ 2020 AI alignment literature review

I very much agree with these two:

On the other hand, there are lots of people who really do want to help, for the right reason. So if growth is the goal, helping these people out seems like just an obvious thing to do

So I think there is a lot of room for growth, by just helping the people who are already involved and trying.

Reflections on Larks’ 2020 AI alignment literature review

Thank you for this thoughtful comment Linda -- writing this reply has helped me to clarify my own thinking on growth and depth. My basic sense is this:

If I meet someone who really wants to help out with AI safety, I want to help them to do that, basically without reservation, regardless of their skill, experience, etc. My sense is that we have a huge and growing challenge in navigating the development of advanced AI, and there is just no shortage of work to do, though it can at first be quite difficult to find. So when I meet individuals, I will try to help them find out how to really help out. There is no need for me to judge whether a particular person really wants to help out or not; I simply help them see how they can help out, and those who want to help out will proceed. Those who do not want to help out will not proceed, and that's fine too -- there are plenty of good reasons for a person to not want to dive head-first into AI safety.

But it's different when I consider setting up incentives, which is what @Larks was writing about:

My basic model for AI safety success is this: Identify interesting problems. As a byproduct this draws new people into the field through altruism, nerd-sniping, apparent tractability. Solve interesting problems. As a byproduct this draws new people into the field through credibility and prestige.

I'm quite concerned about "drawing people into the field through credibility and prestige" and even about "drawing people into the field through altruism, nerd-sniping, and apparent tractability". The issue is not the people who genuinely want to help out, whom I consider to be a boon to the field regardless of their skill or experience. The issue is twofold:

  1. Drawing people who are not particularly interested in helping out into the field via incentives (credibility, prestige, etc).
  2. Tempting those who do really want to help out and are already actually helping out to instead pursue incentives (credibility, prestige, etc).

So I'm not skeptical of growth via helping individuals, I'm skeptical of growth via incentives.

Belief Functions And Decision Theory

Ah this is helpful, thank you.

So let's say I'm estimating the position of a train on a straight section of track as a single real number, and I want to do an update each time I receive a noisy measurement of the train's position. Under the theory you're laying out here I might have, say, three Gaussians N(0, 1), N(1, 10), N(4, 6), and rather than updating a single pdf over the position of the train, I'm updating measures associated with each of these three pdfs. Is that roughly correct?

(I realize this isn't exactly a great example of how to use this theory since train positions are perfectly realizable, but I just wanted to start somewhere familiar to me.)
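To make my mental model concrete, here is a minimal sketch of the kind of update I'm picturing, written as an ordinary Bayesian mixture update over the three Gaussian hypotheses, where each measurement updates both the hypothesis parameters and the measure (weight) attached to each hypothesis. I understand the actual infra-Bayesian update rule is presumably different; the measurement-noise variance and all the names below are assumptions I've made up just for illustration, and I'm treating the second parameter of each Gaussian as a variance.

```python
import math

# Each hypothesis: (mean, variance) of a Gaussian belief over the train's
# position, plus an associated (unnormalized) measure / weight.
# These numbers and names are my own illustration, not from the post.
hypotheses = [
    {"mean": 0.0, "var": 1.0,  "weight": 1.0},
    {"mean": 1.0, "var": 10.0, "weight": 1.0},
    {"mean": 4.0, "var": 6.0,  "weight": 1.0},
]

MEAS_VAR = 2.0  # assumed variance of the noisy position measurement


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def update(hypotheses, y):
    """Update each Gaussian hypothesis and its weight on a noisy measurement y."""
    for h in hypotheses:
        # Scale the weight by the predictive likelihood of y under this
        # hypothesis (ordinary Bayesian stand-in for the measure update).
        h["weight"] *= normal_pdf(y, h["mean"], h["var"] + MEAS_VAR)
        # Standard conjugate Gaussian update of the hypothesis itself.
        k = h["var"] / (h["var"] + MEAS_VAR)  # Kalman-style gain
        h["mean"] += k * (y - h["mean"])
        h["var"] *= (1 - k)
    return hypotheses


# Example: observe the train at position 2.5 and see how the three
# hypotheses and their weights shift.
for h in update(hypotheses, 2.5):
    print(h)
```

If that's roughly the right picture, modulo replacing the ordinary Bayesian weight update with whatever the infra-Bayesian update actually does to the measures, then I think I'm following the setup.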

Do you by chance have any worked examples where you go through the update procedure for some concrete prior and observation? If not, do you have any suggestions for what would be a good toy problem where I could work through an update at a very concrete level?

Belief Functions And Decision Theory

Thank you for your work both in developing this theory and putting together this heroic write-up! It's really a lot of work to write all this stuff out.

I am interested in understanding the thing you're driving at here, but I'm finding it difficult to navigate because I don't have much of a sense of where the definitions are heading. I'm really looking for an explanation of what exactly is made possible by this theory, so that as I digest each of the definitions I have a sense of where it is all heading.

My current understanding is that this is all in service of working in unrealizable settings where the true environment is not in your hypothesis class. Is that correct? What exactly does an infrabayesian agent do to cope with unrealizability?

Reflections on Larks’ 2020 AI alignment literature review

Yeah, so to be clear, I do think strategy research is pretty important; I just notice that in practice most of the strategy write-ups I read do not enlighten me very much, whereas it's not so uncommon to read technical write-ups that seem to really move our understanding forward. I guess it's more that doing truly useful strategy research is just ultra difficult. I do think that, for example, some of Bostrom's and Yudkowsky's early strategy write-ups were ultra useful and important.
