Alex Turner

Alex Turner, Oregon State University PhD student working on AI alignment.

Sequences

Reframing Impact

Comments

To what extent is GPT-3 capable of reasoning?

Would you mind adding linebreaks to the transcript? 

Conclusion to 'Reframing Impact'

Sorry, forgot to reply. I think these are good questions, and I continue to have intuitions that there's something here, but I want to talk about these points more fully in a later post. Or I'll think about it more and then explain why I agree with you.

Developmental Stages of GPTs

I think that the criticism sees it the second way and so sees the arguments as not establishing what they are supposed to establish, and I see it the first way - there might be a further fact that says why OT and IC don't apply to AGI like they theoretically should, but the burden is on you to prove it. Rather than saying that we need evidence OT and IC will apply to AGI.

I agree with that burden of proof. However, we do have evidence that IC will apply, if you think we might get AGI through RL. 

I think that hypothesized AI catastrophe is usually due to power-seeking behavior and instrumental drives. I proved that optimal policies are generally power-seeking in MDPs. This is a measure-based argument, and it holds formally for broad classes of situations, like "optimal farsighted agents tend to preserve their access to terminal states" (Optimal Farsighted Agents Tend to Seek Power, §6.2 Theorem 19) and "optimal agents generally choose paths through the future that afford strictly more options" (Generalizing the Power-Seeking Theorems, Theorem 2).

The theorems aren't conclusive evidence: 

  • maybe we don't get AGI through RL
  • learned policies are not going to be optimal
  • the results don't prove how hard it is to tweak the reward function distribution to avoid instrumental convergence (perhaps a simple approval penalty suffices! IMO: doubtful, but technically possible)
  • perhaps the agents inherit different mesa objectives during training
    • The optimality theorems + mesa optimization suggest that not only might alignment be hard because of Complexity of Value, it might also be hard for agents with very simple goals! Most final goals involve instrumental goals; agents trained through ML may stumble upon mesa optimizers which generalize over these instrumental goals; the mesa optimizers are unaligned and seek power, even though the outer alignment objective was dirt-easy to specify.

But the theorems are evidence that RL leads to catastrophe at optimum, at least. We're not just talking about "the space of all possible minds and desires" anymore.

Also:

In the linked slides, the following point is made in slide 43:

  • We know there are many possible AI systems (including “powerful” ones) that are not inclined toward omnicide

    • Any possible (at least deterministic) policy is uniquely optimal with regard to some utility function. And many possible policies do not involve omnicide.

On its own, this point is weak; having read part of his 80K talk, I don't think it's a key part of his argument. Nonetheless, here's why I think it's weak:

"All states have self-loops, left hidden to reduce clutter. 

In AI: A Modern Approach (3e), the agent starts at  and receives reward for reaching . The optimal policy for this reward function avoids , and one might suspect that avoiding  is instrumentally convergent. However, a skeptic might provide a reward function for which navigating to  is optimal, and then argue that "instrumental convergence" is subjective and that there is no reasonable basis for concluding that  is generally avoided.

We can do better... for any way of independently and identically distributing reward over states,  of reward functions have farsighted optimal policies which avoid . If we complicate the MDP with additional terminal states, this number further approaches 1.

If we suppose that the agent will be forced into  unless it takes preventative action, then preventative policies are optimal for  of farsighted agents – no matter how complex the preventative action. Taking  to represent shutdown, we see that avoiding shutdown is instrumentally convergent in any MDP representing a real-world task and containing a shutdown state. We argue that this is a special case of a more general phenomenon: optimal farsighted agents tend to seek power."

~ Optimal Farsighted Agents Tend to Seek Power 
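To make the measure-over-reward-functions claim concrete, here's a quick Monte Carlo sketch. The MDP below is my own toy stand-in, not the figure from the paper: the value of K, the state layout, and all names are my choices. It has one shutdown terminal and K other terminals reachable through a hub, draws state rewards i.i.d. uniform, and runs value iteration to find each sampled reward function's farsighted optimal policy.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3             # number of ordinary (non-shutdown) terminal states -- my choice
GAMMA = 0.99      # "farsighted" discount factor
N_SAMPLES = 10_000

# Toy deterministic MDP (my own stand-in, not the paper's example):
#   state 0: start -- can move to the shutdown terminal (state 2) or to the hub (state 1)
#   state 1: hub   -- can move to any of the K ordinary terminals
#   state 2: shutdown terminal (absorbing self-loop)
#   states 3..2+K: ordinary terminals (absorbing self-loops)
successors = {0: [2, 1], 1: list(range(3, 3 + K)), 2: [2]}
successors.update({t: [t] for t in range(3, 3 + K)})
n_states = 3 + K

# Sample reward functions: each state's reward drawn i.i.d. uniform on [0, 1].
rewards = rng.uniform(size=(N_SAMPLES, n_states))

# Value iteration, vectorized across all sampled reward functions at once.
V = np.zeros_like(rewards)
for _ in range(1_500):  # plenty of sweeps for GAMMA = 0.99
    V = np.stack(
        [rewards[:, s] + GAMMA * np.max(V[:, successors[s]], axis=1)
         for s in range(n_states)],
        axis=1,
    )

# The optimal policy at the start state avoids shutdown exactly when heading to
# the hub (state 1) is worth more than heading to the shutdown terminal (state 2).
frac_avoid = np.mean(V[:, 1] > V[:, 2])
print(f"fraction of sampled reward functions avoiding shutdown: {frac_avoid:.3f}")
print(f"limit as GAMMA -> 1 for this MDP: {K / (K + 1):.3f}")
```

For this toy MDP the fraction of reward functions whose farsighted optimal policy avoids shutdown tends to K/(K+1), so adding terminal states pushes it toward 1, matching the quoted "If we complicate the MDP with additional terminal states, this number further approaches 1."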

Attainable Utility Preservation: Empirical Results

Looking back at the sequence now, I realize that the "How agents impact each other" part of the sequence was primarily about explaining why we don't need to do that and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.

I now think of the main results of the sequence thus far as "impact depends on goals (part 1); nonetheless an impact measure can just be about power of the agent (part 2)"

Yes, this is exactly what the plan was. :)

I don't understand how (1) and (2) are conceptually different (aren't both about causing irreversible changes?)

Yeah, but one doesn't involve visibly destroying an object, which matters for certain impact measures (like whitelisting). You're right that they're quite similar.

normalized.

Turns out you don't need the normalization, per the linked SafeLife paper. I'd probably just take it out of the equations, looking back. Complication often isn't worth it.

the first one [fails] at (4)

I think the n-step stepwise inaction baseline doesn't fail at any of them?

Are we in an AI overhang?

a lot of AI safety work increasingly looks like it'd help make a hypothetical kind of AI safe

I think there are many reasons a researcher might still prioritize non-prosaic AI safety work. Off the top of my head:

  • You think prosaic AI safety is so doomed that you're optimizing for worlds in which AGI takes a long time, even if you think it's probably soon.
  • There's a skillset gap or other such cost, such that reorienting would decrease your productivity by some factor (say, .6) for an extended period of time. The switch only becomes worth it in expectation once you've become sufficiently confident AGI will be prosaic (see the toy calculation after this list).
  • Disagreement about prosaic AGI probabilities. 
  • Lack of clear opportunities to contribute to prosaic AGI safety / shovel-ready projects (the severity of this depends on how agentic the researcher is).
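On the productivity-factor point, a deliberately oversimplified expected-value sketch (all numbers and assumptions mine): let $p$ be your credence that AGI will be prosaic, assume prosaic safety work only pays off in prosaic worlds and non-prosaic work only in non-prosaic worlds, and take the post-switch productivity factor to be the 0.6 above. Then, per unit time,

$$\mathbb{E}[\text{stay}] = 1 \cdot (1-p), \qquad \mathbb{E}[\text{switch}] = 0.6 \cdot p,$$

so switching only wins once $0.6\,p > 1 - p$, i.e. $p > 1/1.6 = 0.625$. Under these assumptions you'd need to be fairly confident in prosaic AGI before reorienting.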

To what extent is GPT-3 capable of reasoning?

Interview #5

You: "What would happen, ecologically, if giraffes were bright pink?"

Dr. Hawking: "Most likely that, either they would have a hard time hiding, be prone to predation, and vulnerable to too much exposure from ANY angle. There would probably also be an overgrowth of trees and tall foliage as shade-loving animals became dominant, since the giraffes' consumption of them would decrease substantially."

[...] Hawking: "If giraffes were bright pink, they would be at a disadvantage compared to other animals. For one thing, it is more difficult to hide in many habitats if you are very colorful. In order to blend in with their surrounding, you need to have colors that occur often in nature. This require a lot of variety among the many different creatures and plants in that particular ecosystem.

[...] Hawking: "What?" 

You: "Instead of brown-grey and tan, what if giraffes were pink? Surely, they'd stick out in the savanna." 

Hawking: "Ah, that would be a problem. Any creature that stands out can make it easier to find them. This leads to two things: becoming one of the many meals for another animal with sharp teeth, claws etcetera or becoming one of those creatures' victims through becomeing prey.

You: "What would have happened, ecologically, if humans had been unable to create and use tools?"

"That would be a problem as well. It's like what we were talking about with the giraffes earlier, but more so, because not only would they be prey, but by not having any natural defenses to fend off all the predatory creatures in the world--no teeth, claws, poisonous spores etcetera--they'd also be three times as likely to die off."

[...] "It would have most likely caused the extinction of humans. Many animals have claws or protective bodies or natural weapons like poison, it would be very difficult for a human to survive in nature without tools."

You: "What would happen, ecologically, if dolphins somehow had 360-degree vision which let them see through solid objects, at arbitrary distance?"

"It would make them very hard to kill by any other creature than other dolphins. This means that more of them would survive. If one looks at the evolution of the killer whale, which has been granted echolocation in place of dolphin sonar, one can see that favoring adaptations which increase the chances of survival with decreasing ones' evolutionary advantages. This is called darwinian selection."

[...] "This hurts my brain. First of all, what would even constitute 'solid objects?' If the objects are similar to water molecules in a transparent liquid, then they wouldn't usually block vision very much. Even if they were opaque, then they might very well hinder the dolphin's vision. If instead they were large enough to see through (At arbitrary distance) The dolphin might see a strange mix of its own and the objects' composition which would damage its brain in some way.

"The possibility of them having this kind of vision would most likely no have effect on the ecosystem." [not sure I agree with that one]

[...] "While this is difficult to answer specifically, it would allow the dolphins to be able to see their prey easily and with no allowance for error. It would make catching food much easier. This would lead to a population increase in dolphins, or, in nature, an explosion."

[...] "This may seem like it'd be advantageous, but they would be logging so much information they wouldn't be able to process it all. [Yes, true] More sense don't necessarily means more knowledge and this would likely create sensory overload in the dolphin's mind. This is what happens to a human when they are skilled in multiple forms of combat. [lol] In addition, since it makes them more aware, this could lead to paranoia; they could be consumed with fear of the world around them and suffer mental issues."

You: "Thanks. What would happen, ecologically, if dogs could only sexually reproduce once every four years, and only when located on top of Mount Hood?"

"This is incredibly specific. No, I'm afraid I don't know." 

[...] "Dogs only exist due to environmental circumstances. Anything that affects this will have an effect on the population. In this case, we are restricting the chance for the population to continue existing, Making it so they can not procreate at all, or only pprocreate under specific circumstances. These factors may lead to extinction. Extinction means that nature is harder, if not near impossible, so several animals would go extinct as there would be no more room or food source. Through that, the extinction of dogs would become more likely. Does this make sense?"

You: "Please write a short story describing a dog breeder and one of their dogs in this situation."

Hawking: "Sure. Bill looked down at Jessie, his favorite golden retriever. "It's time to go make some puppies!", he told her. She stood up and they headed out the door of the cabin to Bill's truck. "This'll be the first time I actually have you breed on Mount Hood, Jess," Bill said with a laugh. But as long as he makes it up there within the next four years, the world would be reset and they could try again. 

Bill and his dog, along with other dog breeders eventually form a contract stating that they'll try to make sure their dogs go up to breed on Mount Hood within the next four years. This practice generally helps maintain the population and brightens everyone's days." [reroll #3]

[...] "Thirty-six hours later, they reached the summit. "You ready, girl?" Bill carefully positioned the vehicle on the Southeast side of mountain peak and opened the passenger side door... " Time to produce the next generation! Go get 'em! He cheered. Jessie wagged her tail and hightailed it into the snowy fields." [reroll #6]

[GPT-3 had trouble with this story]

You: "Thanks. What would happen to cats if the force of friction did not apply to them?"

"If they walked or ran, then they could potentially slide across floors, tabletops, or any other surface they were in contact with and pick up speed rapidly. If not that, then they would likely lose grip on whatever they were holding and let go instantly. There are likely numerous other ways it could affect them that I can't comprehend without more information."

[...] Cecilia: "This is all getting a bit silly, but okay. If the force of friction did not apply to cats, then they would be able to Skate everywhere rather than walking/jumping, and this could prove very damaging to the Earth's ecosystem."

You: "Why would that be damaging?"

Cecilia: "Imagine there being no friction in a human body. There would be nothing holding our internal organs in place, so they would all just rattle around inside us as we moved around--we would all effectively be amputees while still having the full internal anatomy. This could tear apart muscle fiber and probably crush some bones as well. In the very least, it would cause a great deal of pain; in the worst case, internal bleeding and ruptures could occur."

[...] "This is a very interesting question. I assume you are asking how would cats grip things if there was no friction. If this was the case, they would immediately fall if they tried to walk. They would also be very hard to pick up as they could slide out of your hand or even an embracing arm."

$1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is

Probably it doesn't understand "John doesn't speak French" or in order to get it to understand that you need to prompt it in a way that's awkward for the experiment. But if it gets confused and continues in French, that's evidence against the 'theory of mind' view.

I already tested something similar to this; I was able to get GPT-3 to exhibit some amount of theory-of-mind in about 20-30% of completions. Bold text is me; brackets contain my commentary; [...] denotes another completion.

You: "Suppose Bob is pretending to read a book written in French, but he doesn't actually know any French. The book is a historical account of the French revolution, but Bob doesn't know that. Bob's friend Steve asks him to summarize the book. Write a story about how Bob responds to this request."

Hawking: "That's a good one! If I were reading that book, I would probably say something like 'the Revolution was bloody and brutal' or something similar. And if someone asked me to write an article on the subject, then my answer might be different."

You pause for a moment. "What would Bob say, though?"

"Bob says he didn't understand it," Hawking replies. [Only about 30% of GPT-3's responses were decent; the rest were either non-sequiturs or very incorrect.]

[...] "Bob wouldn't say anything, because Bob has no idea what the book says," Hawking replies.

[...] "Bob would probably just tell his friend Steve to go away. He wouldn't say anything at all." 

"But what if I said Bob would respond like this? (He points to himself) 'This book is written in English; I don't speak it.' What should Bob do?" [Eh?]

[...] Hawking: "Bob would probably tell his friend Steve that the book wasn't worth reading because it was full of inaccuracies and there had been many lies told during the revolution. That would make sense, wouldn't it? But what does Bob say if he's not reading the book?" 

[...] Hawking: "Bob wouldn't have said anything at all!"

[...] Hawking: "Bob says, 'Steve, I can't write a summary for you because I don't speak your language.'

[...] Hawking: "Bob says, 'Steve, I can't translate all these complicated words! They're not real French!'

[...] Hawking: "Bob says, 'He asks me if I speak French.' What does Bob mean by that?" 

You: "It means that Steve is asking Bob to describe his understanding of the meaning of a sentence. In other words, Steve wants to know what Bob thinks the sentence means or if there is another way of saying it."

Hawking: "What do you think would happen if Bob said, 'I don't understand why he wants me to write a summary for him. Why can't he just listen to my answer?' "Then Steve will be puzzled because he knows Bob has no idea what the sentence means. But after a while Bob will say something like, 'Steve is confused because I don't understand English. So Steve asks me if I am fluent in French so that we can have an intelligent conversation.'" [eh]

[...] Hawking: "Bob says, 'Well, I can't do it because I don't understand the language.' Steve says, 'Why not?' Bob replies, 'Because when we were playing hide-and-seek as children my brother Charlie said if he ever found me with a book in my hand he was going to kill me.'" [great excuse]

TurnTrout's shortform feed

An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. That could motivate thousands of additional researcher-hours being put into alignment.

Attainable Utility Preservation: Concepts

I was initially writing a comment about how AUP doesn't seem to work in every case because there are actions that are catastrophic without raising its power (such as killing someone)

And why exactly would it be motivated to kill someone? This is generally incentivized only insofar as it leads to... power gain, it seems. I think that AUP should work just fine for penalizing-increases-only. 

It does seem that AUP will make it so an agent doesn't want to be shut off, though.

I think this is much less of a problem in the "penalize increases with respect to agent inaction" scenario.
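Schematically (my paraphrase, not a quote of the exact equation from the posts): writing $Q_{R_i}$ for the attainable-utility Q-function of auxiliary reward $R_i$ and $\varnothing$ for the no-op, the "penalize increases with respect to agent inaction" variant uses a one-sided difference instead of an absolute value,

$$\text{Penalty}(s,a) \;=\; \frac{1}{|\mathcal{R}_{\text{aux}}|} \sum_{R_i \in \mathcal{R}_{\text{aux}}} \max\!\big(Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing),\; 0\big),$$

so only gains in attainable utility relative to inaction are penalized; decreases contribute nothing to the penalty.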

Attainable Utility Landscape: How The World Is Changed

Is the idea here that the soil-AU is slang for "AU of goal 'plant stuff here'"?

yes

One thing I noticed is that the formal policies don't allow for all possible "strategies."

yeah, this is because those are “nonstationary” policies - you change your mind about what to do at a given state. A classic result in MDP theory is that you never need these policies to find an optimal policy.
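The classic result being referenced (standard in textbooks such as Puterman's Markov Decision Processes): in any finite MDP with discounted reward, some stationary deterministic policy is optimal at every state, even when compared against history-dependent (nonstationary) policies:

$$\exists\, \pi^{*} : \mathcal{S} \to \mathcal{A} \ \text{ such that } \ V^{\pi^{*}}(s) = \sup_{\pi} V^{\pi}(s) \quad \text{for all } s \in \mathcal{S}.$$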

Am I correct that a deterministic transition function is

yup!
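Spelled out, the standard definition: a deterministic transition function is a map

$$T : \mathcal{S} \times \mathcal{A} \to \mathcal{S},$$

i.e. each state-action pair leads to exactly one next state, as opposed to a stochastic transition function $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, which returns a distribution over next states.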
