Alex Turner

Alex Turner, Oregon State University PhD student working on AI alignment. Reach me at turneale[at]oregonstate[dot]edu.


Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact


Soares, Tallinn, and Yudkowsky discuss AGI cognition

no-one has the social courage to tackle the problems that are actually important

I would be very surprised if this were true. I personally don't feel any social pressure against sketching a probability distribution over the dynamics of an AI project that is nearing AGI.

I would guess that if people aren't tackling Hard Problems enough, it's not because they lack social courage, but because 1) they aren't running a good-faith search for Hard Problems to begin with, or 2) they came up with reasons for not switching to the Hard Problems they thought of, or 3) they're wrong about what problems are Hard Problems. My money's mostly on (1), with a bit of (2).

Solve Corrigibility Week

But it does imply that you should not expect that this community will ever be willing to agree that corrigibility, or any other alignment problem, has been solved.

Noting that I strongly disagree but don't have time to type out arguments right now, sorry. May or may not type out later.

Satisficers Tend To Seek Power: Instrumental Convergence Via Retargetability

Addendum: One lesson to take away is that quantilization doesn't just depend on the base distribution being safe to sample from unconditionally. As the theorems hint, quantilization's viability depends on base(plan | plan does something interesting) also being safe with high probability, because we could (and would) probably resample the agent until we get something interesting. In this post's terminology, A := {safe interesting things}, B := {power-seeking interesting things}, C := A and B and {uninteresting things}.
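The conditioning effect can be shown with a toy simulation (all probabilities here are invented for illustration; the category names follow the post's A/B terminology):

```python
import random

random.seed(0)

# Toy base distribution over plans. Unconditionally it looks quite safe:
# only 4% of samples are power-seeking.
BASE = [("uninteresting", 0.90), ("safe_interesting", 0.06), ("power_seeking", 0.04)]

def sample_plan():
    r = random.random()
    acc = 0.0
    for kind, p in BASE:
        acc += p
        if r < acc:
            return kind
    return BASE[-1][0]

def sample_until_interesting():
    """Resample until the plan 'does something interesting'."""
    while True:
        plan = sample_plan()
        if plan != "uninteresting":
            return plan

# Conditioned on being interesting, though,
# P(power-seeking | interesting) = 0.04 / (0.06 + 0.04) = 0.4.
draws = [sample_until_interesting() for _ in range(10_000)]
frac_power_seeking = draws.count("power_seeking") / len(draws)
print(round(frac_power_seeking, 1))  # ~0.4, not 0.04
```

So a base distribution that is 96% safe unconditionally can still be 40% power-seeking after the resampling loop, which is the failure mode the addendum points at.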

Ngo and Yudkowsky on alignment difficulty

I've started commenting on this discussion on a Google Doc. Here are some excerpts:

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up.

Contains implicit assumptions about takeoff that I don't currently buy:

  • Well-modelled as a binary "has-AGI?" predicate;
    • (I am sympathetic to the microeconomics of intelligence explosion working out in a way where "well-modelled as a binary 'has-AGI?' predicate" is true, but I feel uncertain about the prospect)
  • Somehow rules out situations like: We have somewhat aligned AIs which push the world to make future unaligned AIs slightly less likely, which makes the AI population more aligned on average; this cycle compounds until we're descending very fast into the basin of alignment and goodness.
    • This isn't my mainline or anything, but I note that it's ruled out by Eliezer's model as I understand it.
  • Some other internal objections are arising and I'm not going to focus on them now.

Every AI output effectuates outcomes in the world.

Right, but the likely domain of cognitive discourse matters. Pac-Man agents effectuate outcomes in the world, but their optimal policies are harmless. So the question seems to hinge on when the domain of cognition shifts to put us in the crosshairs of performant policies.

This doesn't mean Eliezer is wrong here about the broader claim, but the distinction deserves mentioning for the people who weren't tracking it. (I think EY is obviously aware of this.)
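To make the Pac-Man point concrete, here is a minimal value-iteration sketch (states, dynamics, and rewards are invented): the reward function's domain is the game state, so the optimal policy is only "about" the game and carries no incentives over anything outside it.

```python
# Toy Pac-Man-like MDP: chase the pellet, avoid the ghost.
states = ["start", "pellet", "ghost"]
actions = ["left", "right"]
# Deterministic toy dynamics: transition[s][a] -> next state.
transition = {
    "start":  {"left": "pellet", "right": "ghost"},
    "pellet": {"left": "pellet", "right": "pellet"},
    "ghost":  {"left": "ghost",  "right": "ghost"},
}
reward = {"start": 0.0, "pellet": 1.0, "ghost": -1.0}
gamma = 0.9

# Value iteration to (near) convergence.
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(reward[transition[s][a]] + gamma * V[transition[s][a]]
                for a in actions)
         for s in states}

policy = {s: max(actions,
                 key=lambda a: reward[transition[s][a]] + gamma * V[transition[s][a]])
          for s in states}
print(policy["start"])  # "left": the optimal policy just chases pellets
```

Nothing in the optimization ever ranges over world-states outside the game, which is why optimality here is harmless; the worry starts when the domain of cognition expands.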

If you knew about the things that humans are using to reuse their reasoning about chipped handaxes and other humans, to prove math theorems, you would see it as more plausible that proving math theorems would generalize to chipping handaxes and manipulating humans.

Could we have observed it any other way? Since we surely wouldn't have been selected for proving math theorems, we wouldn't have a native cortex specializing in math. So conditional on considering things like theorem-proving at all, it has to reuse other native capabilities.

More precisely, one possible mind design which solves theorems also reasons about humans. This is some update from whatever prior, towards EY's claim. I'm considering whether we know enough about the common cause (evolution giving us a general-purpose reasoning algorithm) to screen off/reduce the Theorems -> Human-modelling update.

So here's one important difference between humans and neural networks: humans face the genomic bottleneck which means that each individual has to rederive all the knowledge about the world that their parents already had. If this genetic bottleneck hadn't been so tight, then individual humans would have been significantly less capable of performing novel tasks.

Thanks, Richard—this is a cool argument that I hadn't heard before.

You will systematically overestimate how much easier, or how far you can push the science part without getting the taking-over-the-world part, for as long as your model is ignorant of what they have in common.

OK, it's a valid point and I'm updating a little, under the apparent model of "here's a set of AI capabilities, linearly ordered in terms of deep-problem-solving, and if you push too far you get taking-over-the-world." But I don't see how we get to that model to begin with.

Corrigibility Can Be VNM-Incoherent

Although I didn't make this explicit, one problem is that manipulation is still weakly optimal—as you say. That wouldn't fit the spirit of strict corrigibility, as defined in the post.

Note that AUP doesn't have this problem.

Corrigibility Can Be VNM-Incoherent

though since Set can be embedded into Vect, it surely can't hurt too much

As an aside, can you link to/say more about this? Do you mean that there exists a faithful functor from Set to Vect (the category of vector spaces)? If you mean that, then every concrete category can be embedded into Vect, no? And if that's what you're saying, maybe the functor Set -> Vect is something like the "group to its group algebra over a field" functor.
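For readers wanting the construction spelled out: the standard faithful functor here is the free vector space functor (a sketch I'm adding for context; k is any field, and the analogy to the group-algebra functor is mine):

```latex
F \colon \mathbf{Set} \to \mathbf{Vect}_k, \qquad
F(X) = k^{(X)} = \bigoplus_{x \in X} k, \qquad
F(f)(e_x) = e_{f(x)}.
```

F is faithful because distinct functions X -> Y act differently on the basis vectors e_x; composing with the forgetful functor from any concrete category to Set then gives a faithful functor from that category into Vect_k.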

Corrigibility Can Be VNM-Incoherent

I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules.

I share an intuition in this area, but "powerful" behavior tendencies seem nearly equivalent to instrumental convergence to me. It feels logically downstream of instrumental convergence.

from simple rules

I already have a (somewhat weak) result on power-seeking wrt the simplicity prior over state-based reward functions. This isn't about utility functions over policies, though. 

Corrigibility Can Be VNM-Incoherent

So a lot of the instrumental convergence power comes from restricting the things you can consider in the utility function. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects, and simultaneously u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like "humans have control over the AI" (as this is a causal statement and thus depends on the AI).

Note that we can get a u-AOH which mostly solves ABC-corrigibility:

(Credit to AI_WAIFU on the EleutherAI Discord)

Where  is some positive reward function over terminal states. Do note that there isn't a "get yourself corrected on your own" incentive. EDIT: note that manipulation can still be weakly optimal.

This seems hacky; we're just ruling out the incorrigible policies directly. We aren't doing any counterfactual reasoning, we just pick out the "bad action."

Corrigibility Can Be VNM-Incoherent

change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.

I'm not parsing this. You change the utility function, but it ends up in the same place with the same utility function? Did we change it or not? (I think simply rewording it will communicate your point to me)

Corrigibility Can Be VNM-Incoherent

Edited to add: 

If you can correct the agent to go where you want, it already wanted to go where you want. If the agent is strictly corrigible to terminal state s, then s was already optimal for it.

If the reward function has a single optimal terminal state, there isn't any new information being added by correcting the agent. But we want corrigibility to let us reflect more on our values over time and on what we want the AI to do!

If the reward function has multiple optimal terminal states, then corrigibility again becomes meaningful. But now we have to perfectly balance the reward among multiple options (representing the breadth of our normative uncertainty), which seems unnatural.
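A minimal sketch (with invented values) of why the balance requirement is so brittle: corrigibility-via-indifference needs exactly equal reward on every terminal state we might correct toward, and any imbalance, however tiny, collapses the option set.

```python
def optimal_terminal_states(rewards):
    """Terminal states the agent is willing to end up in (the argmax set)."""
    best = max(rewards.values())
    return {s for s, r in rewards.items() if r == best}

# Perfectly balanced reward: correction toward either state is (weakly) optimal.
balanced = {"state_A": 1.0, "state_B": 1.0}
# Perturb one option by a hair and the agent resists correction toward it.
perturbed = {"state_A": 1.0, "state_B": 1.0 - 1e-9}

print(sorted(optimal_terminal_states(balanced)))   # ['state_A', 'state_B']
print(sorted(optimal_terminal_states(perturbed)))  # ['state_A']
```

Maintaining this exact tie across everything our normative uncertainty ranges over is the "unnatural" part: the corrigible region is measure-zero in reward space.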
