Alex Turner

Alex Turner, Oregon State University PhD student working on AI alignment.


Reframing Impact


Forecasting Thread: AI Timelines

I've also never really understood the resistance to why current models of AI are incapable of AGI.  Sure, we don't have AGI with current models, but how do we know it isn't a question of scale?  Our brains are quite efficient, but the total energy consumption is comparable to that of a light bulb.  I find it very hard to believe that a server farm in an Amazon, Microsoft, or Google Datacenter would be incapable of running the final AGI algorithm.  And for all the talk of the complexity in the brain, each neuron is agonizingly slow (200-300Hz).

First, you ask why it isn't a question of scale. But then you seem to wonder why we need any more scaling? This seems to mix up two questions: can current hardware support AGI for some learning paradigm, and can it support AGI for the deep learning paradigm?

Matt Botvinick on the spontaneous emergence of learning algorithms

E.g. TurnTrout has done a lot of self-learning from textbooks and probably has better advice [for learning RL]

I have been summoned! I've read a few RL textbooks... unfortunately, they're either a) very boring, b) very old, or c) very superficial. I've read:

  • Reinforcement Learning by Sutton & Barto (my book review)
    • Nice book for learning the basics. Best textbook I've read for RL, but that's not saying much.
    • Superficial, not comprehensive, somewhat outdated circa 2018; a good chunk was focused on older techniques I never/rarely read about again, like SARSA and exponential feature decay for credit assignment. The closest I remember them getting to DRL was when they discussed the challenges faced by function approximators.
  • AI: A Modern Approach 3e by Russell & Norvig (my book review)
    • Engaging and clear, but most of the book wasn't about RL. Outdated, but 4e is out now and maybe it's better.
  • Markov Decision Processes by Puterman
    • Thorough, theoretical, very old, and very boring. Formal and dry. It was written decades ago, so obviously no mention of Deep RL.
  • Neuro-Dynamic Programming by Bertsekas & Tsitsiklis
    • When I was a wee second-year grad student, I was independently recommended this book by several senior researchers. Apparently it's a classic. It's very dry and was written in 1996. Pass.

OpenAI's several-page web tutorial Spinning Up in Deep RL is somehow the most useful beginning RL material I've seen, outside of actually taking a class. Kinda sad.

So when I ask my brain things like "how do I know about bandits?", the result isn't "because I read it in {textbook #23}", but rather "because I worked on different tree search variants my first summer of grad school" or "because I took a class". I think most of my RL knowledge has come from:

  1. My own theoretical RL research
    1. the fastest way for me to figure out a chunk of relevant MDP theory is often just to derive it myself
  2. Watercooler chats with other grad students

Sorry to say that I don't have clear pointers to good material. 

Do what we mean vs. do what we say

I liked this post when it came out, and I like it even more now. This also brings to mind Paul's more recent Inaccessible Information.

Developmental Stages of GPTs

What is the formal definition of 'power seeking'?

Great question. One thing you could say is that an action is power-seeking compared to another, if your expected (non-dominated subgraph; see Figure 19) power is greater for that action than for the other. 

Power is kinda weird when defined for optimal agents, as you say - when , POWER can only decrease. See Power as Easily Exploitable Opportunities for more on this.

My understanding of figure 7 of your paper indicates that cycle reachability cannot be a sufficient condition.

Shortly after Theorem 19, the paper says: "In appendix C.6.2, we extend this reasoning to k-cycles (k > 1) via theorem 53 and explain how theorem 19 correctly handles fig. 7". In particular, see Figure 19.

The key insight is that Theorem 19 talks about how many agents end up in a set of terminal states, not how many go through a state to get there. If you have two states with disjoint reachable terminal state sets, you can reason about the phenomenon pretty easily. Practically speaking, this should often suffice: for example, the off-switch state is disjoint from everything else.

If not, you can sometimes consider the non-dominated subgraph in order to regain disjointness. This isn't in the main part of the paper, but basically you toss out transitions which aren't part of a trajectory which is strictly optimal for some reward function. Figure 19 gives an example of this.

The main idea, though, is that you're reasoning about what the agent's end goals tend to be, and then say "it's going to pursue some way of getting there with much higher probability, compared to this small set of terminal states (ie shutdown)". Theorem 17 tells us that in the limit, cycle reachability totally controls POWER. 

I think I still haven't clearly communicated all my mental models here, but I figured I'd write a reply now while I update the paper.

Thank you for these comments, by the way. You're pointing out important underspecifications. :)

My philosophy is that aligned/general is OK based on a shared (?) premise that,

I think one problem is that power-seeking agents are generally not that corrigible, which means outcomes are extremely sensitive to the initial specification.

Developmental Stages of GPTs

If there's a collection of 'turned-off' terminal states where the agent receives no further reward for all time then every optimized policy will try to avoid such a state.

To clarify, I don't assume that. The terminal states, even those representing the off-switch, also have their reward drawn from the same distribution. When you distribute reward IID over states, the off-state is in fact optimal for some low-measure subset of reward functions.

But, maybe you're saying "for realistic distributions, the agent won't get any reward for being shut off and therefore won't ever let itself be shut off". I agree, and this kind of reasoning is captured by Theorem 3 of Generalizing the Power-Seeking Theorems. The problem is that this is just a narrow example of the more general phenomenon. What if we add transient "obedience" rewards? For some level of farsightedness (γ close enough to 1), the agent will still disobey, and disobedience simultaneously gives it more control over the future.
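As a rough illustration of the last two replies, here is a minimal Monte Carlo sketch. Everything in it is an assumption made for illustration and is not taken from the paper: a toy one-decision environment in which the agent either "obeys" (collects a one-time bonus, then sits in the off-state forever) or "disobeys" (moves to the best of several other terminal states), per-step rewards drawn IID uniform on [0, 1), a discount rate written as gamma, and the hypothetical names `fraction_disobeying`, `num_other_terminals`, and `obedience_bonus`.

```python
import random


def fraction_disobeying(num_other_terminals, obedience_bonus, gamma,
                        num_samples=100_000, seed=0):
    """Estimate the share of IID-uniform reward functions whose optimal
    policy avoids the off-state in a toy one-decision environment."""
    rng = random.Random(seed)
    # Discounted weight of a reward received on every step after the choice,
    # relative to a one-time bonus received at the moment of the choice.
    horizon_weight = gamma / (1.0 - gamma)
    disobey_count = 0
    for _ in range(num_samples):
        # Per-step reward of the off-state, drawn like any other state.
        off_reward = rng.random()
        # Per-step reward of the best alternative terminal state.
        best_other = max(rng.random() for _ in range(num_other_terminals))
        obey_return = obedience_bonus + horizon_weight * off_reward
        disobey_return = horizon_weight * best_other
        if disobey_return > obey_return:
            disobey_count += 1
    return disobey_count / num_samples


if __name__ == "__main__":
    for gamma in (0.5, 0.9, 0.99, 0.999):
        share = fraction_disobeying(num_other_terminals=9,
                                    obedience_bonus=1.0, gamma=gamma)
        print(f"gamma={gamma}: share avoiding shutdown ~ {share:.3f}")
```

With no bonus, the off-state is optimal only when it happens to draw the highest reward, which in this toy setup is roughly 1 in (num_other_terminals + 1) draws. With a bonus, a myopic agent obeys, but as gamma approaches 1 the bonus is swamped by the long-run terminal reward and the share of disobeying agents climbs back toward the no-bonus figure.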

The paper doesn't draw the causal diagram "Power → instrumental convergence"; it gives sufficient conditions for power-seeking being instrumentally convergent. Cycle reachability preservation is one of those conditions.

In general, I'd suspect that there are goals we could give the agent that significantly reduce our gain. However, I'd also suspect the opposite.

Yes, right. The point isn't that alignment is impossible, but that you have to hit a low-measure set of goals which will give you aligned or non-power-seeking behavior. The paper helps motivate why alignment is generically hard and catastrophic if you fail. 

It seems reasonable to argue that we would if we could guarantee 

Yes, if , introduce the agent. You can formalize a kind of "alignment capability" by introducing a joint distribution over the human's goals and the induced agent goals (preliminary Overleaf notes). So, if we had goal X, we'd implement an agent with goal X', and so on. You then take our expected optimal value under this distribution and find whether you're good at alignment, or whether you're bad and you'll build agents whose optimal policies tend to obstruct you.

There might be a way to argue over randomness and say this would double our gain. 

The doubling depends on the environment structure. There are game trees and reward functions where this holds, and some where it doesn't. 

More speculatively, what if ?

If the rewards are ε-close in sup-norm, then you can get nice regret bounds, sure.
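For concreteness, here is the standard form such a bound takes. This is a sketch under assumptions that are not stated in the comment itself: state-based rewards R (the true one) and R' (the one actually optimized), a discount rate γ < 1, π* optimal for R, and π' optimal for R'.

```latex
% Step 1: closeness of rewards gives closeness of values for any fixed policy.
\[
\|R - R'\|_\infty \le \epsilon
\;\Longrightarrow\;
\bigl|V^{\pi}_{R}(s) - V^{\pi}_{R'}(s)\bigr|
\le \sum_{t=0}^{\infty} \gamma^{t}\epsilon
= \frac{\epsilon}{1-\gamma}
\quad \text{for every policy } \pi \text{ and state } s.
\]
% Step 2: regret of running \pi' (optimal for R') when the true reward is R.
\[
V^{*}_{R}(s) - V^{\pi'}_{R}(s)
= \underbrace{\bigl(V^{\pi^*}_{R}(s) - V^{\pi^*}_{R'}(s)\bigr)}_{\le\, \epsilon/(1-\gamma)}
+ \underbrace{\bigl(V^{\pi^*}_{R'}(s) - V^{\pi'}_{R'}(s)\bigr)}_{\le\, 0}
+ \underbrace{\bigl(V^{\pi'}_{R'}(s) - V^{\pi'}_{R}(s)\bigr)}_{\le\, \epsilon/(1-\gamma)}
\le \frac{2\epsilon}{1-\gamma}.
\]
```

The middle term is nonpositive because π' is optimal for R'. Note that the bound degrades as γ approaches 1, so it says less about very farsighted agents.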

Developmental Stages of GPTs

Great observation. Similarly, a hypothesis called "Maximum Causal Entropy" once claimed that physical systems involving intelligent actors tended towards states where the future could be specialized towards many different final states, and that maybe this was even part of what intelligence was. However, people objected: (monogamous) individuals don't perpetually maximize their potential partners -- they actually pick a partner, eventually.

My position on the issue is: most agents steer towards states which afford them greater power, and sometimes most agents give up that power to achieve their specialized goals. The point, however, is that they end up in the high-power states at some point in time along their optimal trajectory. I imagine that this is sufficient for the catastrophic power-stealing incentives: the AI only has to disempower us once for things to go irreversibly wrong.

Developmental Stages of GPTs

it seems like a response of the form "we have support for IC, not just in random minds, but also for random reward functions" has not responded to the critique and should not be expected to be convincing to that person.

I agree that the paper should not be viewed as anything but slight Bayesian evidence for the difficulty of real objective distributions. IIRC I was trying to reply to the point of "but how do we know IC even exists?" with "well, now we can say formal things about it and show that it exists generically, but (among other limitations) we don't (formally) know how hard it is to avoid if you try". 

I think I agree with most of what you're arguing.

Developmental Stages of GPTs

Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property?

Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results. 

I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.

ETA: also, I was referring to the point you made when I said

“the results don't prove how hard it is to tweak the reward function distribution, to avoid instrumental convergence”

Conclusion to 'Reframing Impact'

I'm very glad you enjoyed it! 

I've never read the "Towards a new Impact Measure" post, but I assume doing so is redundant now since this sequence is the 'updated' version.

I'd say so, yes. 

Attainable Utility Preservation: Scaling to Superhuman

I realize that impact measures always lead to a tradeoff between safety and performance competitiveness. 

For optimal policies, yes. In practice, not always - in SafeLife, AUP often had ~50% improved performance on the original task, compared to just naive reward maximization with the same algorithm!

it seems to penalize reasonable long-term thinking more than the formulas where .

Yeah. I'm also pretty sympathetic to arguments by Rohin and others that the  variant isn't quite right in general; maybe there's a better way to formalize "do the thing without gaining power to do it" wrt the agent's own goal.

whether the open problem of the AUP-agent tricking the penalty by restricting its future behavior is actually a symptom of the non-embedded agency model.

I think this is plausible, yep. This is why I think it's somewhat more likely than not there's no clean way to solve this; however, I haven't even thought very hard about how to solve the problem yet.

More generally, if you don't consider internal changes in principle, what stops a really powerful agent from reprogramming itself to slip through your penalty?

Depends on how that shows up in the non-embedded formalization, if at all. If it doesn't show up, then the optimal policy won't be able to predict any benefit and won't do it. If it does... I don't know. It might. I'd need to think about it more, because I feel confused about how exactly that would work - what its model of itself is, exactly, and so on. 
