Alex Turner

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network

Thoughts on Corrigibility

The Causes of Power-seeking and Instrumental Convergence

Reframing Impact

Posts

TurnTrout's shortform feed (10 karma, 5y, 230 comments)

Many arguments for AI x-risk are wrong (45 karma, 2mo, 42 comments)

Steering Llama-2 with contrastive activation additions (49 karma, 4mo, 23 comments)

Paper: Understanding and Controlling a Maze-Solving Policy Network (37 karma, 6mo, 0 comments)

ActAdd: Steering Language Models without Optimization (45 karma, 8mo, 2 comments)

Open problems in activation engineering (24 karma, 9mo, 2 comments)

Ban development of unpredictable powerful models? (22 karma, 10mo, 9 comments)

Mode collapse in RL may be fueled by the update equation (24 karma, 10mo, 7 comments)

Think carefully before calling RL policies "agents" (47 karma, 1y, 9 comments)

Steering GPT-2-XL by adding an activation vector (116 karma, 1y, 63 comments)

Residual stream norms grow exponentially over the forward pass (32 karma, 1y, 6 comments)

Wiki Contributions

Reinforcement Learning, 1y (+16)

Reinforcement Learning, 1y (+333/-390)

Complexity of Value, 1y (+176/-112)

General Alignment Properties, 2y (+317)

Pages Imported from the Old Wiki, 3y (+9/-5)

Impact Regularization, 3y (+22)

Mild Optimization, 4y (+188)

Impact Regularization, 4y (+95/-32)

Impact Regularization, 4y (+7/-7)

Impact Regularization, 4y (+57)

Comments

Non-myopia stories

Alex Turner, 22d, 2-2

> As Turntrout has already noted, that does not apply to model-based algorithms, and they 'do optimize the reward':

I think that you still haven't quite grasped what I was saying. "Reward is not the optimization target" totally applies here. (It was the post itself which only analyzed the model-free case, not that the lesson only applies to the model-free case.)

In the partial quote you provided, I was discussing two specific algorithms which are highly dissimilar to those being discussed here. If (as we were discussing), you're doing MCTS (or "full-blown backwards induction") on reward for the leaf nodes, the system optimizes the reward. That is -- if most of the optimization power comes from explicit search on an explicit reward criterion (as in AIXI), then you're optimizing for reward. If you're doing e.g. AlphaZero, that aggregate system isn't optimizing for reward.

Despite the derision which accompanies your discussion of "Reward is not the optimization target", it seems to me that you still do not understand the points I'm trying to communicate. You should be aware that I don't think you understand my views or that post's intended lesson. As I offered before, I'd be open to discussing this more at length if you want clarification.

CC @faul_sname

Richard Ngo's Shortform

Alex Turner, 1mo, 42

Nope! I have basically always enjoyed talking with you, even when we disagree.

Counting arguments provide no evidence for AI doom

Alex Turner, 2mo, 32

> As I've noted in all of these comments, people consistently use terminology when making counting style arguments (except perhaps in Joe's report) which rules out the person intending the argument to be about function space. (E.g., people say things like "bits" and "complexity in terms of the world model".)

Aren't these arguments about simplicity, not counting?

Counting arguments provide no evidence for AI doom

Alex Turner, 2mo, 40

I think they meant that there is an evidential update from "it's economically useful" upwards on "this way of doing things tends to produce human-desired generalization in general and not just in the specific tasks examined so far."

Perhaps it's easy to consider the same style of reasoning via: "The routes I take home from work are strongly biased towards being short, otherwise I wouldn't have taken them home from work."

Counting arguments provide no evidence for AI doom

Alex Turner, 2mo, 53

Sorry, I do think you raised a valid point! I had read your comment in a different way.

I think I want to have said: aggressively training AI directly on outcome-based tasks ("training it to be agentic", so to speak) may well produce persistently-activated inner consequentialist reasoning of some kind (though not necessarily the flavor historically expected). I most strongly disagree with arguments which behave the same for a) this more aggressive curriculum and b) pretraining, and I think it's worth distinguishing between these kinds of argument.

Richard Ngo's Shortform

Alex Turner, 2mo, 5-4

> In other words, shard advocates seem so determined to rebut the "rational EU maximizer" picture that they're ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?

Personally, I'm not ignoring that question, and I've written about it (once) in some detail. Less relatedly, I've talked about possible utility function convergence via e.g. "A shot at the diamond-alignment problem" and my recent comment thread with Wei_Dai.

It's not that there isn't more shard theory content which I could write, it's that I got stuck and burned out before I could get past the 101-level content.

I felt

a) gaslit by "I think everyone already knew this" or even "I already invented this a long time ago" (by people who didn't seem to understand it);
b) that I wasn't successfully communicating many intuitions;[1] and
c) that it didn't seem as important to make theoretical progress anymore, especially since I hadn't even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in "Understanding and controlling a maze-solving policy network").

So I didn't want to post much on the site anymore because I was sick of it, and decided to just get results empirically.

> In terms of its literal content, it basically seems to be a reframing of the "default" stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is "assume they're just a set of heuristics".

I've always read "assume heuristics" as expecting more of an "ensemble of shallow statistical functions" than "a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed." Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is composed of smaller shards, and the developmental trajectory over which those shards formed.

[1] The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time.

TurnTrout's shortform feed

Alex Turner, 2mo, 30

> It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

Thanks for pointing out that distinction!

Many arguments for AI x-risk are wrong

Alex Turner, 2mo, 00

See footnote 5 for a nearby argument which I think is valid:

> The strongest argument for reward-maximization which I'm aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that's evidence that "learning setups which work in reality" can come to care about their own training signals.

Many arguments for AI x-risk are wrong

Alex Turner, 2mo, 30

I don't expect the current paradigm will be insufficient (though it seems totally possible). Off the cuff, I'd put 75% on something like the current paradigm being sufficient, with some probability that something else happens first. (Note that "something like the current paradigm" doesn't just involve scaling up networks.)

Many arguments for AI x-risk are wrong

Alex Turner, 2mo, 20

> "If you don't include attempts to try new stuff in your training data, you won't know what happens if you do new stuff, which means you won't see new stuff as a good opportunity". Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won't be what builds capabilities in the limit.

I'm sympathetic to this argument (and think the paper overall isn't super object-level important), but also note that they train e.g. Hopper policies to hop continuously, even though lots of the demonstrations fall over. That's something new.
