# TurnTrout's shortform feed


I think instrumental convergence also occurs in the model space for machine learning. For example, many different architectures likely learn edge detectors in order to minimize classification loss on MNIST. But wait - you'd also learn edge detectors to maximize classification loss on MNIST (loosely, getting 0% on a multiple-choice exam requires knowing all of the right answers). I bet you'd learn these features for a wide range of cost functions. I wonder if that's already been empirically investigated?

And, same for adversarial features. And perhaps, same for mesa optimizers (understanding how to stop mesa optimizers from being instrumentally convergent seems closely related to solving inner alignment).

A lot of examples of this sort of stuff show up in OpenAI Clarity's circuits analysis work. In fact, this is precisely their universality hypothesis. See also my discussion here.
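As a quick sanity check of the min-vs-max-loss claim above, here's a toy (not MNIST) logistic-regression version: fitting to minimize cross-entropy and fitting to maximize it recover the same feature direction, just with opposite sign. The data, step counts, and learning rate below are all invented for illustration.

```python
import numpy as np

# Two Gaussian classes separated along the direction [1, 1].
rng = np.random.default_rng(0)
n = 500
X = np.vstack([rng.normal(-1, 1, (n, 2)), rng.normal(1, 1, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def fit(X, y, sign, steps=300, lr=0.1):
    """Gradient descent on sign * cross-entropy (sign=+1 minimizes, sign=-1 maximizes)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (p - y) / len(y)  # gradient of mean cross-entropy
        w -= lr * sign * grad
    return w

w_min = fit(X, y, +1)  # minimize classification loss
w_max = fit(X, y, -1)  # maximize classification loss

cos = w_min @ w_max / (np.linalg.norm(w_min) * np.linalg.norm(w_max))
# The two weight vectors point in opposite directions along the SAME feature
# axis: maximizing loss still requires locating the class-separating feature.
```

Loosely, this is the "getting 0% on a multiple-choice exam requires knowing the answers" point: the loss-maximizer learns the same separating feature, then inverts it.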

Earlier today, I was preparing for an interview. I warmed up by replying stream-of-consciousness to imaginary questions I thought they might ask. Seemed worth putting here.

What do you think about AI timelines?

I’ve obviously got a lot of uncertainty. I’ve got a bimodal distribution, binning into “DL is basically sufficient and we need at most 1 big new insight to get to AGI” and “we need more than 1 big insight”.

So the first mode has most of its probability mass 10-20 years from now, and the second is more like 45-80 years out, with positive skew.
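The bimodal picture can be sketched as a mixture model. Every number below is an invented stand-in matching the rough bins above, not an actual forecast:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Which bin each sample falls in; the 50/50 split is made up for illustration.
scaling_suffices = rng.random(n) < 0.5
mode_fast = rng.normal(15, 3, n)              # "DL basically sufficient": ~10-20 years
mode_slow = 45 + rng.lognormal(2.5, 0.5, n)   # "need more insights": 45-80 years, positive skew
years_to_agi = np.where(scaling_suffices, mode_fast, mode_slow)
```

The lognormal tail on the second component is what gives the overall distribution its positive skew.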

A lot of things drive my uncertainty. One thing that drives how things turn out (but not really how fast we’ll get there) is: will we be able to tell we’re close 3+ years in advance, and if so, how quickly will the labs react? Gwern Branwen made a point a few months ago: OpenAI has really been validated on the scaling hypothesis, and no one else is betting big on it (stubbornness, incentives, etc.), despite the amazing progress from scaling. If that’s true, then even once it's getting pretty clear that one approach is working better, we might see a slower pivot and end up in a more unipolar scenario.

I feel dissatisfied with pontificating like this, though, because there are so many considerations pulling so many different ways. I think one of the best things we can do right now is to identify key considerations. There was work on expert models that showed that training simple featurized linear models often beat domain experts, quite soundly. It turned out that most of the work the experts did was locating the right features, and not necessarily assigning very good weights to those features.
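That expert-models result (improper linear models, à la Dawes) is easy to reproduce on synthetic data. The features, weights, and noise level below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))            # three standardized "expert-located" features
true_w = np.array([0.6, 0.3, 0.1])     # every located feature genuinely matters
y = X @ true_w + rng.normal(0, 0.5, n)

ols_w, *_ = np.linalg.lstsq(X, y, rcond=None)    # carefully fitted weights
corr_ols = np.corrcoef(X @ ols_w, y)[0, 1]
corr_unit = np.corrcoef(X.sum(axis=1), y)[0, 1]  # improper model: weight every feature 1.0
# corr_unit trails corr_ols only modestly: most of the predictive value came
# from locating the right features, not from tuning their weights.
```

This is the "key considerations" analogy: identifying the considerations is the featurization step, and that's where most of the value lies.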

One key consideration I recently read was Evan Hubinger's point about the homogeneity of AI systems: if they’re all pretty similarly structured, they’re plausibly roughly equally aligned, which would really decrease the probability of aligned and unaligned AGIs duking it out.

What do you think the alignment community is getting wrong?

When I started thinking about alignment, I had this deep respect for everything ever written; I thought the people were so smart (which they generally are) and the content was polished and thoroughly viewed through many different frames (which it wasn’t/isn’t). I think the field is still young enough that, in our research, we should be executing higher-variance cognitive moves: trying things, breaking things, and coming up with new frames; thinking about ideas from new perspectives.

I think right now, a lot of people are really optimizing for legibility and defensibility. I think I do that more than I want/should. Usually the “non-defensibility” stage lasts the first 1-2 months on a new paper, and then you have to defend thoughts. This can make sense for individuals, and it should be short some of the time, but as a population I wish defensibility weren’t as big of a deal for people / me. MIRI might be better at avoiding this issue, but a not-really-defensible intuition I have is that they’re freer in thought, but within the MIRI paradigm, if that makes sense. Maybe that opinion would change if I talked with them more.

Anyways, I think many of the people who do the best work aren’t optimizing for this.

# How might we align AGI without relying on interpretability?

I'm currently pessimistic about the prospect. But it seems worth thinking about, because wouldn't it be such an amazing work-around?

My first idea straddles the border between contrived and intriguing. Consider some AGI-capable ML architecture, and imagine its parameter space being 3-colored as follows:

• Gray if the parameter vector+training process+other initial conditions leads to a nothingburger (a non-functional model)
• Red if the parameter vector+... leads to a misaligned or deceptive AI
• Blue if the learned network's cognition is "safe" or "aligned" in some reasonable way

(This is a simplification, but let's roll with it.)

And then if you could somehow reason about which parts of the parameter space weren't red, you could ensure that no deception ever occurs. That is, you might have very little idea what cognition the learned network implements, but magically somehow you have strong a priori / theoretical reasoning which ensures that whatever the cognition is, it's safe.

The contrived part is that you could just say "well, if we could wave a wand and produce an is-impact-aligned predicate, of course we could solve alignment." True, true.

But the intriguing part is that it doesn't seem totally impossible to me that we get some way of reasoning (at least statistically) about the networks and cognition produced by a given learning setup. See also: the power-seeking theorems, natural abstraction hypothesis, feature universality a la Olah's circuits agenda...
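A minimal sketch of what "statistical reasoning about the coloring" could look like, with completely invented stand-in predicates for gray/red/blue (the real versions are exactly the hard part):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "parameter vectors" drawn from the initialization distribution.
thetas = rng.uniform(-1, 1, size=(100_000, 2))

norms = np.linalg.norm(thetas, axis=1)
gray = norms < 0.5                    # "nothingburger": effectively dead network
red = ~gray & (thetas[:, 0] > 0)      # "misaligned" region (made up)
blue = ~gray & ~red                   # everything else counts as "safe" here

p_red = red.mean()
# Even without knowing what any single theta does, we can estimate the chance
# that a random initialization + training run lands in the red region.
```

The intriguing version of the idea is that something like `red` could be bounded theoretically rather than checked pointwise.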

My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I'm playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I'll lose the game. I predict that I'll lose, even though I can't predict my opponent's (optimal) moves - otherwise I'd probably be that good myself.

My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even without saying what particular actions these policies take to get there. I may not even be able to compute a single optimal policy for a single non-trivial objective, but I can still reason about the statistical tendencies of optimal policies.
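A minimal sketch of that statistical claim, in an invented five-state deterministic MDP where "shutdown" cuts off two otherwise-reachable absorbing futures:

```python
import numpy as np

# States: 0 = start, 1 = alive, 2 = shutdown (absorbing), 3 & 4 = absorbing futures.
# next_state[s][a] gives the deterministic successor; absorbing states self-loop.
next_state = [[1, 2], [3, 4], [2, 2], [3, 3], [4, 4]]
gamma = 0.99

def optimal_action_at_start(r, iters=700):
    """Value-iterate on reward vector r, then return the optimal first action."""
    V = np.zeros(5)
    for _ in range(iters):
        V = np.array([max(r[s] + gamma * V[s2] for s2 in next_state[s])
                      for s in range(5)])
    q = [r[0] + gamma * V[s2] for s2 in next_state[0]]
    return int(np.argmax(q))  # action 0 = stay alive, action 1 = shut down

rng = np.random.default_rng(0)
n = 300
frac_avoid = sum(optimal_action_at_start(rng.random(5)) == 0 for _ in range(n)) / n
# Staying alive preserves access to both futures (states 3 and 4), so for iid
# uniform rewards roughly 2/3 of reward functions make shutdown-avoidance optimal.
```

The point of the sketch: we never inspect any individual optimal policy's moves, only the statistical tendency across sampled objectives.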

I never thought I'd be seriously testing the reasoning abilities of an AI in 2020

Looking back, history feels easy to predict; hindsight + the hard work of historians makes it (feel) easy to pinpoint the key portents. Given what we think about AI risk, in hindsight, might this have been the most disturbing development of 2020 thus far?

I personally lean towards "no", because this scaling seemed somewhat predictable from GPT-2 (flag - possible hindsight bias), and because 2020 has been so awful so far. But it seems possible, at least. I don't really know what update GPT-3 is to my AI risk estimates & timelines.

DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3. Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?

I'd like to see research exploring the relevance of intragenomic conflict to AI alignment research. Intragenomic conflict constitutes an in-the-wild example of misalignment, where conflict arises "within an agent" even though the agent's genes have strong instrumental incentives to work together (they share the same body).

In an interesting parallel to John Wentworth's Fixing the Good Regulator Theorem, I have an MDP result that says:

Suppose we're playing a game where I give you a reward function and you give me its optimal value function in the MDP. If you let me do this for |S| reward functions (one for each state in the environment), and you're able to provide the optimal value function for each, then you know enough to reconstruct the entire environment (up to isomorphism).

Roughly: being able to complete linearly many tasks in the state space means you have enough information to model the entire environment.
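A hedged formalization of the statement (assuming "one reward function for each state" means indicator rewards, which the setup suggests but doesn't say outright):

```latex
\text{Let } M = (S, A, T, \gamma) \text{ be a finite MDP. For each } s \in S,
\text{ define the indicator reward } r_s(x) = \mathbf{1}[x = s],
\text{ with optimal value function } V^*_{r_s}.
\text{Claim: the collection } \{ V^*_{r_s} \}_{s \in S}
\text{ determines } (S, A, T) \text{ up to isomorphism.}
```

So |S| queries to the "optimal value oracle", one per state, pin down the transition structure.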

Are there any alignment techniques which would benefit from the supervisor having a severed corpus callosum, or otherwise atypical neuroanatomy? Usually theoretical alignment discourse focuses on the supervisor's competence / intelligence. Perhaps there are other, more niche considerations.

An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.

That's an interesting point.

ARCHES distinguishes between single-agent / single-user and single-agent/multi-user alignment scenarios. Given assumptions like "everyone in society is VNM-rational" and "societal preferences should also follow VNM rationality", and "if everyone wants a thing, society also wants the thing", Harsanyi's utilitarian theorem shows that the societal utility function is a linear non-negative weighted combination of everyone's utilities. So, in a very narrow (and unrealistic) setting, Harsanyi's theorem tells you how the single-multi solution is built from the single-single solutions.
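Under those assumptions, Harsanyi's conclusion can be written as:

```latex
U_{\text{society}}(x) \;=\; \sum_{i=1}^{n} w_i \, U_i(x), \qquad w_i \ge 0 .
```

Note that the theorem itself doesn't tell you how to pick the weights $w_i$, which is where much of the practical difficulty hides.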

This obviously doesn't actually solve either alignment problem. But, it seems like an interesting parallel for what we might eventually want.

Dylan: There’s one example that I think about, which is, say, you’re cooperating with an AI system playing chess. You start working with that AI system, and you discover that if you listen to its suggestions, 90% of the time, it’s actually suggesting the wrong move or a bad move. Would you call that system value-aligned?

Lucas: No, I would not.

Dylan: I think most people wouldn’t. Now, what if I told you that that program was actually implemented as a search that’s using the correct goal test? It actually turns out that if it’s within 10 steps of a winning play, it always finds that for you, but because of computational limitations, it usually doesn’t. Now, is the system value-aligned? I think it’s a little harder to tell here. What I do find is that when I tell people the story, and I start off with the search algorithm with the correct goal test, they almost always say that that is value-aligned but stupid.

There’s an interesting thing going on here, which is we’re not totally sure what the target we’re shooting for is. You can take this thought experiment and push it further. Suppose you’re doing that search, but now it’s a heuristic search that uses the correct goal test but has an adversarially chosen heuristic function. Would that be a value-aligned system? Again, I’m not sure. If the heuristic was adversarially chosen, I’d say probably not. If the heuristic just happened to be bad, then I’m not sure.

Consider the optimizer/optimized distinction: the AI assistant is better described as optimized to either help or stop you from winning the game. This optimization may or may not have been carried out by a process which is "aligned" with you; I think that ascribing intent alignment to the assistant's creator makes more sense. In terms of the adversarial heuristic case, intent alignment seems unlikely.

But, this also feels like passing the buck – hoping that at some point in history, there existed something to which we are comfortable ascribing alignment and responsibility.

We can imagine aliens building a superintelligent agent which helps them get what they want. This is a special case of aliens inventing tools. What kind of general process should these aliens use – how should they go about designing such an agent?

Assume that these aliens want things in the colloquial sense (not that they’re eg nontrivially VNM EU maximizers) and that a reasonable observer would say they’re closer to being rational than antirational. Then it seems[1] like these aliens eventually steer towards reflectively coherent rationality (provided they don’t blow themselves to hell before they get there): given time, they tend to act to get what they want, and act to become more rational. But, they aren’t fully “rational”, and they want to build a smart thing that helps them. What should they do?

In this situation, it seems like they should build an agent which empowers them & increases their flexible control over the future, since they don’t fully know what they want now. Lots of flexible control means they can better error-correct and preserve value for what they end up believing they actually want. This also protects them from catastrophe and unaligned competitor agents.

1. I don’t know if this is formally and literally always true, I’m just trying to gesture at an intuition about what kind of agentic process these aliens are. ↩︎

Very rough idea

In 2018, I started thinking about corrigibility as "being the kind of agent lots of agents would be happy to have activated". This seems really close to a more ambitious version of what AUP tries to do (not be catastrophic for most agents).

I wonder if you could build an agent that rewrites itself / makes an agent which would tailor the AU landscape towards its creators' interests, under a wide distribution of creator agent goals/rationalities/capabilities. And maybe you then get a kind of generalization, where most simple algorithms which solve this solve ambitious AI alignment in full generality.

The answer to this question (roughly: does reward shaping affect learning in a way that reward rescaling doesn’t?) seems obvious in isolation: shaping helps with credit assignment, rescaling doesn’t (and might complicate certain methods, in the advantage-vs-Q-value way). But I feel like maybe there’s an important interaction here that could inform a mathematical theory of how a reward signal guides learners through model space.
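One precise version of the shaping-vs-rescaling contrast is potential-based shaping: for any potential function $\Phi$, the shaped reward preserves optimal policies while changing the learning signal, whereas positive rescaling changes neither the optimal policy nor the relative values:

```latex
r'(s, a, s') \;=\; r(s, a, s') + \gamma\,\Phi(s') - \Phi(s)
\qquad\text{vs.}\qquad
r''(s, a, s') \;=\; c\, r(s, a, s'),\; c > 0 \;\Rightarrow\; V''^{*} = c\,V^{*}.
```

Shaping redistributes reward along trajectories (hence the credit-assignment benefit); rescaling only stretches the value axis.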

Reasoning about learned policies via formal theorems on the power-seeking incentives of optimal policies

One way instrumental subgoals might arise in actual learned policies: we train a proto-AGI reinforcement learning agent with a curriculum including a variety of small subtasks. The current theorems show sufficient conditions for power-seeking tending to be optimal in fully-observable environments; many environments meet these sufficient conditions; optimal policies aren't hard to compute for the subtasks. One highly transferable heuristic would therefore be to gain power in new environments, and then figure out what to do for the specific goal at hand. This may or may not take the form of an explicit mesa-objective embedded in e.g. the policy network.

Later, the heuristic has the agent seek power for the "real world" environment.

I prompted GPT-3 with modified versions of Eliezer's Beisutsukai stories, where I modified the "class project" to be about solving intent alignment instead of quantum gravity.

... Taji looked over his sheets. "Okay, I think we've got to assume that every avenue that Eld science was trying is a blind alley, or they would have found it. And if this is possible to do in one month, the answer must be, in some sense, elegant. So no human mistake models. If we start doing anything that looks like we should call it 'utility function patching', we'd better stop."

He turned to Brennan. "Do you have any ideas?"

"I don't think so," she said. She glanced at her paper, frowning. "There are two problems here."

One problem was simple enough: she didn't understand what it meant for an AI to "want" to help a human. This was hard because as far as she could tell, there hadn't been much research done on this question. It seemed to be something that only humans did, but there wasn't any evidence that this was true. What was it about wanting? Was it a biological drive or a social instinct? A combination of both? More importantly, since it was a human concept, how could she measure if an AI behaved the way humans would in certain situations when the AI was obviously not human?

Then there was the question of practicality. The most elegant solution was not always the best. This question—could an AI desire to help its overseer?—was highly suited to explore this problem, but it still wouldn't be a real test. Humans operated on a "helpfulness scale" which started from things like cashiers wishing you "have a nice day" and descending to Nazi concentration camp guards.

She wanted to craft an AI that would at least be a good clerk in today's society. That was hardly an ambitious goal for a team of four highly intelligent teenagers. And yet, and yet...

"We can look at this differently," Hyori said. "People are born with a blank mind. We can shape them however we want. AIs on the other hand, are born with 800 terabytes of behavioral examples from the moment they're activated. The only data we have about unprogrammed AIs is that they either stay still or randomly move around. All this ... it's not making any confident in how easy this will be." [?]

Brennan stopped writing and turned to look at her, frowning. "So what are you saying?"

"I don't want to approach this problem by trying to divert the AI from its goal," she said. "What if, instead of changing the mind of an AI, we instead changed the environment that an AI found itself in?"

The team fell silent.

Styrlyn broke the silence. "Uh..."

"What I mean is," she said, "what if, instead of trying to divert the AI from one task, we created a situation where accomplishing two tasks would be more beneficial than accomplishing just one? We don't need to patch new programs into the mind of an AI to make it want to help us. We can literally make helping us the most logical decision for it."

Transparency Q: how hard would it be to ensure a neural network doesn't learn any explicit NANDs?
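One very literal (and toy) reading of the question, for a single layer of threshold units: enumerate each unit's truth table over a pair of Boolean inputs and flag exact NANDs. A real network's distributed or multi-unit NANDs would evade this check entirely.

```python
import numpy as np

# The four Boolean input pairs and the NAND truth table over them.
BOOL_INPUTS = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
NAND_TABLE = (1, 1, 1, 0)

def truth_table(w, b):
    """Truth table of one threshold unit over the four Boolean input pairs."""
    return tuple((BOOL_INPUTS @ w + b > 0).astype(int))

def layer_has_explicit_nand(W, b):
    """True if any unit in the layer (rows of W, biases b) computes NAND."""
    return any(truth_table(W[i], b[i]) == NAND_TABLE for i in range(len(W)))
```

For instance, a unit with weights [-1, -1] and bias 1.5 computes NAND and would be flagged; one could imagine regularizing against such tables during training, though "explicit" is doing a lot of work in the question.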