Formal Solution to the Inner Alignment Problem

I interpreted this bit as talking about RL

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass, and what we're left with is a modeling routine that is heuristically considering (and later comparing) very different models. And this should include any model that a human would consider.

I think that is main thread of our argument, but now I'm curious if I was totally off the mark about Q-learning and policy gradient.

but overall I think these sorts of differences are pretty minor and shouldn't affect whether these approaches can reach general intelligence or not.

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won't, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren't there some policy gradient methods that don't face this problem?

Formal Solution to the Inner Alignment Problem

So, if I understand the paper right, if  was still in the set of top policies at time , the agent would not take the action to kill the demonstrator, but ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous turn policies out of the top set.

This is exactly right.

But I am also reading in the abstract of the paper that

[...] queries to the demonstrator rapidly diminish in frequency

The reason this is consistent is that queries will become infrequent, but they will still be well-timed. The choice of whether to query depends on what the treacherous models are doing. So if the treacherous models wait for a long time,  then we will have a long time where no queries are needed, and as soon as they decide to be treacherous, we thwart that plan with a query to the demonstrator. So it does not imply that

over time, it is likely that  might disappear from the top set


Does this thought experiment look reasonable or have I overlooked something?

Yep, perfectly reasonable!

What about the probability that  is still in the set of top policies at time ?

If we set  small enough, we can make it arbitrarily like that  never leaves the set of top policies.

Formal Solution to the Inner Alignment Problem

That's possible. But it seems like way less of a convergent instrumental goal for agents living in a simulated world-models. Both options--our world optimized by us and our world optimized by a random deceptive model--probably contain very little of value as judged by agents in another random deceptive model.

So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.

Formal Solution to the Inner Alignment Problem

This is very nice and short!

And to state what you left implicit:

If , then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift the outcome by , so we assume it is safe to output 0. And likewise with outputting 1.

I'm pretty sure removing those is mostly just a technical complication

One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends to 0 can still have an infinite sum, which is not true when probabilities must , and this sometimes causes trouble. I'm not sure this would raise any issues here--just registering a slightly differing intuition.

Formal Solution to the Inner Alignment Problem

I certainly don't think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.

Okay I think we've switched from talking about Q-learning to talking about policy gradient. (Or we were talking about the latter the whole time, and I didn't notice it). The question that I think is relevant is: how are possible world-models being hypothesized and analyzed? That's something I expect to be done with messy heuristics that sometimes have discontinuities their sequence of outputs. Which means I think that no reasonable DQN is will be generally intelligent (except maybe an enormously wide one attention-based one, such that finding models is more about selective attention at any given step than it is about gradient descent over the whole history).

A policy gradient network, on the other hand, could maybe (after having its parameters updated through gradient descent) become a network that, in a single forward pass, considers diverse world-models (generated with a messy non-local heuristic), and analyzes their plausibility, and then acts. At the end of the day, what we have is an agent modeling world, and we can expect it to consider any model that a human could come up with. (This paragraph also applies to the DQN with a gradient-descent-trained method for selectively attending to different parts of a wide network, since that could amount to effectively considering different models).

Formal Solution to the Inner Alignment Problem

If you are willing to query only a billion points, and there are a billion competent treacherous models, and if the game is zero-sum between you and the models, then you're going to get at least as screwed as by the billionth-best opportunity for treachery.

The treacherous models make their decisions first, and whether we query at a timestep depends on whether they chose that moment to be treacherous. We don't have to say: this timestep is one of the billion best for treachery--guess we had better query. The decision process a treacherous model faces is:

  1. From past observations, does it seem alpha is low enough that an honest model is in the top few?
  2. If no, [let's just assume whatever happens here is bad]
  3. If yes, is the honest model going to leave the top few any time?
  4. If yes, wait. [let's assume this ends badly]
  5. If no, admit defeat, because whenever you err, that will be when a query gets triggered. Salvage the situation by trying to get as much mileage as you can by decreasing probability on some action the demonstrator would take that you don't like much in order to shift probability mass to some other action they would take that you like a little better. OR
  6. Hope that 1) other treacherous models, like soldiers on the front line storming the enemy and facing certain death so that their brothers behind them can win the battle, give erring a go anyway. And hope that 2) the operators quietly increase alpha after the billionth query. Acting on this hope, wait. (The front-line treacherous models are not entirely self-sacrificial; they might figure they might as well hope they were wrong about alpha. But the models know when queries happen, because every observation they get comes from a query, and it wouldn't take long to figure out that alpha really truly isn't high enough for them to execute treachery successfully).

Above in the thread, I was talking about the game theory of step 6 (although I maybe wasn't making the case as strongly as I could have). Which models would ever volunteer? They don't care about the long term goal of exhausting our query resources so other treacherous models can capitalize on a quietly increased alpha. All that is to say, I don't follow the comment I excerpted.

(In case this comes up later: I think about the long-run percentage of query-timesteps that we would tolerate, not the total number. So if we ever have cause to consider a case where attempted treachery "merely gets delayed" but still eventually happens, and still results in a query, then depending on how much it is delayed, that could still qualify as win).

Formal Solution to the Inner Alignment Problem

Does it make sense to talk about , which is like  in being treacherous, but is uses the true model  instead of the honest model ? I guess you would expect  to have a lower posterior than ?

Formal Solution to the Inner Alignment Problem

If we ever produce generally intelligent predictors (or "accurate world-models" in the terminology we've been using so far), we will need a process that is much more efficient than evolution.

But also, I certainly don't think that in order to be generally intelligent you need to start with a generally intelligent subroutine. Then you could never get off the ground. I expect good hypothesis-generation / model-proposal to use a mess of learned heuristics which would not be easily directed to solve arbitrary tasks, and I expect the heuristic "look for models near the best-so-far model" to be useful, but I don't think making it ironclad would be useful.

Another thought on our exchange:

Me: we can expect a good heuristic search through models will identify any model that a human could hypothesize

You: I think I would expect essentially all models that a human could hypothesize to be in the search space—but if you're doing a local search, then you only ever really see the easiest to find model with good behavior

If what you say is correct, then it sounds like exclusively-local search precludes human-level intelligence! (Which I don't believe, by the way, even if I think it's a less efficient path). One human competency is generating lots of hypotheses, and then having many models of the world, and then designing experiments to probe those hypotheses. It's hard for me to imagine that an agent that finds an "easiest-to-find model" and then calls it a day could ever do human-level science. Even something as simple as understanding an interlocuter requires generating diverse models on the fly: "Do they mean X or Y with those words? Let me ask a clarfying question."

I'm not this bearish on local search. But if local search is this bad, I don't think it is a viable path to AGI, and if it's not, then the internals don't for the purposes of our discussion, and we can skip to what I take to be the upshot:

we can expect a good heuristic search through models will identify any model that a human could hypothesize

Formal Solution to the Inner Alignment Problem

Specifically, they cooperate in that they perfectly mimic the true model up until the point where...

This thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario.

up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted

if alpha is low enough, this won't ever happen, and if alpha is too high, it won't take very long. So I don't think this scenario is quite right.

Then the question becomes, for an alpha that's low enough, how long will it take until queries are infrequent, noting that you need a query any time any treacherous model with enough weight decides to take a treacherous turn?

Formal Solution to the Inner Alignment Problem

So would you say you disagree with the claim

I think that arbitrary limits on heuristic search of the form "the next model I consider must be fairly close to the last one I did" will not help it very much if it's anywhere near smart enough to merit membership in a generally intelligent predictor.


Load More