## AI ALIGNMENT FORUM

magfrump

Mathematician turned software engineer. I like swords and book clubs.

# Posts

The case for aligning narrowly superhuman models

Looks like the initial question was here and a result around it was posted here. At a glance I don't see the comments with counterexamples, and I do see a post with a formal result, which seems like a direct contradiction to what you're saying, though I'll look in more detail.

Coming back to the scaling question, I think I agree that multiplicative scaling over the whole model size is obviously wrong. To be more precise, if there's something like a Q-learning inner optimizer for two tasks, then you need the cross product of the state spaces, so the size of the Q-space could scale close-to-multiplicatively. But the model that condenses the full state space into the Q-space scales additively, and in general I'd expect the model part to be much bigger--like the Q-space has 100 dimensions and the model has 1 billion parameters, so adding a second model of 1 billion parameters and increasing the Q-space to 10k dimensions is mostly additive in practice, even if it's also multiplicative in a technical sense.
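To make the arithmetic concrete, here is a rough sketch using only the illustrative numbers from the comment above. It assumes, purely for illustration, that a Q-space dimension and a model parameter can be counted in the same "size" units; `total_size` is a hypothetical helper, not a claim about how any real model is laid out.

```python
# Illustrative numbers from the comment. Counting Q-space dimensions and model
# parameters in the same units is an assumption made only for this sketch.
MODEL_PARAMS = 1_000_000_000  # parameters in one task's state-condensing model
Q_DIMS = 100                  # Q-space dimensions for a single task


def total_size(n_tasks: int) -> int:
    """Per-task models add up; the joint Q-space is a cross product."""
    return n_tasks * MODEL_PARAMS + Q_DIMS ** n_tasks


one_task = total_size(1)       # 1 billion parameters + 100 dimensions
two_tasks = total_size(2)      # 2 billion parameters + 10k dimensions
growth = two_tasks / one_task  # very close to 2, i.e. mostly additive

print(one_task, two_tasks, round(growth, 6))
```

Under these made-up numbers the cross-product term only overtakes the additive term at around five combined tasks, which is one way to read "mostly additive in practice, multiplicative in a technical sense."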

I'm going to update my probability that "GPT-3 can solve X, Y implies GPT-3 can solve X+Y," and take a closer look at the comments on the linked posts. This also makes me think that it might make sense to try to find simpler problems, even already-mostly-solved problems like Chess or algebra, and try to use this process to solve them with GPT-2, to build up the architecture and search for possible safety issues in the process.

The case for aligning narrowly superhuman models

I'm replying on my phone right now because I can't stop thinking about it. I will try to remember to follow up when I can type more easily.

I think the vague shape of what I think I disagree about is how dense GPT-3's sets of implicit knowledge are.

I do think we agree that GPT-5000 will be broadly superhuman, even if it just has a grab bag of models in this way, for approximately the reasons you give.

I'm thinking about "intelligent behavior" as something like the set of real numbers, and "human behavior" as covering something like rational numbers, so we can get very close to most real numbers but it takes some effort to fill in the decimal expansion. Then I'm thinking of GPT-N as being something like integers+1/N. As N increases, this becomes close enough to the rational numbers to approximate real numbers, and can be very good at approximating some real numbers, but can't give you incomputable numbers (unaligned outcomes) and usually won't give you duplicitous behavior (numbers that look very simple at first approximation but actually aren't, like .2500000000000004, which seems to be 1/4 but secretly isn't). I'm not sure where that intuition comes from but I do think I endorse it with moderate confidence.

Basically I think for minimal circuit reasons that if "useful narrowly" emerges in GPT-N, then "useful in that same domain but capable of intentionally doing a treacherous turn" emerges later. My intuition is that this won't be until GPT-(N+3) or more, so if you are able to get past unintentional turns like "the next commenter gives bad advice" traps, this alignment work is very safe, and important to do as fast as possible (because attempting it later is dangerous!)

In a world where GPT-(N+1) can do a treacherous turn, this is very dangerous, because you might accidentally forget to check if GPT-(N-1) can do it, and get the treacherous turn.

My guess is that you would agree that "minimal circuit that gives good advice" is smaller than "circuit that gives good advice but will later betray you", and therefore there exist two model sizes where one is dangerous and one is safe but useful. I know I saw posts on this a while back, so there may be relevant math about what that gap might be, or it might be unproven but with some heuristics of what the best result probably is.

My intuition is that combining narrow models is multiplicative, so that adding a social manipulation model will always add an order of magnitude of complexity. My guess is that you don't share this intuition. You may think of model combination as additive, in which case any model bigger than a model that can betray you is very dangerous, or you might think the minimal circuit for betrayal is not very large, or you might think that GPT-2-nice would be able to give good advice in many ways so GPT-3 is already big enough to contain good advice plus betrayal in many ways.

In particular if combining models is multiplicative in complexity, a model could easily learn two different skills at the same time, while being many orders of magnitude away from being able to use those skills together.

The case for aligning narrowly superhuman models

I think this is obscuring (my perception of) the disagreement a little bit.

I think what I'm saying is, GPT-3 probably doesn't have any general truth+noise models. But I would expect it to copy a truth+noise model from people, when the underlying model is simple.

I then expect GPT-3 to "secretly" have something like an interesting diagnostic model, and probably a few other narrowly superhuman skills.

But I would expect it to not have any kind of significant planning capacity, because that planning capacity is not simple.

In particular my expectation is that coherently putting knowledge from different domains together in generally useful ways is MUCH, MUCH harder than being highly superhuman in narrow domains. Therefore I expect Ajeya's approach to be both effective, because "narrowly superhuman" can exist, and reasonably safe, because the gap between "narrowly superhuman" or even "narrowly superhuman in many ways" and "broadly superhuman" is large so GPT-3 being broadly superhuman is unlikely.

Phrased differently, I am rejecting your idea of smartness-spectrum. My intuition is that levels of GPT-N competence will scale the way computers have always scaled at AI tasks--becoming usefully superhuman at a few very quickly, while taking much much longer to exhibit the kinds of intelligence that are worrying, like modeling human behavior for manipulation.

The case for aligning narrowly superhuman models

This seems like it's using the wrong ontology to me.

Like, in my mind, there are things like medical diagnostics or predictions of pharmaceutical reactions, which are much easier cognitive tasks than general conversation, but which humans are specialized away from.

For example, imagine the severity of side effects from a specific medication can be computed by figuring out 15 variables about the person and putting them into a neural network with 5000 parameters, and the output is somewhere in a six-dimensional space, and this model is part of a general model of human reactions to chemicals.

Then GPT-3 would be in a great position to use people's reddit posts talking about medication side effects to find this network. I doubt that medical science in our current world could figure that out meaningfully. It would be strongly superhuman in this important medical task, but nowhere near superhuman in any other conversational task.

My intuition is that most professional occupations are dominated by problems like this, that are complex enough that we as humans can only capture them as intuitions, but simple enough that the "right" computational solution would be profoundly superhuman in that narrow domain, without being broadly superhuman in any autonomous sense.

Maybe a different reading of your comment is something like, there are so many of these things that if a human had access to superhuman abilities across all these individual narrow domains, that human could use it to create a decisive strategic advantage for themself, which does seem possibly very concerning.

Hiring engineers and researchers to help align GPT-3

What is the expected time frame of the openings?

I am personally indisposed until ~end of October and may not be ready to start a new job for a little while after that, but would otherwise be very excited for such a role.

Somewhat related, do you have an idea of how many openings there will be? Like, fewer than 3 or more than 20, for example?

So one of the first thoughts I had when reading this was whether you can model any Radical Probabilist as a Bayesian agent that has some probability mass on "my assumptions are wrong" and will have that probability mass increase so that it questions its assumptions over a "reasonable timeframe" for whatever definition.

For the case of coin flips, there is a clear assumption in the naive model that the coin flips are independent of each other, which can be fairly simply expressed as $P(\text{flip}_i = H \mid \text{flip}_j = H) = P(\text{flip}_i = H \mid \text{flip}_j = T) \ \forall j < i$. In the case of the coin that flips 1 heads, 5 tails, 25 heads, 125 tails, just evaluating $j = i - 1$ through the 31st flip gives P(H | last flip heads) = 24/25 and P(H | last flip tails) = 1/5, which is unlikely at p=~1e-4. That is approximately the difference in Bayesian weight between the hypothesis H1: the coin flips heads 26/31 times (P(E|H1)=~1e-6) and H0: the coin flips heads unpredictably (1/2 the time, P(E|H0)=~4e-10), which is a better hypothesis in the long run until you expand your hypothesis space.
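The conditional frequencies and the two likelihoods quoted above can be reproduced with a few lines of Python. This is only a sanity check of the arithmetic for the first 31 flips; the helper name `heads_after` is made up for the sketch, and it does not attempt to reproduce the p=~1e-4 figure, since the comment does not say which test produced it.

```python
# First 31 flips of the sequence from the comment: 1 heads, 5 tails, 25 heads.
flips = "H" + "T" * 5 + "H" * 25


def heads_after(seq: str, prev: str) -> tuple[int, int]:
    """Return (heads, total) over flips whose immediate predecessor is `prev`."""
    following = [b for a, b in zip(seq, seq[1:]) if a == prev]
    return following.count("H"), len(following)


after_heads = heads_after(flips, "H")  # heads following a heads
after_tails = heads_after(flips, "T")  # heads following a tails

n = len(flips)
h = flips.count("H")

# H1: independent flips with P(heads) = 26/31 (the observed frequency).
lik_h1 = (h / n) ** h * (1 - h / n) ** (n - h)
# H0: independent fair flips.
lik_h0 = 0.5 ** n

print(after_heads, after_tails)
print(f"P(E|H1) = {lik_h1:.2e}, P(E|H0) = {lik_h0:.2e}")
```

The counts come out as 24 of 25 and 1 of 5, and the likelihoods land at roughly 1e-6 and 5e-10, in line with the rounded figures in the comment.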

So in this case, the "I don't have the hypothesis in my space" hypothesis actually wins out right around the 30th-32nd flip, possibly about the same time a human would be identifying the alternate hypothesis. That seems helpful!

However this relies on the fact that this specific hypothesis has a single very clear assumption and there is a single very clear calculation that can be done to test that assumption. Even in this case though, the "independence of all coin flips" assumption makes a bunch more predictions, like that coin flips two apart are independent, and so on. Calculating all of these may be theoretically possible but it's arduous in practice, and would give rise to far too much false evidence. For example, in real life there are often distributions that look a lot like normal distributions in the general sense that over half the data is within one standard deviation of the mean and 90% of the data is within two standard deviations, but where if you apply an actual hypothesis test of whether the data is normally distributed it will point out some ways that it isn't exactly normal (only 62% of the data is in this region, not 68%! etc.).
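As a toy version of the "looks normal in the loose sense, isn't exactly normal" point, the sketch below measures how much of a sample falls within one and two standard deviations of its mean. The evenly spaced (flat) sample is my own stand-in, not data from the comment, and `coverage` is a hypothetical helper; only the Python standard library is used.

```python
from statistics import NormalDist, fmean, pstdev


def coverage(data: list[float], k: float) -> float:
    """Fraction of `data` within k population standard deviations of its mean."""
    mu, sigma = fmean(data), pstdev(data)
    return sum(abs(x - mu) <= k * sigma for x in data) / len(data)


# Hypothetical sample: 1000 evenly spaced points, i.e. a flat distribution.
sample = [i / 999 for i in range(1000)]

within_1sd = coverage(sample, 1)
within_2sd = coverage(sample, 2)

# Reference fractions for an exactly normal distribution.
std_normal = NormalDist()
normal_1sd = std_normal.cdf(1) - std_normal.cdf(-1)
normal_2sd = std_normal.cdf(2) - std_normal.cdf(-2)

print(f"sample: {within_1sd:.3f}, {within_2sd:.3f}")
print(f"normal: {normal_1sd:.3f}, {normal_2sd:.3f}")
```

The flat sample passes the loose check (over half within one standard deviation, over 90% within two) while still missing the exact normal fractions, which is the kind of mismatch a strict normality test would flag.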

It seems like the idea of having a specific hypothesis in your space labeled "I don't have the right hypothesis in my space" can work okay under the following conditions:

1. You have a clearly stated assumption which defines your current hypothesis space

2. You have a clear statistical test which shows when data doesn't match your hypothesis space

3. You know how much data needs to be present for that test to be valid--both in terms of the minimum for it to distinguish itself so you don't follow conspiracy theories, and something like a maximum (maybe this will naturally emerge from tracking the probability of the data given the null hypothesis, maybe not).

I have no idea whether these conditions are reasonable "in practice" whatever that means, so I'm not really clear whether this framework is useful, but it's what I thought of and I want to share even negative results in case other people had the same thoughts.