Lukas Finnveden — AI Alignment Forum

AI ALIGNMENT FORUM
AF

(I think epoch's paper on this takes a different approach and suggests an outside view of hyperbolic growth lasting for ~1.5y OOMs without bottlenecks, because that was the amount grown between the agricultural evolution and the population bottleneck starting. That feels weaker to me than looking at more specific hypotheses of bottlenecks, and I do think epoch's overall view is that it'll likely be more than 1.5 OOMs. But wanted to flag it as another option for an outside view estimate.)

The Industrial Explosion

Lukas Finnveden9d10

I do feel like, given the very long history of sustained growth, it's on the sceptic to explain why their proposed bottleneck will kick in with explosive growth but not before. So you could state my argument as: raw materials never bottlenecked growth before; no particular reason they would just bc growth is faster bc that faster growth is driven by having more labour+capital which can be used for gathering more resources; so we shouldn't expect raw materials to bottleneck growth in the future.

Gotcha. I think the main thing that's missing from this sort of argument (for me to be happy with it) is some quantification of our evidence. Growth since 10k years ago has been 4-5 OOMs, I think, and if you're just counting since the industrial revolution maybe it's going to be a bit more than half of that.

So with that kind of outside view, it would indeed be surprising if we ran into resource bottlenecks in our next OOM of growth, and <50% (but not particularly surprising) if we ran into resource bottlenecks in the next 3 OOMs of growth.

The Industrial Explosion

Lukas Finnveden13d*20

It’s possible that resource constraints are a bottleneck, and this is an important area for further research, but our guess is that they won’t be. Historically, resource bottlenecks have never capped GDP growth – they’ve been circumvented through a combination of efficiency improvements, resource substitutions, and improved mining capabilities.

Well, most of human history was spent at the malthusian limit. With infinite high-quality land to expand into, we'd probably have been growing at much, much faster rates through human history.

(It's actually kind of confusing. Maybe all animals would've evolved to exponentially blow up as fast as possible? Maybe humans would never have evolved because our reproduction is simply too slow? It's actually kind of hard to design a situation where you never have to fight for land, given that spatial expansion is at most square or cubic, which is slower than the exponential rate at which reproduction could happen.)

Maybe you mean "resource limits have never put a hard cap on GDP", which seems true. Though this seems kind of like a fully general argument — nothing has ever put a hard cap on GDP, since it's still growing.

Edit: Hm, maybe historical land constraints at the malthusian limit has mostly been about energy, though, rather than raw materials? Ie: If you doubled Earth's size without doubling any valuable materials — just allowing Earth to absorb more sunlight, maybe that would be almost as good as doubling Earth in its entirety. That seems more plausible. Surely growth would've been at least a bit faster if we never run out of high-quality sources of any raw material, but I'm not sure how much of a difference it would make.

It's a bit of a confusing comparison to make. If we doubled Earth's area (and not resources) now, that would scarcely make a difference at all, but if it had been twice as large for millions of years, then maybe plants and animal life would've spread to the initially-empty spaces, making it potentially useable.

Alignment remains a hard, unsolved problem

Lukas Finnveden19d*21

”The skills won't stop being complementary” — in what sense will they be complementary when the AIs are better at everything?

”The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”

As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.

I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!

Now, the first thing to check to test that story: Were the AI assistants trained with human-expert labels? (Including: weak humans with access to google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.

I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features ”correctly-answering expert human” personas, which are possible to elicit from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.

I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.

(I do think that persona-manipulation and more broadly "generalization science" is still interesting. But I wouldn't say it's doing a lot to tackle outer alignment operationalized as "the problem of overseeing systems that are smarter than you are".)

Alignment remains a hard, unsolved problem

Lukas Finnveden19d33

Does the 'induction step' actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model.

Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there's no reason why they should get such a big boost from being paired with humans.

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Lukas Finnveden4mo30

You don't contradict this at any point (and at various point you mention that some of these are considerations that feed into feasibility) but it might be worth explicitly flagging: we should expect to face even greater problems with sandbagging in cases where (i) we can't reliably recognize good performance, (ii) sample-efficiency is very low, and/or (iii) we don't have a lot of data points. In those cases, the models don't even need to exploration hack. I think a lot of important tasks have these properties.

(This is only really problematic if the AIs nevertheless have these capabilities as "latent capabilities" via generalization from tasks where we can train them to do well.)

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Lukas Finnveden4mo30

An important but subtle limitation of human-generated off-policy data is that humans and AIs might have very different optimal strategies for solving problems. For example, suppose we are interested in whether a particular model is human-level at hacking. We train it to imitate human data, and it fails. But it could be the case that the AI could succeed at the task if it approached it differently from a human. Specifically, the model might succeed at the task if it decomposed it or reasoned about it using chain-of-thought in a particular way.

Paul wrote a bit about this problem.

Mimicry and meeting halfway proposes an algorithm where:

We’ll be able to teach Arthur [the AI] to achieve the task X if it can be achieved by the “intersection” of Arthur [the AI] and Hugh [the human] — we’ll define this more precisely later, but note that it may be significantly weaker than either Arthur or Hugh.

(I think Elaborations on apprenticeship learning is also relevant, but I'm not sure if it says anything important that isn't covered in the above post. There might also be other relevant posts.)

Biology-Inspired AGI Timelines: The Trick That Never Works

Lukas Finnveden4mo10

but even if I've made two errors in the same direction, that only shifts the estimate by 7 years or so.

Where does he say this? On page 60, I can see Moravec say:

Nevertheless, my estimates can be useful even if they are only remotely correct. Later we will see that a thousandfold error in the ratio of neurons to computations shifts the predicted arrival time of fully intelligent machines a mere 20 years.

Which seems much more reasonable than claiming 7-year precision.

AI-enabled coups: a small group could use AI to seize power

Lukas Finnveden4mo10

The international race seems like a big deal. Ending the domestic race is good, but I'd still expect reckless competition I think.

I was thinking that AI capabilities must already be pretty high by the time an AI-enabled coup is possible. If one country also had a big lead, then probably they would soon have strong enough capabilities to end the international race too. (And the fact that they were willing to coup internally is strong evidence that they'd be willing to do that.)

But if the international race is very tight, that argument doesn't work.

I don't think the evidential update is that strong. If misaligned AI found it convenient to take over the US using humans, why should we expect them to immediately cease to find humans useful at that point? They might keep using humans as they accumulate more power, up until some later point.

Yeah, I suppose. I think this gets into definitional issues about what counts as AI takeover and what counts as human takeover.

For example: If, after the coup, the AIs are ~guaranteed to eventually come out on top, and they're just temporarily using the human leader (who believe themselves to be in charge) because it's convenient for international politics — does that count as human takeover or AI takeover?

If it counts as "AI takeover", then my argument would apply. (Saying that "AI takeover" would be much less likely after successful "human takeover", but also that "human takeover" mostly takes probability mass from worlds where takeover wasn't going to happen.)
If it counts as "human takeover", then my argument would not apply, and "AI takeover" would be pretty likely to happen after a temporary "human takeover".

The practical upshot for how much "human takeover" ultimately reduces the probability of "AI takeover" would be the same.

AI-enabled coups: a small group could use AI to seize power

Lukas Finnveden4mo40

That's a good point.

I think I agree that, once an AI-enabled coup has happened, the expected remaining AI takeover risk would be much lower. This is partly because it ends the race within the country where the takeover happened (though it wouldn't necessarily end the international race), but also partly just because of the evidential update: apparently AI is now capable of taking over countries, and apparently someone could instruct the AIs to do that, and the AIs handed the power right back to that person! Seems like alignment is working.

Related to that evidential update: I would disagree that "human takeover equally reduces the probability of AI takeover, and no takeover from either AI or small groups of humans". I think it disproportionately reduces the probability of "no takeover from either AI or small groups of humans". Because I think it's likely that, if a human attempts an AI-enabled coup, they would simultaneously make it very easy for misaligned systems to seize power on their own behalf. (Because they'll have to trust the AI to seize power, and they can't easily use humans to control the AIs at the same time, because most humans are opposed to their agenda.) So if the AIs don't take over on their own behalf, and instead gives the power back to the coup-leader, I think that suggests that alignment was going pretty well, and that AI takeover would've been pretty unlikely either way.

But here's something I would agree with: If you think human takeover is only 1/10th as bad as AI takeover, you have to be pretty careful about how coup-preventing interventions affect the probability of AI takeover when analyzing whether they're overall good. I think this is going to vary a lot between different interventions. (E.g. one thing that could happen is that you could get much more alignment audits because the govt insists on them as a measure of protecting against human-led coups. That'd be great. But I think other interventions could increase AI takeover risk.)

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments