Previously "Lanrian" on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
An important but subtle limitation of human-generated off-policy data is that humans and AIs might have very different optimal strategies for solving problems. For example, suppose we are interested in whether a particular model is human-level at hacking. We train it to imitate human data, and it fails. But the AI might still be able to succeed at the task if it approached it differently from a human. Specifically, the model might succeed if it decomposed the task or reasoned about it using chain-of-thought in a particular way.
Paul wrote a bit about this problem.
Mimicry and meeting halfway proposes an algorithm where:
We’ll be able to teach Arthur [the AI] to achieve the task X if it can be achieved by the “intersection” of Arthur [the AI] and Hugh [the human] — we’ll define this more precisely later, but note that it may be significantly weaker than either Arthur or Hugh.
(I think Elaborations on apprenticeship learning is also relevant, but I'm not sure if it says anything important that isn't covered in the above post. There might also be other relevant posts.)
but even if I've made two errors in the same direction, that only shifts the estimate by 7 years or so.
Where does he say this? On page 60, I can see Moravec say:
Nevertheless, my estimates can be useful even if they are only remotely correct. Later we will see that a thousandfold error in the ratio of neurons to computations shifts the predicted arrival time of fully intelligent machines a mere 20 years.
Which seems much more reasonable than claiming 7-year precision.
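For reference, here's the arithmetic implicit in that quote (the ~2-year doubling time below is just what the quoted figures imply, not something I've checked against the rest of the book):

```latex
% If effective compute per dollar doubles every T years, then a factor-of-k error
% in the compute required shifts the predicted arrival time by:
\Delta t = T \cdot \log_2 k
% Moravec's figures (k = 1000, \Delta t \approx 20 years) imply T \approx 2 years.
% At that rate, a ~7-year shift would correspond to k \approx 2^{3.5} \approx 11, not 1000.
```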
The international race seems like a big deal. Ending the domestic race is good, but I think I'd still expect reckless competition.
I was thinking that AI capabilities must already be pretty high by the time an AI-enabled coup is possible. If one country also had a big lead, then probably they would soon have strong enough capabilities to end the international race too. (And the fact that they were willing to coup internally is strong evidence that they'd be willing to do that.)
But if the international race is very tight, that argument doesn't work.
I don't think the evidential update is that strong. If misaligned AI found it convenient to take over the US using humans, why should we expect them to immediately cease to find humans useful at that point? They might keep using humans as they accumulate more power, up until some later point.
Yeah, I suppose. I think this gets into definitional issues about what counts as AI takeover and what counts as human takeover.
For example: If, after the coup, the AIs are ~guaranteed to eventually come out on top, and they're just temporarily using the human leader (who believes themselves to be in charge) because it's convenient for international politics — does that count as human takeover or AI takeover?
The practical upshot for how much "human takeover" ultimately reduces the probability of "AI takeover" would be the same.
That's a good point.
I think I agree that, once an AI-enabled coup has happened, the expected remaining AI takeover risk would be much lower. This is partly because it ends the race within the country where the takeover happened (though it wouldn't necessarily end the international race), but also partly just because of the evidential update: apparently AI is now capable of taking over countries, and apparently someone could instruct the AIs to do that, and the AIs handed the power right back to that person! Seems like alignment is working.
Related to that evidential update: I would disagree that "human takeover equally reduces the probability of AI takeover, and no takeover from either AI or small groups of humans". I think it disproportionately reduces the probability of "no takeover from either AI or small groups of humans". Because I think it's likely that, if a human attempts an AI-enabled coup, they would simultaneously make it very easy for misaligned systems to seize power on their own behalf. (Because they'll have to trust the AI to seize power, and they can't easily use humans to control the AIs at the same time, because most humans are opposed to their agenda.) So if the AIs don't take over on their own behalf, and instead give the power back to the coup-leader, I think that suggests that alignment was going pretty well, and that AI takeover would've been pretty unlikely either way.
But here's something I would agree with: If you think human takeover is only 1/10th as bad as AI takeover, you have to be pretty careful about how coup-preventing interventions affect the probability of AI takeover when analyzing whether they're overall good. I think this is going to vary a lot between different interventions. (E.g. one thing that could happen is that you could get much more alignment audits because the govt insists on them as a measure of protecting against human-led coups. That'd be great. But I think other interventions could increase AI takeover risk.)
To be clear: I'm not sure that my "supporting argument" above addressed an objection to Ryan that you had. It's plausible that your objections were elsewhere.
But I'll respond with my view.
If your argument is “brain-like AGI will work worse before it works better”, then sure, but my claim is that you only get “impressive and proto-AGI-ish” when you’re almost done, and “before” can be “before by 0–30 person-years of R&D” like I said.
Ok, so this describes a story where there's a lot of work to get proto-AGI and then not very much work to get superintelligence from there. But I don't understand what the argument is for thinking this is the case, vs. thinking that there's a lot of work to get proto-AGI and then also a lot of work to get superintelligence from there.
Going through your arguments in section 1.7:
Prior to having a complete version of this much more powerful AI paradigm, you'll first have a weaker version of this paradigm (e.g. you haven't yet figured out the most efficient way to do the brain algorithm, etc).
A supporting argument: Since evolution found the human brain algorithm, and evolution only does local search, the human brain algorithm must be built out of many innovations that are individually useful. So we shouldn't expect the human brain algorithm to be an all-or-nothing affair. (Unless it's so simple that evolution could find it in ~one step, but that seems implausible.)
Edit: Though in principle, there could still be a heavy-tailed distribution of how useful each innovation is, with one innovation producing most of the total value. (Even though the steps leading up to that were individually slightly useful.) So this is not a knock-down argument.
Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.
Here's why I think the information value could be really high: It's super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:
The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don't compromise competitiveness too much. (E.g. by coordinating.)
And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think "offering deals" will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
It also makes me a bit less concerned about the criterion: "It can be taught about the deal in a way that makes it stick to the deal, if we made a deal" (since we could get significant information in just one interaction).
I agree with this. My reasoning is pretty similar to the reasoning in footnote 33 in this post by Joe Carlsmith:
From a moral perspective:
- Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here for some discussion; and see here, here and here for some of my own reflections).
- But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.
From a prudential perspective:
- Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
- For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions imply. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
- Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.
Re the original reply ("don't negotiate with terrorists") I also think that these sorts of threats would make us more analogous to the terrorists (as the people who first started making grave threats which we would have no incentive to make if we knew the AI wasn't responsive to them). And it would be the AI who could reasonably follow a policy of "don't negotiate with terrorists" by refusing to be influenced by those threats.
This looks great.
Random thought: I wonder how iterating the noise & distill steps of UNDO (each round with a small alpha) compares against doing one noise step with a big alpha and then one distillation session. (If we hold compute fixed.)
Couldn't find any experiments on this when skimming through the paper, but let me know if I missed it.
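To spell out the compute-matched comparison I have in mind, here's a minimal sketch (not code from the paper). It assumes the noise step interpolates the student's weights toward a fresh random initialization with coefficient alpha, and the distill step trains the noised student to match the (already unlearned) teacher's outputs; all function names and hyperparameters are made up for illustration.

```python
import copy
import torch
import torch.nn.functional as F

def noise(student, random_init, alpha):
    """Move each student weight a fraction alpha of the way toward a random init."""
    with torch.no_grad():
        for p, r in zip(student.parameters(), random_init.parameters()):
            p.mul_(1 - alpha).add_(r, alpha=alpha)

def distill(student, teacher, batches, num_steps, lr=1e-4):
    """Train the student to match the teacher's output distribution via KL."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for _, batch in zip(range(num_steps), batches):
        with torch.no_grad():
            target = F.softmax(teacher(batch), dim=-1)
        loss = F.kl_div(F.log_softmax(student(batch), dim=-1), target,
                        reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

def undo_one_shot(teacher, random_init, batches, big_alpha, total_steps):
    """One big noise step, then all the distillation compute in a single session."""
    student = copy.deepcopy(teacher)
    noise(student, random_init, big_alpha)
    distill(student, teacher, batches, total_steps)
    return student

def undo_iterated(teacher, random_init, batches, small_alpha, rounds, total_steps):
    """Several small-alpha noise steps, splitting the same distillation compute."""
    student = copy.deepcopy(teacher)
    for _ in range(rounds):
        noise(student, random_init, small_alpha)
        distill(student, teacher, batches, total_steps // rounds)
    return student
```

The question is then whether `undo_iterated` with, say, rounds=4 and small_alpha=0.2 lands somewhere different on the robustness/capability trade-off than `undo_one_shot` with big_alpha=0.8, given the same total number of distillation steps.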
You don't contradict this at any point (and at various points you mention that some of these are considerations that feed into feasibility), but it might be worth explicitly flagging: we should expect to face even greater problems with sandbagging in cases where (i) we can't reliably recognize good performance, (ii) sample efficiency is very low, and/or (iii) we don't have a lot of data points. In those cases, the models don't even need to exploration hack. I think a lot of important tasks have these properties.
(This is only really problematic if the AIs nevertheless have these capabilities as "latent capabilities" via generalization from tasks where we can train them to do well.)