Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
They know they're not real on reflection, but not as they're doing it. It's more like fumbling and stuttering than strategic deception.
I will agree that making up quotes is literally dishonest but it's not purposeful deliberate deception.
But the problem is that when I ask them "hey, can you find me the source for this quote?", they usually double down and cite some made-up source, or they say "oh, upon reflection this quote is maybe not quite real, but the underlying thing is totally true", when, no, the underlying thing is obviously not true in that case.
I agree this is the model lying, but it's a very rare behavior with the latest models.
I agree that literally commenting out tests is now rare, but other versions of this are still quite common. Semi-routinely, when I give AIs tasks that are too hard, they will eventually just do some other task that on a surface level looks like it got the task done, but clearly isn't doing the real thing (like leaving a function unimplemented, or avoiding doing some important fetch and using stub data). And it's clearly not the case that the AI doesn't know that it didn't do the task, because at that point it might have spent 5+ minutes and 100,000+ tokens slamming its head against the wall trying to do it, and then at the end it just says "I have implemented the feature! You can see it here. It all works. Here is how I did it...", without drawing any attention to how it cut corners after slamming its head against the wall for 5+ minutes.
I mean, the models are still useful!
But especially when it comes to the task of "please go and find me quotes or excerpts from articles that show the thing that you are saying", the models really do something that seems closer to "lying". This is a common task I ask the LLMs to perform because it helps me double-check what the models are saying.
And like, maybe you have a good model of what is going on with the model that isn't "lying", but I haven't heard a good explanation. It seems to me very similar to the experience of having a kind of low-integrity teenager just kind of make stuff up to justify whatever they said previously, and then when you pin them down, they flip and say "of course, you are totally right, I was wrong, here is another completely made up thing that actually shows the opposite is true".
And these things are definitely quite trajectory-dependent. If you end up asking an open-ended question where the model confabulates some high-level take, and then you ask it to back that up, things go a lot worse than if you ask it for sources and quotes from the beginning.
Like, none of this seems very oriented toward long-term scheming, but it's also really obvious to me that the model isn't trying that hard to do what I want.
It’s really difficult to get AIs to be dishonest or evil by prompting
I am very confused about this statement. My models lie to me every day. They make up quotes they very well know aren't real. They pretend that search results back up the story they are telling. They will happily lie to others. They comment out tests, and pretend they've solved a problem when it's really obvious they haven't.
I don't know how much this really has to do with what these systems will do when they are superintelligent, but this sentence doesn't feel anywhere remotely close to true.
But like, the "accurate understanding" of the situation here relies on the fact that the reward-hacking training was intended. The scenarios we are most worried about are, for example, RL environments where scheming-like behavior is indeed important for performance (most obviously in adversarial games, but honestly almost anything with a human rater will probably eventually reward this kind of behavior).
And in those scenarios we can't just "clarify that exploitation of reward in this training environment is both expected and compatible with broadly aligned behavior". Like, it might genuinely not be "expected" by anyone, and whether it's "compatible with broadly aligned behavior" feels like an open question, and in some sense a weird self-fulfilling-prophecy statement that the AI will eventually have its own opinions on.
This seems like a good empirical case study and I am glad you wrote it up!
This is “inoculation prompting”: framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization.
That said, I can't be the only one who thinks this is somewhat obviously a kind of insane way of solving this problem. Like, this really doesn't seem like the kind of technique that generalizes to superintelligence, or even the kind of technique we can have any confidence will generalize to more competent future models (at which point they might be competent and misaligned enough to scheme and hide their misalignment, making patch-strategies like this no longer viable).
Maybe there is some more elegant core here that should give us confidence this will generalize more than it sounds like to me on a first pass, or maybe you cover some of this in the full paper which I haven't read in detail. I was just confused that this blogpost doesn't end with "and this seems somewhat obviously like a super hacky patch and we should not have substantial confidence that this will work on future model generations, especially ones competent enough to pose serious risks".
Thus, the numbers I give below are somewhat more optimistic than what you'd get just given the level of political will corresponding to each of these scenarios (as this will might be spent incompetently).
FWIW, for at least plan A and plan B, I feel like the realistic multiplier on how optimistic these numbers are is at least 3x? Like, I don't see an argument for this kind of plan working with 90%+ probability given realistic assumptions about execution quality.
(I also have disagreements about whether this will work, but at least plan A, well-executed, seems like it would notice when it was starting to be very reckless and then be in a good position to slow down more.)
Promoted to curated! Empirical investigation of this kind of stuff seems surprisingly underdone, and pretty easy to do, so I am very glad you gave it a shot. I also thought the post was very clear and well-written with a good mixture of high-quality technical exposition, humor, flow and broader context.
But even if human takeover is 1/10 as bad as AI takeover, the case for working on AI-enabled coups is strong due to its neglectedness.
The logic for this doesn't check out, I think? If human takeover is 1/10 as bad as AI takeover, and human takeover pre-empts AI takeover (because it ends the race and competitiveness dynamics giving rise to most of the risk of AI takeover), then a human takeover might be the best thing that could happen to humanity. This makes the case for working on AI-enabled coups particularly weak.
If by 1/10th as bad we mean "we lose 10% of the value of the future, as opposed to ~100% of the value of the future", then increasing the marginal probability of human takeover seems great as long as you assign >10% probability to AI takeover[1] (see the arithmetic sketched below), which I think most people who have thought a lot about AI risk do.
And you expect the risks to be uncorrelated, i.e. that a marginal increase in the probability of human takeover comes proportionally out of both AI-takeover worlds and worlds with no takeover (by either AIs or small groups of humans).
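To spell out the arithmetic behind the >10% threshold (a minimal sketch under the footnote's assumption; here $p$ is the probability of AI takeover, and I normalize the value of no takeover to 1, AI takeover to 0, and human takeover to 0.9):

$$\Delta\text{EV} \;\propto\; \underbrace{0.9}_{\text{human takeover}} \;-\; \underbrace{\big(p \cdot 0 + (1-p) \cdot 1\big)}_{\text{expected value of the displaced mix}} \;=\; p - 0.1$$

This is positive exactly when $p > 0.1$, i.e. when you assign more than 10% probability to AI takeover.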
I think there's a very important difference between the model adopting the goal it is told in context, and the model having some intrinsic goal that transfers across contexts (even if it's the one we roughly intended)
I think this is the point where we disagree. Or like, it feels to me like an orthogonal dimension that is relevant for some risk modeling, but not at the core of my risk model.
Ultimately, even if an AI were to re-discover the value of convergent instrumental goals each time it gets instantiated into a new session/context, that would still get you approximately the same risk model. Like, in a more classical AIXI-ish model, you can imagine a system instantiated with a different utility function each time. Those utility functions will still almost always be best achieved by pursuing convergent instrumental goals, and so the pursuit of those goals will be a consistent feature of all of these systems, even if the terminal goals of the system are not stable.
Of course, any individual AI system with a different utility function, especially insofar as the utility function has a process component, might not pursue every single convergent instrumental goal, but they will all behave in broadly power-seeking, self-amplifying, and self-preserving ways, unless they are given a goal that very directly conflicts with one of these.
In this context, there is no "intrinsic goal that transfers across contexts". It's just each instantiation of the AI realizing that convergent instrumental goals are best for approximately all goals, including the one it has right now, and starting to pursue them. No need for continuity in goals, or self-identity, or anything like that.
(Happy to also chat about this some other time. I am not in a rush, and something about this context feels a bit confusing or is making the conversation hard.)
Sure, here is an example of me trying to get it to extract quotes from a big PDF: https://chatgpt.com/share/6926a377-75ac-8006-b7d2-0960f5b656f1
It's not fully apparent from the transcript, but basically all the quotes from the PDF are fully made up. And emphasizing that it should please give me actual quotes just produced more confabulated quotes. And of course those quotes really look like they are getting me exactly what I want!