Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
Yes, to be clear, I agree that insofar as this question makes sense, the extrapolated volition would indeed end up basically ideal by your lights.
Regardless, the whole point of my post is exactly that I think we shouldn't over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
Cool, that makes sense. FWIW, I interpreted the overall essay to be more like "Alignment remains a hard unsolved problem, but we are on a pretty good track to solve it", and this sentence as evidence for the "pretty good track" part. I would be kind of surprised if that wasn't why you put that sentence there, but this kind of thing seems hard to adjudicate.
Do we then say that Claude's extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
But in that case, wouldn't a rock with "just ask Evan" written on it be even better than Claude? Like, I felt confident that you were talking about Claude's extrapolated volition in the absence of humans, since making Claude into a rock that, when asked about ethics, just has "ask Evan" written on it does not seem like any relevant evidence about the difficulty of alignment, or its historical success.
This comment was downvoted by a lot of people (at this time, 2 overall karma with 19 votes). It shouldn't have been, and I personally believe this is a sign of people being attached to AI x-risk ideas, and of those ideas contributing to their entire persona, rather than of strict disagreement. This is something I bring to conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
I generally think it makes sense for people to have pretty complicated reasons why they think something should be downvoted. I think this applies even more to longer content, which would often require an enormous amount of effort to respond to explicitly.
I have some sympathy for being sad here if a comment ends up highly net-downvoted, but FWIW, I think 2 karma feels vaguely in the right vicinity for this comment. Maybe I would upvote it to +6, but I would indeed be sad to see it at +20 or whatever, since I do think it's doing something pretty tiring and hard to engage with. Directional downvoting is a totally fine use of downvoting, and if you think a comment is overrated but not bad, please downvote it until its karma reflects where you want it to end up!
(This doesn't mean it doesn't make sense to do sociological analysis of cultural trends on LW using downvoting, but I do want to maintain the cultural locus where people can have complicated reasons for downvoting and where statements like "if you disagree strongly with the above comment you should force yourself to outline your views" aren't frequently made. The whole point of the vote system is to get signal from people without forcing them to do huge amounts of explanatory labor. Please don't break that part)
I mean, in my case the issue is not that it hallucinated, it's that it hallucinated in a way that was obviously optimized to look good to me.
Like, if the LLMs just sometimes randomly made up stuff, that would be fine, but in cases like this they will very confidently make up stuff that really looks exactly like the kind of thing that would get them high RL reward if it was real, and then also kind of optimize things to make it look real to me.
It seems very likely that the LLM "knew" that it couldn't properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite it of course being obviously very relevant to my interests.
Sure, here is an example of me trying to get it to extract quotes from a big PDF: https://chatgpt.com/share/6926a377-75ac-8006-b7d2-0960f5b656f1
It's not fully apparent from the transcript, but basically all the quotes from the PDF are fully made up. And emphasizing that it should please give me actual quotes produced just more confabulated quotes. And of course those quotes really look like they are getting me exactly what I want!
They know they're not real on reflection, but not as they're doing it. It's more like fumbling and stuttering than strategic deception.
I will agree that making up quotes is literally dishonest, but it's not purposeful, deliberate deception.
But the problem is when I ask them "hey, can you find me the source for this quote" they usually double down and cite some made-up source, or they say "oh, upon reflection this quote is maybe not quite real, but the underlying thing is totally true" when like, no, the underlying thing is obviously not true in that case.
I agree this is the model lying, but it's a very rare behavior with the latest models.
I agree that literally commenting out tests is now rare, but other versions of this are still quite common. Semi-routinely, when I give AIs tasks that are too hard, they will eventually just do some other task that on a surface level looks like it got the job done, but clearly isn't doing the real thing (like leaving a function unimplemented, or avoiding doing some important fetch and using stub data instead). And it's clearly not the case that the AI doesn't know that it didn't do the task, because at that point it might have spent 5+ minutes and 100,000+ tokens slamming its head against the wall trying to do it, and then at the end it just says "I have implemented the feature! You can see it here. It all works. Here is how I did it...", without drawing any attention to how it cut corners after slamming its head against the wall for 5+ minutes.
I mean, the models are still useful!
But especially when it comes to the task of "please go and find me quotes or excerpts from articles that show the thing that you are saying", the models really seem to do something closer to "lying". This is a common task I ask the LLMs to perform because it helps me double-check what the models are saying.
And like, maybe you have a good model of what is going on with the model that isn't "lying", but I haven't heard a good explanation. It seems to me very similar to the experience of having a kind of low-integrity teenager just make stuff up to justify whatever they said previously, and then when you pin them down, they flip and say "of course, you are totally right, I was wrong, here is another completely made up thing that actually shows the opposite is true".
And these things are definitely quite trajectory dependent. If you end up asking an open-ended question where the model confabulates some high-level take, and then you ask it to back that up, then it goes a lot worse than if you ask it for sources and quotes from the beginning.
Like, none of this seems very oriented toward long-term scheming, but it's also really obvious to me that the model isn't trying that hard to do what I want.
It’s really difficult to get AIs to be dishonest or evil by prompting
I am very confused about this statement. My models lie to me every day. They make up quotes they very well know aren't real. They pretend that search results back up the story they are telling. They will happily lie to others. They comment out tests, and pretend they solved a problem when it's really obvious they haven't.
I don't know how much this really has to do with what these systems will do when they are superintelligent, but this sentence really doesn't feel anywhere remotely close to true.
But like, the "accurate understanding" of the situation here relies on the fact that the reward-hacking training was intended. The kinds of scenarios we are most worried about are (for example) RL environments where scheming-like behavior is indeed important for performance (most obviously in adversarial games, but honestly almost anything with a human rater will probably eventually produce this).
And in those scenarios we can't just "clarify that exploitation of reward in this training environment is both expected and compatible with broadly aligned behavior". Like, it might genuinely not be "expected" by anyone, and whether it's "compatible with broadly aligned behavior" feels like an open question, and in some sense a weird self-fulfilling-prophecy statement that the AI will eventually have its own opinions on.
Promoted to curated: I do think this post summarizes one of the basic intuition generators for predicting AI motivations. It's missing a lot of important stuff, especially as AIs get more competent (in particular, it doesn't cover reflection, which I expect to be among the primary dynamics shaping the motivations of powerful AIs), but it's still quite helpful to have it written up in more explicit form.