"It’s really difficult to get AIs to be dishonest or evil by prompting, you have to fine-tune them."
This is much less of a killer argument if we expect increasing optimisation power to be applied over time.
When ChatGPT came out I was surprised by how aligned the model was relative to its general capabilities. This was definitely a significant update compared to what I expected from older AI arguments (say the classic story about a robot fetching a coffee and pushing a kid out of the way).
However, what I didn't realise at the time was that the main reason we weren't seeing misbehaviour was a lack of optimisation power. Whilst it may have seemed that you'd be able to do a lot with GPT-4-level agents in a loop, this mostly just resulted in them going around in circles. From casual use these models seemed a lot better at optimising than they actually were, because optimising over time required a degree of coherence that these agents lacked.
Once we started applying more optimisation power and reached the o1 series of models then we started seeing misbehaviour a lot more. Just to be clear, what we're seeing isn't quite a direct instantiation of the old instrumental convergence arguments. Instead what we're seeing is surviving[1] unwanted tendencies from the pretraining distribution being differentially selected for. In other words, it's more of a combination of the pretraining distribution and optimisation power as opposed to the old instrumental convergence arguments that were based on an idealisation of a perfect optimiser.
However, as we increase the amount of optimisation power, the instrumental convergence arguments suggest that unwanted behaviour can still be brought to the surface even as the unamplified model's propensity to act that way gets lower and lower. Maybe we can reduce these propensities faster than the increasing optimisation power brings them out (and indeed safety teams are attempting to achieve this), but that remains to be seen, and the amount of money/talent being directed into optimisation is much greater than the amount being directed into safety.
In particular, the tendencies that aren't removed by RLHF. RLHF is surprisingly effective, but it doesn't remove everything.
Is there any chance you could define what you mean by "open agency"? Do you essentially mean "distributed agency"?
Lots of fascinating points, however:
a) You raise some interesting points about how the inner character is more underdefined than people often realise, but I think it's also worth flagging that there's less of a void these days, given that a lot more effort is being put into writing detailed model specs.
b) I am less dismissive about the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario. However, I think you've neglected the potential for us to apply filtering to the training data. Whilst I don't think the solution will be that simple, I don't think the relation is quite as straightforward as you claim.
c) The discussion of "how do you think the LLMs feel about these experiments" is interesting, but it is also overly anthropomorphic. LLMs are anthropomorphic to a certain extent, having been trained on human data, but it is still mistaken to run a purely anthropomorphic analysis that doesn't account for other training dynamics.
d) Whilst you make a good point about how the artificiality of the scenario might be affecting the experiment, I feel you're being overly critical of some of the research into how models might misbehave. Single papers are rarely definitive, and often there's value in just showing a phenomenon exists in order to spur further research on it, which can explore a wider range of theories about mechanisms. It's very easy to say "oh, this is poor quality research because it doesn't address my favourite objection". I've probably fallen into this trap myself. However, the number of possible objections that could be made is often pretty large, and if you never published until you'd addressed everything, you'd most likely never publish.
e) I worry that some of your skepticism of the risks manages to be persuasive by casting vague aspersions that are disconnected from the actual strength of the arguments. You're like "oh, the future, the future, people are always saying it'll happen in the future", which probably sounds convincing to folks who haven't been following that closely, but it's a lot less persuasive if you know that we've been consistently seeing stronger results over time (in addition to a recent spike in anecdotes with the new reasoning models). This is just a natural part of the process: when you're trying to figure out how to conduct solid research in a new domain, of course it's going to take some time.
I think it's valuable for some people to say that it's a terrible idea in advance so they have credibility after things go wrong.
Whilst interesting, this post feels very assertive.
You claim that biological systems work by maintaining alignment as they scale. In what sense is this true?
You say that current methods lack a vision of a coherent whole. In what sense? There's something extremely elegant about pre-training to learn a world model, doing supervised learning to select a sub-distribution and using RL to develop past the human level. In what sense does this "lack a vision"?
I'm open to the possibility that we need to align a model as we make it more intelligent in order to prevent the agent sabotaging the process. But it's unclear from this article whether this is why you want alignment first, or whether it's for some other reason.
I really liked the analogy of taking actions, falling asleep, then waking up (possibly with some modifications) and continuing.
I was already aware of your main point, but the way you've described it is a much clearer way of thinking about this.
Recently, the focus of mechanistic interpretability work has shifted to thinking about "representations", rather than strictly about entire algorithms
Recently? From what I can tell, this seems to have been a focus from the early days (1, 2).
That said, great post! I really appreciated your conceptual frames.
Collapsible boxes are amazing. You should consider using them in your posts.
They are a particularly nice way of providing a skippable aside: for example, filling in background information, answering an FAQ, or including evidence to support an assertion.
Compared to footnotes, collapsible boxes are more prominent and are better suited to containing paragraphs or formatted text.
Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.
One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity and I agree that this is a very diffuse factor. At the same time, a less diffuse intervention would be to increase the wisdom of specific actors.
You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll quickly end up with AI systems too capable for us to control".
I agree. In fact, a key reason why I think this is important is that we can't afford to leave anything on the table.
One of the things I like about the approach of training AI advisors is that humans can compensate for weaknesses in the AI system. In other words, I'm introducing a third category of labour: human-AI cybernetic systems/centaur labour. I think this might widen the sweet spot, however, we have to make sure that we do this in a way that differentially benefits safety.
You do discuss the possibility of using AI to unlock enhanced human labour. Such centaur systems could also be classified under this designation.
More broadly, I think there's merit to the cyborgism approach, even if some of the arguments are less compelling in light of recent capabilities advances.
I think you missed: "Maybe we can reduce these propensities faster than the increasing optimisation power brings them out"
Regarding: "Unless you're posing a non-smooth model"
Why would the model be smooth when we're making all kinds of changes to how the models are trained and how we elicit them? As an analogy, even if I was bullish on Nvidia stock prices over the long term, it doesn't mean that even a major crash would necessarily falsify my prediction, as it could still recover.
My main disagreement is that I feel your certainty outstrips the strength of your arguments.