I think it's valuable for some people to say that it's a terrible idea in advance so they have credibility after things go wrong.
Whilst interesting, this post feels very assertive.
You claim that biological systems work by maintaining alignment as they scale. In what sense is this true?
You say that current methods lack a vision of a coherent whole. In what sense? There's something extremely elegant about pre-training to learn a world model, doing supervised learning to select a sub-distribution and using RL to develop past the human level. In what sense does this "lack a vision"?
I'm open to the possibility that we need to align a model as we make it more intelligent in order to prevent the agent from sabotaging the process. But it's unclear from this article whether this is why you want alignment first or whether you have some other reason.
I really liked the analogy of taking actions, falling asleep then waking up (possibly with some modifications) and continuing.
I was already aware of your main point, but the way you've described it is a much clearer way of thinking about this.
Recently, the focus of mechanistic interpretability work has shifted to thinking about "representations", rather than strictly about entire algorithms
Recently? From what I can tell, this seems to have been a focus from the early days (1, 2).
That said, great post! I really appreciated your conceptual frames.
Collapsible boxes are amazing. You should consider using them in your posts.
They are a particularly nice way of providing a skippable aside. For example, filling in background information, answering an FAQ or including evidence to support an assertion.
Compared to footnotes, collapsible boxes are more prominent and are better suited to containing paragraphs or formatted text.
Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.
One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity and I agree that this is a very diffuse factor. At the same time, a less diffuse intervention would be to increase the wisdom of specific actors.
You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll quickly end up with AI systems too capable for us to control".
I agree. In fact, a key reason why I think this is important is that we can't afford to leave anything on the table.
One of the things I like about the approach of training AI advisors is that humans can compensate for weaknesses in the AI system. In other words, I'm introducing a third category of labour: human-AI cybernetic systems, or centaur labour. I think this is likely to widen the sweet spot; however, we have to make sure that we do this in a way that differentially benefits safety.
You do discuss the possibility of using AI to unlock enhanced human labour; such centaur systems could also be classified under that designation.
More broadly, I think there's merit to the cyborgism approach even if some of the arguments are less compelling in light of recent capability advances.
Lots of interesting ideas here, but the connection to alignment still seems a bit vague.
Is misalignment really a lack of sensitivity, as opposed to a difference in goals or values? It seems to me that an unaligned ASI is extremely sensitive to context, just in the service of its own goals.
Then again, maybe you see Live Theory as being more about figuring out what the outer objective should look like (broad principles that are then localised to specific contexts) rather than about figuring out how to ensure an AI internalises specific values. And I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.
That would make the domain of checkable tasks rather small.
That said, it may not matter depending on the capability you want to measure.
If you want to make the AI hack a computer to turn the entire screen green and it skips a pixel so as to avoid completing the task, it has still demonstrated that it possesses the dangerous capability, so it has no reason to sandbag.
On the other hand, if you are trying to see whether it has a capability that you wish to use, it can still sandbag.
Points for creativity, though I'm still somewhat skeptical about the viability of this strategy.
Lots of fascinating points. However:
a) You raise some interesting points about how the inner character is more underdefined than people often realise, but I think it's also worth flagging that there's less of a void these days, given that a lot more effort is being put into writing detailed model specs.
b) I am less dismissive of the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario; however, I think you've neglected the potential for us to apply filtering to the training data. Whilst I don't think the solution will be that simple, I also don't think the relation is quite as straightforward as you claim.
c) The discussion of "how do you think the LLMs feel about these experiments" is interesting, but it is also overly anthropomorphic. LLMs are anthropomorphic to a certain extent, having been trained on human data, but it is still a mistake to run a purely anthropomorphic analysis.
d) Whilst you make a good point about how the artificiality of the scenario might be affecting the experiment, I feel you're being overly critical of some of the research into how models might misbehave. Single papers are rarely definitive, and often there's value in simply showing that a phenomenon exists in order to spur further research, which can then explore a wider range of theories about mechanisms. It's very easy to say "oh, this is poor quality research because it doesn't address my favourite objection". I've probably fallen into this trap myself. However, the number of possible objections that could be made is often pretty large, and if you never published until you had addressed everything, you'd most likely never publish.
e) I worry that some of your skepticism of the risks manages to be persuasive by casting vague aspersions that are disconnected from the actual strength of the arguments. You're like "oh, the future, the future, people are always saying it'll happen in the future", which probably sounds convincing to folks who haven't been following that closely, but it's a lot less persuasive if you know that we've been consistently seeing stronger results over time (in addition to a recent spike in anecdotes with the new reasoning models). This is just a natural part of the process: when you're trying to figure out how to conduct solid research in a new domain, of course it's going to take some time.