Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. And this isn't getting enough focus from the field.
On Carcinogenic Complexity, Software Senescence and Cognitive Provenance: Our roadmap for 2025 and beyond
It is mandatory to start any essay on AI in the post-ChatGPT era with the disclaimer that AI brings huge potential, and great risks. Unfortunately, on the path we are currently on, we will not realize those benefits, but are far more likely to simply drown in terrible AI slop, undermine systemic cybersecurity and blow ourselves up.
We believe AI on its current path will continue to progress exponentially, to the point where it can automate, and summarily replace, all of humanity. We are unlikely to survive such a transition.
Powerful technology always comes with powerful risks, but this does not mean we have to wait idly for the other shoe to drop. Risks can be managed and...
Hi habryka, I don't really know how best to respond to such a comment. First, I would like to say thank you for your well-wishes, assuming you did not mean them sarcastically. Maybe I have lost the plot, and if so, I do appreciate help in recovering it. Secondly, I feel confused as to why you would say such things in general.
Just last month, me and my coauthors released a 100+ page explanation/treatise on AI extinction risk that gives a detailed account of where AGI risk comes from and how it works, which was received warmly by LW and the general public al...
The most ambitious direction I've worked on in the past few years is a theory of hierarchical agency (as mentioned in the ACS announcement). Recently, I've become worried that the inferential distance between "the median of current AI safety discourse" and "what I and my colleagues at ACS work on" has become too large. After multiple attempts to write a good introduction to hierarchical agency ended up in perpetual draft form, I decided to try something else: explain it to Claude.
This is a lightly edited transcript of the first part of our conversation.
Claude
Shall we begin? Could you start by explaining what you mean by "hierarchical agency" and why you think it's important for AI safety?
Jan
Broadly, there is a pattern in the world where you can often see...
Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched, there is ongoing work on the topic.)
The reason I ask is because embedded agents and agents in multi-agent settings should need compositional world models that include models of themselves and other agents, which implies that hierarchical agency is included in what they would need to solve.
It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning theoretic AI, related to what she has cal...
This post comes a bit late with respect to the news cycle, but I argued in a recent interview that o1 is an unfortunate twist on LLM technologies, making them particularly unsafe compared to what we might otherwise have expected:
The basic argument is that the technology behind o1 doubles down on a reinforcement learning paradigm, which puts us closer to the world where we have to get the value specification exactly right in order to avert catastrophic outcomes.
Additionally, this technology takes us further from interpretability. If you ask GPT4 to produce a chain-of-thought (with prompts such as "reason step-by-step to arrive at an answer"), you know that in some sense, the natural-language reasoning you see in the output is how it...
Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.
One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into.
The ‘model organisms of misalignment’ line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he’s worked on at Anthropic under this agenda: “Sleeper Agents” and “Sycophancy to Subterfuge”.
Topics we discuss:
Suppose you’ve trained a really clever AI model, and you’re planning to deploy it in an agent scaffold that allows it to run code or take other actions. You’re worried that this model is scheming, and you’re worried that it might only need to take a small number of actions to get to a dangerous and hard-to-reverse situation like exfiltrating its own weights. Problems like these that the AI can cause with small numbers of actions are particularly hard to address with standard safety techniques, because by default you are unlikely to notice the problem until the hard-to-reverse outcome has already occurred.
One of the main classes of techniques that you could apply in the hope of preventing these catastrophic actions is what we call untrusted monitoring (introduced...
One aspect of untrusted monitoring which this post neglects, but which seems very important to me, is that we'll be uncertain about which model should be untrusted and the extent to which these models are untrusted.
This suggests that ensembling over untrusted monitors could be very promising (and much of the benefits of ensembling can be obtained without preventing collusion). Further, I expect that various types of ensembling make collusion harder even if all relevant models in the ensemble are scheming.
Ensembling untrusted monitors increases variance som...
Quick note: I occasionally run into arguments of the form "my research advances capabilities, but it advances alignment more than it advances capabilities, so it's good on net". I do not buy this argument, and think that in most such cases, this sort of research does more harm than good. (Cf. differential technological development.)
For a simplified version of my model as to why:
I would find this post easier to remember and link to if it were called "Serial vs Parallel Research Time", or something like that which points more at the particular insight the post provides.