This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.
On Carcinogenic Complexity, Software Senescence and Cognitive Provenance: Our roadmap for 2025 and beyond
It is mandatory to start any essay on AI in the post-ChatGPT era with the disclaimer that AI brings huge potential, and great risks. Unfortunately, on the path we are currently on, we will not realize those benefits, but are far more likely to simply drown in terrible AI slop, undermine systemic cybersecurity and blow ourselves up.
We believe AI on its current path will continue to progress exponentially, to the point where it can automate, and summarily replace, all of humanity. We are unlikely to survive such a transition.
Powerful technology always comes with powerful risks, but this does not mean we have to wait idly for the other shoe to drop. Risks can be managed and...
This essay seems to have lost the plot of where the problems with AI come from. I was historically happy that Conjecture focused on the parts of AI development that are really obviously bad, like having a decent chance of literally killing everyone or permanently disempowering humanity, but instead this seems like it's a random rant against AI-generated art, and name-calling of obviously valuable tools like AI coding assistants .
I am not sure what happened. I hope you find the plot again.
The most ambitious direction I've worked on in the past few years is a theory of hierarchical agency (as mentioned in the ACS announcement). Recently, I've become worried that the inferential distance between "the median of current AI safety discourse" and "what I and my colleagues at ACS work on" has become too large. After multiple attempts to write a good introduction to hierarchical agency ended up in perpetual draft form, I decided to try something else: explain it to Claude.
This is a lightly edited transcript of the first part of our conversation.
Claude
Shall we begin? Could you start by explaining what you mean by "hierarchical agency" and why you think it's important for AI safety?
Jan
Broadly, there is a pattern in the world where you can often see...
Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched, there is ongoing work on the topic.)
The reason I ask is because embedded agents and agents in multi-agent settings should need compositional world models that include models of themselves and other agents, which implies that hierarchical agency is included in what they would need to solve.
It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning theoretic AI, related to what she has cal...
This post comes a bit late with respect to the news cycle, but I argued in a recent interview that o1 is an unfortunate twist on LLM technologies, making them particularly unsafe compared to what we might otherwise have expected:
The basic argument is that the technology behind o1 doubles down on a reinforcement learning paradigm, which puts us closer to the world where we have to get the value specification exactly right in order to avert catastrophic outcomes.
Additionally, this technology takes us further from interpretability. If you ask GPT4 to produce a chain-of-thought (with prompts such as "reason step-by-step to arrive at an answer"), you know that in some sense, the natural-language reasoning you see in the output is how it...
Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.
One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into.
The ‘model organisms of misalignment’ line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he’s worked on at Anthropic under this agenda: “Sleeper Agents” and “Sycophancy to Subterfuge”.
Topics we discuss:
Suppose you’ve trained a really clever AI model, and you’re planning to deploy it in an agent scaffold that allows it to run code or take other actions. You’re worried that this model is scheming, and you’re worried that it might only need to take a small number of actions to get to a dangerous and hard-to-reverse situation like exfiltrating its own weights. Problems like these that the AI can cause with small numbers of actions are particularly hard to address with standard safety techniques, because by default you are unlikely to notice the problem until the hard-to-reverse outcome has already occurred.
One of the main classes of techniques that you could apply in the hope of preventing these catastrophic actions is what we call untrusted monitoring (introduced...
One aspect of untrusted monitoring which this post neglects, but which seems very important to me, is that we'll be uncertain about which model should be untrusted and the extent to which these models are untrusted.
This suggests that ensembling over untrusted monitors could be very promising (and much of the benefits of ensembling can be obtained without preventing collusion). Further, I expect that various types of ensembling make collusion harder even if all relevant models in the ensemble are scheming.
Ensembling untrusted monitors increases variance som...
Quick note: I occasionally run into arguments of the form "my research advances capabilities, but it advances alignment more than it advances capabilities, so it's good on net". I do not buy this argument, and think that in most such cases, this sort of research does more harm than good. (Cf. differential technological development.)
For a simplified version of my model as to why:
I would find this post easier to remember and link to if it were called "Serial vs Parallel Research Time", or something like that which points more at the particular insight the post provides.