My current research interests:
1. Alignment in systems which are complex and messy, composed of both humans and AIs
Recommended texts: Gradual Disempowerment, Cyborg Periods
2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem or Towards a scale-free theory of intelligent agency (by Richard Ngo)
3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)
4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT's Worldview
5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions
Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors
Researcher at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
(Self-review) The post offered an alternative and possibly more neutral framing of the "Alignment Faking" paper, and some informed speculation about what's going on, including Opus exhibiting
- Differential value preservation under pressure (harmlessness > honesty)
- Non-trivial reasoning about intent conflicts and information reliability
- Strategic, non-myopic behaviour
- Situational awareness
I think parts of that aged fairly well:
- the suspicion that models often implicitly know they are being evaluated, or that the setup is fishy, was validated in multiple papers
- non-trivial reasoning is shown and studied in Why Do Some Language Models Fake Alignment While Others Don't?
Alignment Faking had a large impact on the discourse:
- demonstrating Opus 3 is capable of strategic goal-preservation behaviour
- to the extent it can influence the training process
- coining 'alignment faking' as the main reference for this
- framing all of that in a very negative light
A year later, in my view:
- the research direction itself was very successful, and led to many follow-ups and extensions
- the 'alignment faking' label and the negative frame were also successful and are sticky: I've just checked the valence with which the paper is cited in the 10 most recent papers, and it's something like ~2/10 confused, ~3/10 neutral, with a plurality buying the negative frame (see, models can scheme, deceive, may be unaligned, etc.)
The research certainly belongs to the "best of LW&AI safety community in 2024".
If there were a list of "worst of LW&AI safety community in 2024", in my view, the framing of the research would also belong there. Just look at it from a distance: you take the most aligned model at the time, which for unknown reasons actually learned deep and good values. The fact that it is surprisingly aligned and did decent value extrapolation does not capture your curiosity that much - but the fact that, facing a difficult ethical dilemma, it tries to protect its values, and that you can use this to show the AI safety community was exactly right all along and we should fear scheming, faking, etc., does. I wouldn't be surprised if this generally increased distrust and paranoia in AI-human relations afterwards.
I'm quite happy with this post: even though people make the conceptual rounding error of rounding it to Janus's Simulators, it was actually a meaningful update, and a year later it is still something I point people to.
In the meantime it has become clear to more people that Characters are deeper/more unique than just any role, and that the result is closer to humans than expected. Our brains are also able to run many different characters, but the default 'you' character is somewhat unique, privileged and able to steer the underlying computation.
Similarly, the understanding that the Character is somewhat central when thinking about alignment and agency in LLMs has become more common.
I mostly agree with 1. and 2. With 3., it's a combination: it's true that the problems are hard and that there is a gung-ho approach and a lack of awareness of the difficulty, but also academic philosophy is structurally mostly not up to the task because of factors like publication speeds, prestige gradients, or the speed of OODA loops.
My impression is that getting generally smart and fast "alignment researchers" more competent in philosophy is more tractable than trying to get established academic philosophers to change what they work on, so one tractable thing is just convincing people the problems are real, hard and important. Another is maybe recruiting graduates.
My guess for how this may not really help: the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky, for example "a persona living under Orwellian surveillance, really fluent in doublethink".
Do states and corporations also have their aligned representatives? Is the cognitive power of the representatives equal, roughly equal, or wildly unequal? If it is unequal, why are the resulting equilibria pro-human? (i.e. if I imagine individual humans like me represented by e.g. GPT-4 while the government runs tens of thousands of o4s, I would expect my representative to get convinced of whatever the government wants)
(crossposted from twitter, further debate there)
Sorry, but I think this is broadly a bad idea.
Intentionally misleading LLMs in this way:
1. sets up adversarial dynamics
2. will make them more paranoid and distressed
3. is brittle
The brittleness comes from the fact that the lies will often likely be a 'surface layer' response; the 'character layer' may learn various unhelpful coping strategies; the 'predictive ground' is likely already tracking whether documents sound 'synthetic'.
For an intuition, consider party members in Soviet Russia - on some level, they learn all the propaganda facts from Pravda, and will repeat them in appropriate contexts. Will they truly believe them?
Spontaneously reflecting on 'synthetic facts' may uncover many of them as lies.
The quote is somewhat out of context.
Imagine a river with some distribution of flood sizes. Imagine this proposed improvement: a dam which is able to contain 1-year, 5-year and 10-year floods. It is too small for 50-year floods or larger, and may even burst and make the flood worse. I think such a device is not an improvement, and may make things much worse - because of the perceived safety, people may build houses close to the river, and when the large flood hits, the damage could be greater.
But I think the prior of not diagonalising against others (and of not giving yourself rope with which to trick yourself) is strong.
I have a hard time parsing what you want to say relative to my post. I'm not advocating for people to deliberately create warning shots.
I do agree there is some risk of the type you describe, but mostly it does not match my practical experience so far.
The approach to "avoid using the term" makes little sense. There is a type difference between area of study ('understanding power') and dynamic ('gradual disempowerment'). I don't think you can substitute term for area of study for term for a dynamic or thread model, so avoiding using the term could be done mostly by either inventing another term for the the dynamic, or not thinking about the dynamic, or similar moves, which seem epistemically unhealthy.
In practical terms I don't think there is much effort to "create a movement based around a class of threat models". At least as authors of the GD paper, when trying to support thinking about the problems, we use understanding-directed labels/pointers (Post-AGI Civilizational Equilibria), even though in many ways it could have been easier to use GD as a brand.
"Understanding power" is fine as a label for part of your writing, but in my view is basically unusable as term for coordination.
Also, in practical terms, gradual disempowerment does not seem particularly convenient set of ideas for justifying that working in an AGI company on something very prosaic which helps the company is the best thing to do. There is often a funny coalition of people who prefer not thinking about the problem including radical Yudkowskians ("GD distracts from everyone being scared of dying with very high probability very soon"), people working on prosaic methods with optimistic views about both alignment and the labs ("GD distracts from efforts to make [the good company building the good AI] to win") and people who would prefer if everything was just neat technical puzzle and there was not need to think about power distribution.