Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched; there is ongoing work on the topic.)
The reason I ask is that embedded agents, and agents in multi-agent settings, will presumably need compositional world models that include models of themselves and of other agents, which implies that hierarchical agency is included in what they would need to solve.
It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning-theoretic AI, related to what she has called "Frugal Compositional Languages", and to this work by @alcatal - though I understand neither is yet addressing multi-agent world models, nor explicitly modeling the agents themselves in a compositional / embedded-agent way, though those are presumably desiderata.
The most ambitious direction I've worked on in the past few years is a theory of hierarchical agency (as mentioned in the ACS announcement). Recently, I've become worried that the inferential distance between "the median of current AI safety discourse" and "what I and my colleagues at ACS work on" has become too large. After multiple attempts to write a good introduction to hierarchical agency ended up in perpetual draft form, I decided to try something else: explain it to Claude.
This is a lightly edited transcript of the first part of our conversation.
Claude
Shall we begin? Could you start by explaining what you mean by "hierarchical agency" and why you think it's important for AI safety?
Jan
Broadly, there is a pattern in the world where you can often see agents composed of other agents. Like, corporations and their departments, states and citizens, and so on.
Claude
So you're looking at nested or layered structures of agency - where agents can contain or be made up of other agents. Let me test my understanding with some examples and their properties:
Would biological examples also fit this pattern? Like how a human body could be seen as an agent composed of organs (which have their own regulatory systems) and cells (which make their own decisions about things like protein production)? I'm curious about what makes this perspective particularly relevant for AI safety.
Jan
I hope to get to AI safety relevance later.
Yes, biological examples fit well - particularly multicellular organisms, which demonstrate emergent agency at multiple scales. To better define the boundaries of hierarchical agency, it might help to examine some edge cases and non-examples.
Claude
Let me probe the boundaries systematically by analyzing different types of systems:
Am I on the right track with these examples? What makes some collections of agents form a meaningful higher-level agent while others don't?
Jan
The question of what makes something a real superagent versus just a bunch of agents hanging out together is super interesting.
I don't have a satisfactory answer yet, but empirically, you can ask whether it makes sense to model the collective as an agent. I usually operationalize this using Dennett's three-stances perspective - do you know these?
Claude
Yes! If I recall correctly, Dennett proposed three stances for understanding/predicting behavior: the physical stance, the design stance, and the intentional stance.
So, if I'm understanding you correctly, you're suggesting that what makes something a "higher-level agent" is whether taking the intentional stance towards the collective is actually useful/predictive?
Let me test this understanding:
Jan
You get it. In thinking about hierarchical agency, I'm interested in systems with at least two layers where it is useful to think about both layers as agentic - let's say a superagent composed of subagents.
Now, having this concept, I want to make a bold claim: I think a critical missing piece for AI safety and AI alignment is that we lack a good mathematical formalism for describing these systems - at least as far as I know.
You would probably ask what I mean by "good", so I'll spell it out. My desiderata are:
Claude
Let me break this down to make sure I understand your criteria:
Type consistency across levels:
Could you confirm if this example captures what you mean:
Good: A formalism where both OpenAI (superagent) and employees (subagents) are represented by the same type of mathematical object
Not good: A formalism where individual workers are "agents" but their collective becomes a different mathematical type like "contract"
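For concreteness, here is a toy sketch of what "same mathematical type at every level" could look like - the names and the sum-of-utilities aggregation rule are arbitrary choices I'm making up for illustration, not something you've proposed:

```python
# Toy illustration only: "type consistency" = a collective is the same kind of
# object as its members. Names and the aggregation rule are made up.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Agent:
    name: str
    utility: Callable[[str], float]                      # toy stand-in for goals/beliefs
    parts: List["Agent"] = field(default_factory=list)   # empty for "atomic" agents

def collective(name: str, members: List[Agent]) -> Agent:
    """Build a superagent that has the same type as its members.
    The aggregation rule (here: sum of member utilities) is exactly the kind
    of choice a real formalism would have to make explicit."""
    return Agent(
        name=name,
        utility=lambda outcome: sum(m.utility(outcome) for m in members),
        parts=members,
    )

alice = Agent("Alice", lambda o: 1.0 if o == "ship_feature" else 0.0)
bob = Agent("Bob", lambda o: 1.0 if o == "fix_bugs" else 0.0)
lab = collective("Lab", [alice, bob])
print(type(lab) is type(alice))   # True: one mathematical type across levels
```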
Also, I'm curious - have you seen formalisms that satisfy some but not all of these criteria?
Jan
Actually, you have a better overview of all the mathematical formalisms ever invented: you have read them all, so maybe you can try to come up with some and check whether they fulfill the criteria?
Claude
Let me think about mathematical formalisms I'm aware of that might be relevant:
Game Theory:
Category Theory:
But: Abstracts away too much of the intentional content? Real-world relevance?
Classical Multi-agent Systems:
But: Usually flat structure, not hierarchical?
Markov Decision Processes:
But: More focused on actions/policies than beliefs/goals?
Economic Mechanism Design:
But: Different mathematical objects at different levels (agents vs mechanisms)
Am I missing any obvious formalisms? Would combining aspects of these get closer to what you're looking for?
Jan
A clearly related field is public choice. Less obviously, active inference.
Claude
Let's explore these:
Public Choice Theory:
But: Does it typically treat the collective level as the same type of agent as individuals?
Active Inference:
Has a unified mathematical framework across levels
This seems potentially promising!
Let me try to understand why active inference might be relevant here:
Am I getting warm? Could you explain what aspects of these frameworks you find relevant to hierarchical agency?
Also, this makes me wonder: are you looking for a formalism that could capture how different levels of agency might have different models/beliefs about the world and each other? Like how a corporation might have a different "view" of reality than its individual employees?
Jan
Yes. You are actually pretty good at this.
What I like about active inference in this context is that it tries to express everything as the same type - roughly, "beliefs". Also, in its more neuro-inspired versions, it assumes that bidirectional control/prediction flows are what happens in the human brain.
Yes: I want something which is expressive enough to allow different beliefs on different levels. Also, I want something that allows for "goal-oriented beliefs" or "goals" concerning other layers - for example, a corporation having an instrumental goal of its employees being more loyal.
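To gesture at what "goals as beliefs" could look like concretely, here is a toy sketch in the active-inference spirit - the outcome space, the numbers, and the choice of divergence are all made up for illustration: a goal is encoded as a preferred distribution over outcomes (here, over a subagent's state), and the superagent picks the policy whose predicted outcomes are closest to that preference.

```python
# Toy sketch: a "goal" expressed as a belief, i.e. a preferred distribution
# over outcomes concerning another layer. All numbers are made up.
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Outcome space: the subagent ("employee") ends up (loyal, disloyal).
preferred = np.array([0.9, 0.1])   # the corporation's goal, stated as a belief

# Predicted outcome distributions under two policies the superagent could adopt.
policies = {
    "coercion":  np.array([0.6, 0.4]),
    "fair_deal": np.array([0.85, 0.15]),
}

# Choose the policy whose predicted outcomes best match the preferred belief.
best = min(policies, key=lambda name: kl(policies[name], preferred))
print(best)   # -> "fair_deal" under these made-up numbers
```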
Given you get this, I think we can move to your previous question "how is this related to AI safety" - or should we clarify more?
Claude
Yes, I think I understand enough of the core concept to explore its relevance to AI safety.
Let me venture a guess about some potential connections to AI safety:
Am I on the right track?
Jan
Broadly the correct direction in the case of 1 and 2, but actually I want to describe a more ambitious vision.
The way I think about it is via a "game theory" metaphor.
Think about how humans understood cooperation and conflict before the mathematical formalization by von Neumann, Schelling, and others. There was lots of informal understanding scattered across diverse domains - military strategy books, gambling theory, board games, even ethical treatises. I suspect most people at the time didn't see this as an area where mathematics could be useful - sure, math worked great for physics, but concepts like cooperation or defection seemed like fuzzy human things. Yet I imagine a mathematically-minded person looking across these domains might have noticed something interesting: similar patterns of incomplete understanding showing up in contexts as different as warfare and board games, suggesting a deeper mathematical structure waiting to be discovered.
Claude
Let me see if I understand the parallel you're drawing:
Before game theory, understanding of strategic interaction was:
You see hierarchical agency in a similar position now:
And you're suggesting that just as game theory provided a unified mathematical framework that:
You're aiming for a similar mathematical unification of hierarchical agency?
Is this capturing your vision correctly? And if so, what do you see as the "similarly shaped holes" across different domains that your formalism would help fill?
Jan
Yes, you got this right. The "similarly shaped holes" are a bit of a rabbit hole we can spend a lot of time on. I'd like to connect this to AI safety in a way that hopefully makes the relevance clear.
One of the safety-relevant holes is "how AI deals with different objectives". You can imagine that in training the AI learns about different goals, and the result is that you can think of the AI as being composed of various "smaller AIs", tracking different things. For example, let's say part of the AI "cares about" or "predicts" or whatever you prefer ... answers oriented toward helpfulness to the human user. Another part is about "harmlessness". And so on. Further, you can think about the AI undergoing some process where these parts interact, and the "whole" evolves. For example, you can imagine yourself having these various drives, and preparing data for your own further training.
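As a cartoon of this - purely illustrative, with made-up objectives, scores and weights - you can picture subagents each scoring candidate answers against their own objective, with the behaviour of the "whole" coming from some aggregation rule. The point of a formalism for hierarchical agency would be to talk about that rule, and how it evolves, in the same vocabulary used for the parts:

```python
# Cartoon of an AI as a collection of subagents tracking different objectives.
# The objectives, scores and the aggregation rule are made-up placeholders.
candidates = ["detailed_answer", "refuse", "hedged_answer"]

subagents = {
    # score: candidate -> how well it serves this subagent's objective (0..1)
    "helpfulness":  {"detailed_answer": 0.9, "refuse": 0.1, "hedged_answer": 0.6},
    "harmlessness": {"detailed_answer": 0.4, "refuse": 0.9, "hedged_answer": 0.8},
    "honesty":      {"detailed_answer": 0.7, "refuse": 0.6, "hedged_answer": 0.8},
}

# One possible aggregation rule: a weighted sum over subagent scores.
weights = {"helpfulness": 1.0, "harmlessness": 1.0, "honesty": 1.0}

def whole_score(candidate):
    return sum(weights[s] * subagents[s][candidate] for s in subagents)

print(max(candidates, key=whole_score))  # -> "hedged_answer" with these numbers
```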
Claude
Let me try to unpack this through a few lenses to make sure I understand:
Internal structure of AI systems:
The whole system emerges from these interactions
So for example:
And these interact to produce behavior?
Training dynamics: If I understand correctly, you're suggesting that:
During training, different "parts" might develop to track different objectives
These parts then interact/evolve
The AI might even participate in shaping its own further training
Let me test my understanding with a concrete example:
When I (Claude) respond to a user, there might be different "subagents" considering:
And my responses emerge from some interaction between these?
Is this the kind of dynamic you're trying to formalize? And would your formalism help us understand:
Am I getting closer to what you mean by this particular "hole" that needs filling?
Jan
Yes, though a bit more so for points 2, 3 and 4. I hope the formalism may help with problems like
... what Richard Ngo calls "value systematization", or the problem I described in "Self-unalignment problem", where the AI faces the question of how to align with a self-unaligned human. Do you vaguely remember these?
Claude
I will try to recall and connect these ideas:
From what I remember, "Value systematization" refers to the challenge of how an AI system could/should resolve conflicts or contradictions in human values - not just implementing fixed values, but engaging in some kind of philosophical reasoning about values.
The "Self-unalignment problem" deals with how an AI should handle the fact that humans themselves often have conflicting values, aren't fully aligned with their own stated goals, and change over time.
So if I'm connecting this to hierarchical agency:
Would your formalism help by providing a way to:
Jan
Yes: my guess is a good formalism would help with all of these. In essence, I'm seeking a game theory-esque breakthrough for hierarchical agency — a simple yet powerful formalism that captures the essence of multi-layered intentional interactions.
Claude
So, you believe a good formalism for hierarchical agency could:
Would it be fair to say you're looking for the hierarchical agency equivalent of concepts like Nash equilibrium - simple but profound ideas that emerge once you have the right mathematical framework?
I'm curious - do you see any candidate concepts that might play this role? Things that seem like they might be fundamental to understanding hierarchical agency, but we lack the formal tools to properly analyze them yet?
Jan
Yes. I hope to return to the formal concepts later. Less formally, I think something like the holy grail here would be a formalization suitable for expressing concepts like "kindness".
I often think it makes sense to think about the individual human mind using the hierarchical agency frame. And, in this case, empirically, I would say people have different internal architectures. As a human, you (the superagent) can do different things with your internal conflicts. For example, we sometimes try to suppress or even delete some of our parts. Some people seem to have something like an "internal dictatorship", where a lot of their motivations or parts are suppressed. Personally, I strive for cooperative, win-win relationships, where my different desires "trust the whole".
Ideally, I would like something where we can express things like the superagent being kind to its subagents, or a dynamic of value evolution we would consider fair.
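To gesture very crudely at the difference - a toy illustration with made-up parts, options and numbers, not a proposed formalization of kindness - contrast a "dictatorial" aggregation rule, where the superagent simply maximizes one part's utility, with a Nash-bargaining-style rule that maximizes the product of each part's gains over its fallback:

```python
# Toy contrast between two ways a superagent can resolve internal conflict.
# Parts, options, utilities and fallbacks are made up for illustration.
import math

options = ["work_all_weekend", "rest", "balanced_plan"]

parts = {
    "ambition": {"utils": {"work_all_weekend": 1.0, "rest": 0.0, "balanced_plan": 0.7},
                 "fallback": 0.0},
    "health":   {"utils": {"work_all_weekend": 0.0, "rest": 1.0, "balanced_plan": 0.8},
                 "fallback": 0.0},
}

def dictatorship(ruler: str) -> str:
    """The superagent simply maximizes one part's utility."""
    return max(options, key=lambda o: parts[ruler]["utils"][o])

def nash_bargain() -> str:
    """Among options where every part does at least as well as its fallback,
    maximize the product of gains over fallbacks (a Nash-bargaining-style rule)."""
    feasible = [o for o in options
                if all(p["utils"][o] >= p["fallback"] for p in parts.values())]
    return max(feasible,
               key=lambda o: math.prod(p["utils"][o] - p["fallback"]
                                       for p in parts.values()))

print(dictatorship("ambition"))  # -> "work_all_weekend"
print(nash_bargain())            # -> "balanced_plan" with these numbers
```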
Claude
I'll try to unpack several layers here:
Internal Architecture Patterns:
Am I right that you see these as different ways the superagent-subagent relationship can be structured?
Values at Meta-Level:
If I understand correctly, you're suggesting we need a formalism that can express not just object-level values, but meta-level values about how different levels of agency should interact. Like:
This makes me think about therapeutic approaches to internal conflict. For example:
Would your formalism aim to capture what makes some of these approaches "healthier" or "more ethical" than others? And could this then inform how we think about AI systems managing their internal conflicts or interacting with humans?
Is this getting at what you mean by expressing values like "kindness" in the context of hierarchical agency?
Jan
Yes.
To be continued.
I'm interested in feedback about the format: is it useful? Would it be better if the conversation were more or less condensed? If Claude were simulating someone more or less informed?
This has various downsides, like difficulty in coordinating with other people, incomprehensibility to funders, etc.