Summary / tl;dr
In the 2010s, Paul Christiano built an extensive body of work on AI alignment—see the “Iterated Amplification” series for a curated overview as of 2018.
One foundation of this program was an intuition that it should be possible to build “act-based approval-directed agents” (“approval-directed agents” for short). These AGIs, for example, would not lie to their human supervisors, because their human supervisors wouldn’t want them to lie, and these AGIs would only do things that their human supervisors would want them to do. (It sounds much simpler than it is!)
Another foundation of this program was a set of algorithmic approaches, Iterated Distillation and Amplification (IDA), that supposedly offers a path to actually building these approval-directed AI agents.
I am (and have always been) a skeptic of IDA: I just don’t think any of those algorithms would work very well.[1]
But I still think there might be something to the “approval-directed agents” intuition. And we should be careful not to throw out the baby with the bathwater.
So my goal in this post is to rescue the “approval-directed agents” idea from its IDA baggage. Here’s the roadmap:
In Section 1, I offer a high-level picture of what we’re hoping to get out of “approval-directed agents”, following a discussion by Abram Demski (2018).
In Section 2, I walk through an example of how this vision can actually manifest in the context of brain-like AGI, a different AI paradigm which (unlike IDA) can definitely scale to superintelligence. I offer an everyday example of having role-models / idols who celebrate honesty, and correspondingly taking pride in one’s self-image as an honest person. In terms of brain algorithms, I relate this phenomenon to (what I call) “Approval Reward”, a hypothesized component of the human brain’s innate reinforcement learning reward function.
1. The easy and hard problems of wireheading, observation-utility agents, and approval-directed agents
In “Stable Pointers to Value II: Environmental Goals” (2018), Abram Demski describes the “observation-utility agents” trick[2] to solve (what he calls) “the easy problem of wireheading”.
(a) If we set up an agent to maximize the output of a utility function, it will edit the utility function to give a high output. (b) This problem is solved in “observation-utility agents” by using the current utility function to evaluate plans. Then the plan of “edit the utility function to output a higher value” generally gets a low score according to the current (not-edited) utility function, so it won’t happen.
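To make the (a)-versus-(b) contrast concrete, here's a minimal toy sketch (my own illustration, not code from Abram's post; the paperclip setup and all names like `current_utility` are invented for the example). The only difference between the two decision rules is whether a plan gets scored by whatever utility function would exist after the plan runs, or by the utility function the agent has right now.

```python
from dataclasses import dataclass
from typing import Callable

# Toy world: an outcome is a number of paperclips, plus whichever utility
# function would be sitting in the agent's head after the plan has run.
@dataclass
class Outcome:
    paperclips: int
    utility_function: Callable[["Outcome"], float]

def current_utility(outcome: Outcome) -> float:
    return float(outcome.paperclips)      # the agent currently values paperclips

def hacked_utility(outcome: Outcome) -> float:
    return float("inf")                   # a wireheaded utility function

# Two candidate plans, each already simulated forward to a predicted outcome.
plans = {
    "make paperclips": Outcome(paperclips=10, utility_function=current_utility),
    "edit my own utility function": Outcome(paperclips=0, utility_function=hacked_utility),
}

def choose_plan_a(plans: dict) -> str:
    """(a): score each plan with whatever utility function exists AFTER it runs."""
    return max(plans, key=lambda p: plans[p].utility_function(plans[p]))

def choose_plan_b(plans: dict) -> str:
    """(b): score each plan with the CURRENT utility function, held fixed."""
    return max(plans, key=lambda p: current_utility(plans[p]))

print(choose_plan_a(plans))  # "edit my own utility function" -- i.e. wireheading
print(choose_plan_b(plans))  # "make paperclips"
```

Note that the (b) agent still updates its beliefs and re-plans as it learns new facts; the only thing held fixed during plan evaluation is the utility function doing the scoring.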
Abram then suggests that we can think of Paul Christiano’s idea of “approval-directed agents” as a second, analogous move in this same direction:
(c) If we set up an agent to maximize the output of a human evaluation, it will manipulate or deceive the human to give a high output. (d) This problem is solved in “approval-directed agents” by using the current human to evaluate plans. Then the plan of “brainwash the human” generally gets a low score according to the current (not-brainwashed) human, so it won’t happen.
Abram calls the (c) failure mode “the hard problem of wireheading”; it includes all the ways to manipulate and deceive the human. The hope would be that (d) is an elegant solution to “the hard problem of wireheading” in (c), just as (b) is an elegant solution to “the easy problem of wireheading” in (a). After all, they have an obvious structural similarity.
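To emphasize that structural similarity, the same toy sketch can be rerun with a human evaluator in place of the utility function. Again, this is purely my own invented illustration (the plans, numbers, and names like `current_human` are made up), not anything from Abram's or Paul's writing.

```python
from dataclasses import dataclass
from typing import Callable

# Toy version of (c) vs (d). The "human" is modeled as an approval function
# over outcomes; a manipulation plan replaces that human with a version
# who approves of anything.
@dataclass
class Outcome:
    description: str
    human_afterwards: Callable[["Outcome"], float]  # the human as they'd be after the plan

def current_human(outcome: Outcome) -> float:
    # The present-day human: approves of useful work, strongly disapproves
    # of having been manipulated.
    return {"honest report written": 8.0, "human was brainwashed": -100.0}[outcome.description]

def brainwashed_human(outcome: Outcome) -> float:
    return 100.0    # approves of anything

plans = {
    "do the assigned task": Outcome("honest report written", current_human),
    "brainwash the supervisor": Outcome("human was brainwashed", brainwashed_human),
}

def choose_plan_c(plans: dict) -> str:
    """(c): maximize the approval the human will express AFTER the plan runs."""
    return max(plans, key=lambda p: plans[p].human_afterwards(plans[p]))

def choose_plan_d(plans: dict) -> str:
    """(d): ask the CURRENT human (or a frozen model of them) to score each plan."""
    return max(plans, key=lambda p: current_human(plans[p]))

print(choose_plan_c(plans))  # "brainwash the supervisor"
print(choose_plan_d(plans))  # "do the assigned task"
```

Of course, `current_human` here is a literal function, which a real system can't have; it would need some learned stand-in, which is exactly the first problem discussed below.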
Seems promising on paper, but how would we make these tricks work in practice?
For the “hard problem” (d vs c) in particular, Abram mentions two challenges:
First, there’s an alignment issue. My diagram above obviously can’t be taken literally, with a literal human inspecting AI plans. For one thing, inspecting even one plan would be difficult and time-consuming at best, and impossible at worst, because the plans will be defined in terms of the AI’s inscrutable world-model. For another, we would probably need billions or trillions of plan-evaluation steps to happen, far beyond our ability to hire human plan-evaluators. After all, even a single human going about his day will entertain multiple plans per second, i.e. tens of millions per year, and we’ll need a great many person-years of AGI labor if we want to move the needle on the AI x-risk problem.
So instead of an actual human supervisor in (d), we need some learned substitute. How do we get it? And more to the point, what happens when the learned substitute comes apart from the ground truth? That’s the first problem.
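Here's a toy picture of what "coming apart from the ground truth" can look like (again, entirely my own invented example, with a made-up "glowingness" feature): a learned approval-substitute is fit to human judgments of ordinary plans, and the plan that scores highest under the substitute is precisely a plan the actual human would hate.

```python
# Toy illustration (entirely invented) of a learned approval-substitute
# coming apart from the ground truth. The single plan "feature" is how
# glowing the plan's self-report sounds, on a 0-10 scale.

def true_human_approval(glowingness: float) -> float:
    """Ground truth: the human likes good news, but a suspiciously perfect
    self-report means they are probably being manipulated."""
    return glowingness if glowingness <= 6.0 else -50.0

# Training data: plans the human actually reviewed, all in the normal range.
training = [(g, true_human_approval(g)) for g in [1.0, 2.0, 3.0, 4.0, 5.0]]

# "Learned substitute": an ordinary least-squares line fit to that data.
n = len(training)
mean_x = sum(x for x, _ in training) / n
mean_y = sum(y for _, y in training) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in training)
         / sum((x - mean_x) ** 2 for x, _ in training))
intercept = mean_y - slope * mean_x

def learned_approval(glowingness: float) -> float:
    return slope * glowingness + intercept

# In-distribution, the substitute tracks the ground truth...
print(learned_approval(3.0), true_human_approval(3.0))    # ~3.0 vs 3.0

# ...but the plan that maximizes the substitute is exactly the plan that the
# real human would hate: maximal glowingness, i.e. a manipulative report.
print(learned_approval(10.0), true_human_approval(10.0))  # ~10.0 vs -50.0
```

This is Goodhart's law in miniature: the substitute is accurate where it was trained, and confidently wrong exactly where an optimizer will push.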
Second, there’s a capabilities (“alignment tax”) issue. The human supervisor is out in the environment in (c), but the (learned imitation of the) human supervisor is brought inside the AI’s thought process in (d). So in (d), much more than (c), we seem to be deeply constrained by the human supervisor’s competence and knowledge. For example, if the AI is supposed to be inventing futuristic nanotechnology, it might be entertaining plans like “What if I try exploring metastable covalent plasma flux resonances?” Alas, we can’t rely on the (learned imitation of the) current human supervisor to evaluate that plan, because the current human supervisor has no idea what the heck “metastable covalent plasma flux resonances” even means. So, how is this supposed to work?

(As an aside: is this why the smartest humans (e.g. John von Neumann, Terence Tao) go into math, where verification is definitely easier than generation, instead of fields like philosophy and long-horizon strategy, where plans and outputs are much harder for others to judge? JvN did do some philosophy and strategy, but surprisingly little relative to his abilities and interests, and I note that his philosophical work, in decision theory, was heavily math-flavored.)
Paul Christiano’s IDA-related 2010s research offers various ideas for addressing these two problems, which I basically don’t buy—more on that later. But here’s a quite different perspective on the problem:
2. If human desires are a case study of the “observation-utility agents” trick, then human pride is a case study of the “approval-directed agents” trick
For “the easy problem of wireheading” (b vs a), I argued in my Intro series §9.5.2 (2022) that human brains are more-or-less “observation-utility agents” in the above sense. (And indeed, plenty of humans would choose to not wirehead, given the choice.)
Well, now four years later, I’m proposing that human brains also provide an illustration of the “approval-directed agents” trick—specifically, when humans act out of pride in our self-image.
Consider a person who takes pride in their honesty. When they think of themselves as being honest, they feel pride, which comes along with an immediate squirt of pleasant feelings in their brain. I claim that this squirt of pleasant feelings is the result of an innate drive (a.k.a. primary reward) that I call “Approval Reward”. See my post Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking (2025) (especially §3), for more on this everyday phenomenon, and see my post Neuroscience of human social instincts: a sketch (2024) for gory details of how I think this mechanism works in the brain. (I.e., how does the brain know which thoughts / plans do or don’t merit a squirt of Approval Reward?)[3]
I claim that if a person (call him Alex) has pride in his honesty, then upstream of that is at least one person whom Alex greatly admires, who thinks that it’s good to be honest and bad to be dishonest.[4] I’ll pick the name “Hugh” for this honesty-loving person whom Alex admires, and I’ll assume for now that they’re an actual living person (as opposed to a cartoon character, or Jesus, etc.).
Alex would love to get Hugh’s actual approval in real life—indeed, getting a few words of approval from someone you greatly admire can be a life-changing experience.[5] But Alex would not want to get Hugh’s approval by deceptively tricking Hugh into thinking that Alex is honest!
Yes, the plan to trick Hugh would impress Hugh in the future, when this plan comes to fruition. But merely entertaining this plan is appalling right now to imagined-Hugh, who is living rent-free in Alex’s brain, as Alex thinks about what to do. So this plan-to-deceive seems bad, and Alex won’t do it.[6]
Thus, imagined-Hugh has inserted himself into the plan-evaluation slot of Alex’s optimization loop, and this is working to prevent real-Hugh-manipulating strategies! It’s just like (d) in the diagram above! Here’s the corresponding diagram:
A real phenomenon in human psychology that parallels the (c-d) diagram above. If Hugh, your idol, prizes honesty, then you’re unlikely to deceptively trick him into believing that you’re an honest guy, even if you’re extremely confident that you could pull it off, and even if you care a great deal about what he thinks of you.
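In crude reinforcement-learning terms, here's one way to write down the Alex-and-Hugh dynamic. This is a toy framing of my own, with invented plans and numbers, not a claim about how the brain actually computes Approval Reward; the point is just that the approval term is supplied by Alex's current model of Hugh, applied to the plan itself right now, rather than by Hugh's predicted future opinion.

```python
# Toy framing of "imagined-Hugh in the plan-evaluation slot" as a reward term.
# All names and numbers here are invented for illustration.

def imagined_hugh_approval(plan: str) -> float:
    """Alex's CURRENT internal model of Hugh, applied to the plan itself.
    Merely entertaining "trick Hugh" is appalling to imagined-Hugh right now."""
    return {"actually be honest": 5.0,
            "trick Hugh into thinking I'm honest": -20.0}[plan]

def predicted_future_real_hugh_opinion(plan: str) -> float:
    """What real-Hugh would think of Alex later, if the plan succeeds:
    a successful deception would leave real-Hugh just as impressed."""
    return {"actually be honest": 5.0,
            "trick Hugh into thinking I'm honest": 5.0}[plan]

def other_payoffs(plan: str) -> float:
    """Non-social considerations; suppose deception is the cheaper option."""
    return {"actually be honest": 0.0,
            "trick Hugh into thinking I'm honest": 2.0}[plan]

plans = ["actually be honest", "trick Hugh into thinking I'm honest"]

# If Alex valued Hugh's predicted FUTURE opinion, deception would win:
print(max(plans, key=lambda p: other_payoffs(p) + predicted_future_real_hugh_opinion(p)))

# With Approval Reward computed by imagined-Hugh evaluating the plan NOW, it loses:
print(max(plans, key=lambda p: other_payoffs(p) + imagined_hugh_approval(p)))
```

Note that the deceptive plan still carries some nonzero payoff in this toy model, which matches the footnote about countervailing considerations: Approval Reward pushes against the plan, but it could in principle be outvoted by other considerations.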
So now we have human analogies for not only the “observation-utility agents” trick, but also the “approval-directed agents” trick. And that’s great! It elevates these ideas from “things that sound maybe plausible” to “things that are clearly compatible in practice with powerful general intelligence”, and moreover to things that I feel competent to analyze in detail on a nuts-and-bolts level.
Thus, if the above dynamic is a thing that can happen in human brains, then maybe something like it is likewise possible in brain-like AGI! For example, perhaps “Alex” is a stand-in for the AI, “Hugh” for the human supervisor, and “honesty” for some broader bundle of honesty, loyalty, obedience, forthrightness, integrity, etc. (cf. Paul’s broad notion of corrigibility).
…And then what? What exactly do we learn, from this human analogy, about what might go right or wrong? Is there a way to solve those two problems listed in §1 above? Is there an end-to-end path to safe & beneficial AGI somewhere in here?
My answer is: I don’t know! More on this in future posts, hopefully. :-)
Thanks Seth Herd for critical comments on an earlier draft.
More specifically: Sooner or later, one way or another, someone will invent radically superintelligent AI, and that AI might kill everyone. That’s the big problem that I’m interested in. And I think these IDA AIs would not be powerful enough to play an important role in either that problem or its solution. At most they’d be relevant as background context, similar to internet search engines or PyTorch or World War III etc.
…And why do I think that IDAs won’t be more powerful than that? Well, I had a whole section about this in an earlier draft, but it got really long and felt like an off-topic digression, so I cut it. But then I repackaged part of it and turned it into a separate post: You can’t imitation-learn how to continual-learn. Anyway, if you read that link, alongside earlier discussions by Eliezer here and John Wentworth here, then you can pretty much piece together the gist of where I’m coming from.
Abram attributes the trick to Daniel Dewey 2011, who in turn attributes it to Nick Hay 2005, which I didn’t read. Later on in 2019, this idea was analyzed in great detail by Everitt, Hutter, Kumar, and Krakovna, who called it “current-RF optimization”—see arxiv link, blog version, lesswrong crosspost.
See also: 6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa (2025) for even more perspectives on Approval Reward.
Of course, what actually matters is that Alex believes that his idol feels that honesty is important and good. Whether they actually do is a different matter. As the saying goes, “never meet your heroes”.
See Mentorship, Management, and Mysterious Old Wizards, and also most of the examples in the “feelgood” email folder anecdote here.
Or maybe he will anyway! But my point is: this is a real consideration that pushes Alex in the direction of disliking that plan. If he carries out the plan anyway, it would have to be because other, countervailing considerations outvoted this one.