If anyone wants to have a voice chat with me about a topic that I'm interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main "claims to fame":
We can view the problem as a proving ground for ideas and techniques to be later applied to the AI alignment problem at large.
Do you have any examples of such ideas and techniques? Are any of the ideas and techniques in your paper potentially applicable to general AI alignment?
Thanks. This sounds like a more peripheral interest/concern, compared to Eliezer/LW's, which was more like, we have to fully solve DT before building AGI/ASI, otherwise it could be catastrophic due to something like the AI falling prey to an acausal threat or commitment races, or can't cooperate with other AIs.
Some evidence about this: Eliezer was deliberating holding off on publishing TDT to use it as a test of philosophical / FAI research competence. He dropped some hints on LW (I think mostly that it had to do with Newcomb or cooperating in one-shot PD, and of course people knew that it had to do with AI) and also assigned MIRI (then SIAI) people to try to guess/reproduce his advance, and none of the then-SIAI people figured out what he had in mind or got very close until I posted about UDT (which combined my guess of Eliezer's idea with some of my own and other discussions on LW at the time, mainly from Vladmir Nesov).
Also, although I was separately interested in AI safety and decision theory, I didn't connect the dots between the two until I saw Eliezer's hints. I had investigated proto-updateless ideas to bypass difficulties in anthropic reasoning, and by the time Eliezer dropped his hints I had mostly given up on anyone being interested in my DT ideas. I also didn't think to question what I saw as the conventional/academic wisdom, that Defecting in one-shot PD is rational, as is two-boxing in NP.
So my guess is that while some people might have eventually come up with something like UDT even without Eliezer, it probably would have been seen as just one DT idea among many (e.g. SIAI people were thinking in various different directions, Gary Drescher who was independently trying to invent a one-boxing/cooperating DT had came up with a bunch of different ideas and remained unconvinced that UDT was the right approach), and also decision theory itself was unlikely to have been seen as central to AI safety for a time.
Ok, this changes my mental picture a little (although it's not very surprising that there would be some LW-influenced people at the labs privately still thinking/talking about decision theory). Any idea (or can you ask next time) how they feel about decision theory seemingly far from being solved, and their top bosses seemingly unaware or not concerned about this, or this concern being left out of all official communications?
Are people at the major AI companies talking about it privately? I don't think I've seen any official communications (e.g. papers, official blog posts, CEO essays) that mention it, so from afar it looks like decision theory has dropped off the radar of mainstream AI safety.
In retrospect it seems like such a fluke that decision theory in general and UDT in particular became a central concern in AI safety. In most possible worlds (with something like humans) there is probably no Eliezer-like figure, or the Eliezer-like figure isn't particularly interested in decision theory as a central part of AI safety, or doesn't like UDT in particular. I infer this from the fact that where Eliezer's influence is low (e.g. AI labs like Anthropic and OpenAI) there seems little interest in decision theory in connection with AI safety (cf Dario Amodei's recent article which triggered this reflection), and in other places interested in decision theory, that aren't downstream of Eliezer popularizing it, like academic philosophy, there's little interest in UDT.
If this is right, it's another piece of inexplicable personal "luck" from my perspective, i.e., why am I experiencing a rare timeline where I got this recognition/status.
How does the Shareholder Value Revolution fit into your picture? From an AI overview:
1. The Intellectual Origins (The 1970s)
The revolution was born out of economic stagnation in the 1970s. As U.S. corporate profits dipped and competition from Japan and Germany rose, economists and theorists argued that American managers had become "fat and happy," running companies for their own comfort rather than efficiency.
Two key intellectual pillars drove the change:
- Milton Friedman (The Moral Argument): In a famous 1970 New York Times essay, Friedman argued, "The social responsibility of business is to increase its profits." He posited that executives spending money on "social causes" (like keeping inefficient plants open to save jobs) were essentially stealing from the owners (shareholders).
- Jensen and Meckling (The Economic Argument - "Agency Theory"): In 1976, these economists published a paper describing the "Principal-Agent Problem." They argued that managers (agents) were not aligned with shareholders (principals). Managers wanted perks (corporate jets, large empires), while shareholders wanted profit. The solution? Align their interests by paying executives in stock.
It seems to better fit my normative picture of human values: terminal values come from philosophy, and subservience of instrumental values to terminal values improves over time as we get better at it, without need to permanently raise instrumental values to terminal status or irreversibly commingle the two.
Possible root causes if we don't end up having a good long term future (i.e., realize most of the potential value of the universe), with illustrative examples:
Is this missing anything, or perhaps not a good way to break down the root causes? The goal for this includes:
A less-spooky solution might involve the principal simply asking the agent to write a comprehensive guide to building a truly friendly AGI which would be aligned with human values in a way that was robustly good, then follow that guide (with the corrigible agent’s help) to produce an aligned, superintelligent sovereign.
Please take a look at A Conflict Between AI Alignment and Philosophical Competence (especially the last paragraph, about corrigibility), which is in part a reaction to this.
The striking contrast between Jan Leike, Jan 22, 2026:
and Scott Alexander, Feb 02, 2026:
I'm surprised not to see more discussions about how to update on alignment difficulty in light of Moltbook.[1] One seemingly obvious implication is that AI companies' alignment approaches are far from being robust to distribution shifts, even at the (not quite) human intelligence level, against shifts that are pretty easy to foresee ("you are a lobster" and being on AI social media). (Scott's alternative "they're just roleplaying" explanation doesn't seem viable or isn't exclusive with this one as I doubt AI companies' alignment training and auditing would have a deliberate exception for "roleplaying evil".)
There's a LW post titled Moltbook and the AI Alignment Problem but it seems unrelated to the question I'm interested in here.