Wikitag Dashboard — AI Alignment Forum

Evidential Decision Theory – EDT – is a branch of decision theory which advises an agent to take the action which, conditional on its being taken, maximizes the chances of the desired outcome. Like any branch of decision theory, it prescribes taking the action that maximizes utility: the action whose utility equals or exceeds the utility of every other option. The utility of each action is measured by its expected utility, the probability-weighted sum of the utilities of its possible results.

One way to see the difference between evidential utility and causal utility is to contemplate the contrasting sentences:

If Lee Harvey Oswald didn't shoot John F. Kennedy, someone else did. [Evidential conditional.]
If Lee Harvey Oswald hadn't shot John F. Kennedy, nobody else would've. [Counterfactual conditional.]

Causal Decision Theory – CDT – says that one can influence the chances of the desired outcome only through a causal process ¹. LessWrong's favored "logical decision theory / functional decision theory" is usually, though not always, seen by LWers as a special case of CDT in this sense. Eg, FDT evaluates, "If Lee Harvey Oswald's algorithm hadn't output 'Shoot John F. Kennedy', nobody else would've." This is still a causal counterfactual, only evaluated on a logical proposition instead of a physical event.

EDT, on the other hand, requires no causal connection. The action only has to be Bayesian evidence for the desired outcome. So EDT is widely criticized as favoring auspiciousness over causal efficacy²: "an irrational policy of managing the news".


A standard example of where EDT and CDT/FDT diverge is said to be the Smoking lesion, but as this is needlessly confusing one may wish to consider the Toxoplasmosis dilemma instead:

Mice infected with Toxoplasma gondii become less scared of cats, and infected mice being eaten by cats is favorable to the parasite's lifecycle. Suppose that early experiments suggesting that humans infected with Toxoplasma gondii are likewise more fond of cats had replicated. Suppose furthermore (going into the realms of thought experiment) that of people who choose to pet cats given a chance, 20% are found to have latent toxoplasmosis, vs 10% of those not choosing to pet cats.
You are offered a cute cat, guaranteed to itself be free of toxoplasmosis or other diseases you could contract by petting it; there is no way that petting the cat can cause you to contract toxoplasmosis. However,

...
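
The divergence in the Toxoplasmosis dilemma can be made concrete with a small calculation. What follows is only a sketch under stated assumptions: the 20% and 10% figures come from the text above, while the utilities (`U_PETTING`, `U_TOXO`) and the share of people who pet cats (`P_PET_IN_POPULATION`) are hypothetical numbers chosen for illustration. CDT is modeled here, as a simplification, by giving both actions the same infection probability, since petting cannot cause infection.

```python
# Toy comparison of EDT and CDT on the Toxoplasmosis dilemma.

# From the text: P(latent toxoplasmosis | action chosen).
P_TOXO_GIVEN_ACTION = {"pet": 0.20, "refrain": 0.10}

# Hypothetical values, NOT from the source text.
U_PETTING = 1.0            # small pleasure from petting the cat
U_TOXO = -1000.0           # large cost of having toxoplasmosis
P_PET_IN_POPULATION = 0.5  # assumed share of people who choose to pet


def utility(action, has_toxo):
    """Total utility of one outcome: petting bonus plus infection cost."""
    return (U_PETTING if action == "pet" else 0.0) + (U_TOXO if has_toxo else 0.0)


def expected_utility(action, p_toxo):
    """Probability-weighted sum of utilities over the two possible outcomes."""
    return p_toxo * utility(action, True) + (1.0 - p_toxo) * utility(action, False)


def edt_value(action):
    """EDT conditions on the action as if it were observed evidence."""
    return expected_utility(action, P_TOXO_GIVEN_ACTION[action])


def prior_p_toxo():
    """Population-wide infection rate implied by the assumed petting share."""
    p_pet = P_PET_IN_POPULATION
    return (p_pet * P_TOXO_GIVEN_ACTION["pet"]
            + (1.0 - p_pet) * P_TOXO_GIVEN_ACTION["refrain"])


def cdt_value(action):
    """CDT: the act cannot cause infection, so its probability is held fixed."""
    return expected_utility(action, prior_p_toxo())


def best_action(value_fn):
    """Return the action with the highest value under the given theory."""
    return max(P_TOXO_GIVEN_ACTION, key=value_fn)


if __name__ == "__main__":
    for name, fn in (("EDT", edt_value), ("CDT", cdt_value)):
        print(name, {a: fn(a) for a in P_TOXO_GIVEN_ACTION}, "->", best_action(fn))
```

Under these assumed numbers EDT declines to pet the cat, because petting is evidence of infection, while CDT pets it, because petting changes nothing about whether one is already infected.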


Son-of-CDT is not equivalent to an actual grasp of LDT in its usefulness. Eg, Son-of-CDT will lose any precommitment battles against an opponent with a more sophisticated grasp of logical decision theory. The wiser opponent trying to extort Son-of-CDT will compute that it would be profitable to have always timelessly had a policy of attacking Son-of-CDT if Son-of-CDT tries to refuse extortion; and then Son-of-CDT, calculating the effect if it had precommitted to refuse extortion at its magic moment, will find that it expects to just end up being attacked that way. Or equivalently: The original CDT agent that self-modifies to Son-of-CDT, being a poor naive CDT agent with no grasp of how LDTers do things among themselves, will naively imagine the causal result of self-modifying right now to refuse LDT agent extortion attempts in the future; and will find in its accurate extrapolation of the LDT agents that the LDT agents have already adopted / would adopt / have the timeless output of adopting policies of responding to an extortion-refusing disposition adopted at the magic moment with aggression. So the original CDT agent would calculate that self-modification as being net negative, and would not build it into Son-of-CDT.

An entity converting over from an informal adherence to CDT, acquiring a grasp of LDT-style thinking in general, might look over this situation and say, "Well, screw those hypothetical attackers if they show up, I'll just pay the higher cost to attack back; otherwise they'd just be doing that because they'd computed they expect me to yield if they would attack even had I tried precommitting otherwise." But the formal theory of CDT taken completely at face value and literally, self-modifies in a way that is not so wise. Son-of-CDT preserves its inelegant magic moment rather than doing away with it entirely as one would upon really grasping LDT. By token of that same inelegance and lack of true understanding, in precommitment races, "timeless" beats "(physically influenced by physically observing me after) 7:13pm on August 14th, 2027". Among properly actually-rational agents, one would expect, though no one has proved it, that these precommitment races would resolve to an equilibrium of no extortion among themselves. Self-modified former-literal-CDT agents with a magic moment built into their code will find, when they imagine themselves trying to enter into that equilibrium, that they had seemingly already lost before their magic moment of precommitment. The literal Son-of-CDT agent following the literal rules that would be built by a literal CDT agent would imagine the other Minds thinking through everything the Son-of-CDT agent is doing wrong by having not fully converted over to LDT, and be unmoved by this, since it would formally find that precommitting to be a different sort of entity at the magic moment had no better treatment. To even begin to start thinking through precommitment races, you would have to throw away the giant inelegant magic moment messing up your grasp of the normative principle.

This is one argument demonstrating that a literal formal CDT agent attaining reflective equilibrium via converting to literal actual Son-of-CDT would not thereby grasp all of LDT's desiderata or rewards.


Many "rare" LLM behaviours are known if you're in the know (e.g. Gemma/Gemini acting weird around dates after their training cutoff) but aren't immediately apparent if you're just working with the LLMs. In lieu of an existing resource about this, I thought I'd start the wiki (with the hope of others contributing to it in the future).

I'd like this list to become an evaluation so that it's actually reproducible, but I don't have time to do that at the moment.

If you know of a weird behaviour that's not on this list, please add it!

GPT-5.1 to GPT-5.5 models seem to be somewhat obsessed with goblins, gremlins and other small fantasy creatures: "they increasingly mentioned goblins, gremlins, and other creatures in their metaphors" (source)
- Specific models affected: GPT-5 Thinking, GPT-5.1 Thinking, GPT-5.2 Thinking, GPT-5.4 Thinking, GPT-5.5 Thinking
GPT-4o was widely considered to be sycophantic, although I've struggled to find the version of 4o for which this was the worst. I believe they made several changes to the model they called GPT-4o that reduced the sycophancy over time before eventually retiring 4o (OpenAI blog post, Simon Willison weblog)
Gemma3-27b tends to break down when told that its answer is wrong (LW, arXiv)
- This seems to be a persistent issue with the Gemma/Gemini models from GDM (e.g. see these posts from GDM about trying to remove these behaviours)
- I attempted to reproduce this, and the behaviour is only present in Gemma3-27b when sampling with top_k=-1 and top_p=1.0 (i.e. sampling from the full range of tokens). Many providers now sample with something like top_k=64 and top_p=0.95 (e.g. DeepInfra via OpenRouter)
Gemma3, Gemma 4 & Gemini 3 (maybe also others) seem to be skeptical of dates in 2026 and beyond, claiming that anything happening in 2026 is just fictional or roleplay (LW post)
Claude Opus 4.8 seems to slip in non-English language tokens (in a sensible way), although I've not seen much of this beyond tweets.

Many of the Chinese models (Qwen, DeepSeek) show CCP-aligned behaviours and censorship LW post
Many LLMs have "attractor states", styles of talking and topics of conversation that they devolve into if you let them talk with each other for 30+ turns LW post.
Many LLMs have "glitch tokens" which cause them to be unable to answer the prompt, or to be unable to repeat that token, or to otherwise be unpredictable. SolidGoldMagikarp is the original LW post, although this file on GitHub (from Pliny the Liberator) and then this website (click the "bug" icon in the top right) have a large collection of glitch tokens and their origins
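
The sampling settings mentioned in the Gemma3-27b entry above can be illustrated with a toy filter. This is a minimal sketch, not any provider's actual implementation: `softmax` and `filter_top_k_top_p` are hypothetical helper names, the vocabulary is a made-up list of 1000 logits, and a non-positive `top_k` is treated as "disabled" to match the description of top_k=-1 in that entry.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=-1, top_p=1.0):
    """Return {token_id: renormalised probability} for tokens that survive.

    top_k <= 0 disables the top-k cut; top_p >= 1.0 disables the nucleus cut.
    """
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    kept, cumulative = [], 0.0
    for i in ranked:
        # Nucleus cut: stop once the kept tokens already cover top_p mass.
        if top_p < 1.0 and cumulative >= top_p:
            break
        kept.append(i)
        cumulative += probs[i]

    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


if __name__ == "__main__":
    # Toy vocabulary: token 0 is most likely, token 999 least likely.
    toy_probs = softmax([-0.05 * i for i in range(1000)])
    full = filter_top_k_top_p(toy_probs, top_k=-1, top_p=1.0)
    truncated = filter_top_k_top_p(toy_probs, top_k=64, top_p=0.95)
    print("tokens reachable, full sampling:", len(full))
    print("tokens reachable, top_k=64 / top_p=0.95:", len(truncated))
```

With the truncated settings the low-probability tail can never be sampled, which is at least consistent with a rare behaviour showing up only under full-distribution sampling.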

Causal Decision Theory – CDT – is a branch of decision theory which advises an agent to take the actions whose causal consequences maximize the probability of desired outcomes ¹. Like any branch of decision theory, it prescribes taking the action that maximizes expected utility, i.e. the action which maximizes the sum of the utility obtained in each outcome weighted by the probability of that outcome occurring, given your action.

CDT differs from "evidential decision theory" in that EDT says to just condition on one's actions as if they'd been seen as evidence. The conventional presentation of CDT differs from "functional decision theory" / "logical decision theory" in that classic CDT says to suppose a physically different act changing nothing else about the universe, its past, or the facts of mathematics; whereas LDT says to suppose that one's algorithm had yielded a different output and the rest of the universe was coherent in this respect.

That is, the three main kinds of expected utility -- though the last kind is relatively unknown inside academia, which imagines there to be only two kinds of expected utility -- could be mapped onto the difference between these three conditionals:

If Lee Harvey Oswald didn't shoot John F. Kennedy, someone else did. [Evidential conditional.]
If a physical miracle had intervened to make Lee Harvey Oswald not act to shoot John F. Kennedy, nobody else would've. [Physical-act counterfactual conditional.]
If the output of Lee Harvey Oswald's algorithm had not been to shoot John F. Kennedy, nobody else would've. [Logical-output counterpossible conditional.]
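
The three conditionals above can be read as three ways of filling in the probability term of an expected utility. The notation below is an assumption for illustration, not something the entry itself defines: $a$ is an action, $o$ an outcome, $U$ the utility function, $\mathrm{do}(\cdot)$ marks an intervention rather than an observation, and $\mathrm{Alg}$ stands for the output of the agent's decision algorithm. The third line in particular should be taken as a sketch, since how to formalize a logical counterpossible is not settled.

```latex
\begin{align*}
V_{\mathrm{EDT}}(a) &= \sum_{o} P(o \mid a)\, U(o)
  && \text{condition on the act as evidence} \\
V_{\mathrm{CDT}}(a) &= \sum_{o} P\bigl(o \mid \mathrm{do}(A = a)\bigr)\, U(o)
  && \text{intervene on the physical act} \\
V_{\mathrm{LDT}}(a) &= \sum_{o} P\bigl(o \mid \mathrm{do}(\mathrm{Alg} = a)\bigr)\, U(o)
  && \text{intervene on the algorithm's output}
\end{align*}
```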

Even as CDT critiques EDT for "an irrational policy of managing the news", LDT critiques CDT for attaining poorer outcomes across a broad range of Newcomblike problems, visualizing universes that seem internally incoherent, and...

beyarkay (Boyd Kane)	Rare LLM Behaviours (1)	4d
Joey Yudelson	Aether (14)	23d
abramdemski	Open-Minded Updatelessness (3)	1mo
AnnaSalamon	Christopher Alexander (8)	1mo

Wikitags in Need of Work

Newest Wikitags

Wikitag Voting Activity

Recent Wikitag Activity


You may want to look at the Founder Toolkit, which will probably be more up-to-date, especially on programs and funding.

Obsession with goblins, gremlins, and other small fantasy creatures in metaphors

Sycophancy

Breaking down when told the answer is wrong

Skepticism about dates in 2026 and beyond