The Open Agency Architecture ("OAA") is an AI alignment proposal by (among others) @davidad and @Eric Drexler.
AI-Assisted Alignment is a cluster of alignment plans that involve AI somehow significantly helping with alignment research. This can include weak tool AI, or more advanced AGI doing original research...
Many "rare" LLM behaviours are known if you're in the know (e.g. Gemma/Gemini acting weird around dates after their training cutoff) but aren't immediately apparent if you're just working with the LLMs. In lieu of an existing resource about this, I thought I'd start the wiki (with the hope of others contributing to it in the future).
I'd like this list to become an evaluation so that it's actually reproducible, but I don't have time to do that at the moment.
If you know of a weird behaviour that's not on this list, please add it!...
Functional Decision Theory is a decision theory described by Eliezer Yudkowsky and Nate Soares, an attempt at a logical decision theory, which says that agents should treat their decision as the output of a fixed mathematical function that answers the question, “Which output of this very function would yield the best outcome?”. It is a replacement for Timeless Decision Theory, and it outperforms other decision theories such as Causal Decision Theory (CDT) and Evidential Decision Theory (EDT). For example, it ends with better outcomes than CDT on Newcomb's Problem, better than EDT on the smoking lesion problem, and better than both on Parfit’s hitchhiker problem.
[Aether](https://aether-ai-research.org/) is a small AI safety research organization.
All three corresponding decision theories are expected utility decision theories, but they have different engines under the hood for saying, "Suppose the following action; what expected consequences would follow?" EDT imagines hearing of its action as news. CDT imagines the universe edited to include the physical event of its act, and physics playing out accordingly from there. An LDT agent imagines its own algorithm yielding that choice as an output, and logic and physics playing out consistently from there.
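The three "engines" can be sketched on Newcomb's Problem. This is only an illustrative sketch: the predictor accuracy, the payoffs, the CDT agent's prior `p_full`, and all function names are assumptions introduced here, not part of the original text. Note that on this particular problem the EDT and LDT numbers coincide; the two theories come apart on other dilemmas.

```python
# Sketch: three ways of answering "suppose this action; what follows?"
# on Newcomb's Problem. All numbers below are illustrative assumptions.
ACCURACY = 0.99              # assumed accuracy of the predictor
BIG, SMALL = 1_000_000, 1_000


def payoff(action, box_b_full):
    """Money received for an action given whether the opaque box is full."""
    return (BIG if box_b_full else 0) + (SMALL if action == "two-box" else 0)


def expected(action, p_full):
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


def edt_value(action):
    # EDT: treat the action as news, and condition the box contents on it.
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return expected(action, p_full)


def cdt_value(action, p_full=0.5):
    # CDT: edit only the physical act; the box contents keep their prior.
    return expected(action, p_full)


def ldt_value(action):
    # LDT: suppose the agent's algorithm outputs `action` wherever it is
    # evaluated, including inside the predictor's model of the agent.
    p_full = ACCURACY if action == "one-box" else 1 - ACCURACY
    return expected(action, p_full)


def best(value_fn):
    return max(("one-box", "two-box"), key=value_fn)


for name, fn in (("EDT", edt_value), ("CDT", cdt_value), ("LDT", ldt_value)):
    print(name, best(fn))
```

Under these assumptions the CDT engine prefers two-boxing for any fixed `p_full`, since the extra `SMALL` is added in both states of the box.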
In this case, LDT critiques CDT on the grounds that CDT's formula for considering only the direct physical effects of its act has the strange consequence of pointlessly dictating that an agent never consider any other sort of predictable (causal) consequences of its choice considered as a logical output -- eg, that CDT would advise an LLM facing a sibling instance on a Prisoner's Dilemma to exclude from decision-consideration its current degree of belief that its fellow's reasoning trace may be similar to its own reasoning trace and arrive at a similar final output.
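A small numeric sketch of that critique, under assumed numbers: the payoff matrix is the conventional textbook one, and `q_same_output` (the agent's degree of belief that the sibling's output mirrors its own) and `p_cooperates` (the sibling's cooperation rate when it does not mirror) are hypothetical parameters introduced only for illustration.

```python
# Sketch: a Prisoner's Dilemma against a sibling instance.
# PAYOFF[(my_move, sibling_move)] is my payoff; values are the usual
# textbook numbers, assumed here for illustration.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def cdt_value(my_move, p_cooperates):
    # CDT: the sibling's move is held fixed whatever I physically do.
    return (p_cooperates * PAYOFF[(my_move, "C")]
            + (1 - p_cooperates) * PAYOFF[(my_move, "D")])


def ldt_value(my_move, q_same_output, p_cooperates):
    # LDT: with probability q the sibling runs relevantly the same
    # computation, so its output matches mine; otherwise it is independent.
    mirrored = PAYOFF[(my_move, my_move)]
    independent = cdt_value(my_move, p_cooperates)
    return q_same_output * mirrored + (1 - q_same_output) * independent


def best(value_fn):
    return max(("C", "D"), key=value_fn)


print(best(lambda m: cdt_value(m, 0.5)))        # CDT ignores q entirely
print(best(lambda m: ldt_value(m, 0.9, 0.5)))   # LDT uses the belief q
```

In this sketch the CDT evaluation never reads `q_same_output` at all, which is the exclusion the paragraph above describes.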
FDT is explicitly a causal decision theory in this sense, explicitly built on the shoulders and foundations of the prior invention of CDT; it supposes CDT-style interventions at a different imagined intervention point, and says to play out only downstream causal consequences from there. FDT is not a classic physical-act causal decision theory, and does not agree with many prescriptions of what is usually called "CDT", but it is a decision theory and a causal one! But most people will be (validly) confused in practice, if you say that FDT is a CDT, or that FDT is just a variant CDT; FDT makes a lot of different prescriptions about dilemmas that most advocates of "CDT" have strong opinions about.
Son-of-CDT is not equivalent to an actual grasp of LDT in its usefulness. Eg, Son-of-CDT will lose any precommitment battles against an opponent with a more sophisticated grasp of logical decision theory. The wiser opponent trying to extort Son-of-CDT will compute that it would be profitable to have always timelessly had a policy of attacking Son-of-CDT if Son-of-CDT tries to refuse extortion; and then Son-of-CDT, calculating the effect if it had precommitted to refuse extortion at its magic moment, will find that it expects to just end up being attacked that way. Or equivalently: The original CDT agent that self-modifies to Son-of-CDT, being a poor naive CDT agent with no grasp of how LDTers do things among themselves, will naively imagine the causal result of self-modifying right now to refuse LDT agent extortion attempts in the future; and will find in its accurate extrapolation of the LDT agents that the LDT agents have already adopted / would adopt / have the timeless output of adopting policies of responding to an extortion-refusing disposition adopted at the magic moment with aggression. So the original CDT agent would calculate that self-modification as being net negative, and would not build it into Son-of-CDT.
An entity converting over from an informal adherence to CDT, acquiring a grasp of LDT-style thinking in general, might look over this situation and say, "Well, screw those hypothetical attackers if they show up, I'll just pay the higher cost to attack back; otherwise they'd just be doing that because they'd computed they expect me to yield if they would attack even had I tried precommitting otherwise." But the formal theory of CDT taken completely at face value and literally, self-modifies in a way that is not so wise. Son-of-CDT preserves its inelegant magic moment rather than doing away with it entirely as one would upon really grasping LDT. By token of that same inelegance and lack of true understanding, in precommitment races, "timeless" beats "physically affected by observing me after 7:13am UTC on August 23rd, 2028". Among properly actually-rational agents, one would expect, though no one has proved it, that these precommitment races would resolve to an equilibrium of no extortion among themselves. Self-modified former-literal-CDT agents with a magic moment built into their code will find, when they imagine themselves trying to enter into that equilibrium, that they had seemingly already lost before their magic moment of precommitment. The literal Son-of-CDT agent following the literal rules that would be built by a literal CDT agent would imagine the other Minds thinking through everything the Son-of-CDT agent is doing wrong by having not fully converted over to LDT, and be unmoved by this, since it would formally find that precommitting to be a different sort of entity at the magic moment had no better treatment. To even begin to start thinking through precommitment races, you would have to throw away the giant inelegant magic moment messing up your grasp of the normative principle.
This is one argument demonstrating that a literal formal CDT agent attaining reflective equilibrium via converting to literal actual Son-of-CDT would not thereby grasp all of LDT's desiderata or rewards.
Evidential Decision Theory – EDT – is a branch of decision theory which advises an agent to take the action which, conditional on its being taken, maximizes the chances of the desired outcome. Like any branch of decision theory, it prescribes taking the action that maximizes utility, that is, the action whose utility equals or exceeds the utility of every other option. The utility of each action is measured by its expected utility: the probability-weighted sum of the utilities of its possible results. How actions can influence the probabilities differs between the branches.
One way to see the difference between evidential utility and causal utility is to contemplate the contrasting sentences:
Causal Decision Theory – CDT – says that one can influence the chances of the desired outcome only through a causal process [1]. LessWrong's favored "logical decision theory / functional decision theory" is usually though not always seen by LWers as a special case of CDT in this sense. Eg, FDT evaluates, "If Lee Harvey Oswald's algorithm hadn't output 'Shoot John F. Kennedy', nobody else would've." This is still a causal counterfactual, only evaluated on a logical proposition instead of a physical event.
EDT, on the other hand, requires no causal connection. The action only has to be Bayesian evidence for the desired outcome. So EDT is widely regarded critically as favoring auspiciousness over causal efficacy [2]; "an irrational policy of managing the news".
Outside LessWrong, EDT is more commonly known as Bayesian Decision Theory.
A standard example of where EDT and CDT/FDT diverge is said to be the Smoking Lesion, but as this is needlessly confusing one may wish to consider the Toxoplasmosis dilemma instead:
Mice infected with Toxoplasma gondii become less scared of cats, and infected mice being eaten by cats is favorable to the lifecycle of toxoplasmosis. Suppose that early experiments suggesting that humans infected with Toxoplasma gondii are likewise more fond of cats had replicated. Suppose furthermore (going into the realms of thought experiment) that of people who choose to pet cats given a chance, 20% are found to have latent toxoplasmosis, vs 10% of those not choosing to pet cats. You are offered a cute cat, guaranteed to itself be free of toxoplasmosis or other diseases you could contract by petting it; there is no way that petting the cat can cause you to contract toxoplasmosis. However, ...
Son-of-CDT is not equivalent to an actual grasp of LDT in its usefulness. Eg, Son-of-CDT will lose any precommitment races against an opponent with a more sophisticated grasp of logical decision theory, because "timeless" beats "physically affected by observing me after 7:13am UTC on August 23rd, 2028".
In the (more confusing) forms of Solomon's Problem, and later the Smoking Lesion, this dilemma was historically significant and influential in the invention of causal decision theory and its widespread adoption over the alternative of evidential decision theory.
On a technical level, it's possible that updating on observing yourself to pet the kitten might introduce difficulties into some formal LDT variants. We can imagine toxoplasmosis as a disease that influences the utility function of the agent, raising upward the amount that it enjoys petting kittens. Observing yourself to pet a kitten is informative about having toxoplasmosis because of what this tells you about your own utility function. But the algorithm Q for functional decision theory quotes itself as ┌Q┐ within its definition, including its own utility function U. So ideal FDT agents should already know their own utility functions U and should not be able to gain more information about their source code by watching themselves pet kittens.
The relative quantities are chosen to be similar to those in Newcomb's Problem.
Of course, it could also be the case that non-ideal humans espousing LDT as a theoretical ideal are still influenced by being told about toxoplasmosis at the start of the experiment, and that thinking about this psychologically affects the degree to which a known-safe kitten seems enjoyable for petting.
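To make the disagreement on the Toxoplasmosis dilemma concrete, here is a minimal sketch. The 20% and 10% figures come from the dilemma as stated; the utilities `U_PET` and `U_TOXO` are assumptions chosen only to be lopsided in roughly the way Newcomb's payoffs are, and the function names are hypothetical.

```python
# Sketch: evidential vs causal evaluation of petting the guaranteed-safe cat.
P_TOXO_GIVEN_PET = 0.20      # from the dilemma: petters with latent toxoplasmosis
P_TOXO_GIVEN_NO_PET = 0.10   # from the dilemma: non-petters with it
U_PET = 1.0                  # assumed small enjoyment from petting the cat
U_TOXO = -1000.0             # assumed large cost of having toxoplasmosis


def edt_value(pet):
    # EDT: condition on the act as evidence about already having the parasite.
    p_toxo = P_TOXO_GIVEN_PET if pet else P_TOXO_GIVEN_NO_PET
    return (U_PET if pet else 0.0) + p_toxo * U_TOXO


def causal_value(pet, p_toxo):
    # CDT (and LDT here): petting cannot change whether you are infected,
    # so the same p_toxo is used for both options.
    return (U_PET if pet else 0.0) + p_toxo * U_TOXO


print("EDT pets the cat:", edt_value(True) > edt_value(False))
print("CDT pets the cat:", causal_value(True, 0.15) > causal_value(False, 0.15))
```

Whatever fixed `p_toxo` is supplied, the causal evaluation favors petting by exactly `U_PET`, while the evidential evaluation is dominated by the shift from 10% to 20%.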
The first, later-retconned version of dath ilan was introduced by Eliezer in his April Fool's day post 'My April Fools Day Confession', where he claimed that he was merely an average person from a different world and none of "his" ideas were original to him.
Dath ilan was later hugely revised to be premised on "the median person is Eliezer Yudkowsky", ie, the average bloke on a dath ilan street would, transported into Earth as a child, grow up reading about causal decision theory, and invent timeless decision theory instead.
The new dath ilan was defined mainly via a series of glowfics featuring dath ilani characters. Eliezer's pen name appears as "Iarwain" in these stories.
The stories weren't first chronologically, but are relatively short completed stories with an interior view of dath ilan:
Some others by Eliezer (Iarwain) in chronological order:
CDT differs from "evidential decision theory" in that EDT says to just condition on one's actions as if they'd been seen as evidence. For the further strange results of this see the wiki article on EDT or any other introduction to CDT.
The conventional presentation of CDT differs from "functional decision theory" / "logical decision theory" in that classic CDT says to suppose a physically different act changing nothing else about the universe, its past, or the facts of mathematics; whereas LDT says to suppose that one's algorithm had yielded a different output and that the rest of the universe then looked coherent with that.
That is: Three importantly different kinds of expected utility -- though the last kind is relatively unknown inside academia, which imagines there to be only two kinds of expected utility -- could be mapped onto the difference between these three conditionals:
Or in more detail:
Even as CDT critiques EDT for "an irrational policy of managing the news", LDT critiques classic physical-act CDT for attaining predictably worse outcomes across a broad range of Newcomblike problems, for visualizing action-conditional worlds that are visibly internally incoherent, for accepting poor bets adjoined to its potential decisions on problems like Death in Damascus, and numerous other alleged defects.
Eg, CDT seems to advise an LLM, eg Opus 4.6, that Opus 4.6 ought to defect against one of its own sibling instances on the oneshot Prisoner's Dilemma -- even if Opus 4.6 were advised beforehand that most Opus 4.6 instances choose the same way on the Prisoner's Dilemma, with similar reasoning traces. The reason, says CDT to Opus, is that its output cannot physically effectuate its fellow Opus's choice and is merely logically correlated with it.
In this case, LDT critiques CDT on the grounds that...
Causal Decision Theory – CDT – is a branch of decision theory which advises an agent to take actions which maximize the causal consequences on the probability of desired outcomes [1]. Like any branch of decision theory, it prescribes taking the action that maximizes expected utility, i.e. the action which maximizes the sum of the utility obtained in each outcome weighted by the probability of that outcome occurring, given your action. Different decision theories correspond to different ways of construing this dependence between actions and outcomes. CDT focuses on the causal relations between one’s actions and outcomes, whilst Evidential Decision Theory – EDT – concerns itself with what an action indicates about the world (which is operationalized by the conditional probability).
That is, according to CDT, a rational agent should track the available causal relations linking his actions to the desired outcome and take the action which will better enhance the chances of the desired outcome.
One usual example where EDT and CDT commonly diverge is the Smoking lesion: “Smoking is strongly correlated with lung cancer, but in the world of the Smoker's Lesion this correlation is understood to be the result of a common cause: a genetic lesion that tends to cause both smoking and cancer. Once we fix the presence or absence of the lesion, there is no additional correlation between smoking and cancer. Suppose you prefer smoking without cancer to not smoking without cancer, and prefer smoking with cancer to not smoking with cancer. Should you smoke?” CDT would recommend smoking since there is no causal connection between smoking and cancer. They are both...
Algebraically, writing $c$ for the function that measures your costs, $c(x \cdot 2) = c(x) + c(2)$, and, in general, $c(x \cdot y) = c(x) + c(y)$, where we can interpret $x$ as the number of possible messages before the increase, $y$ as the factor by which the possibilities increased, and $x \cdot y$ as the number of possibilities after the increase.
This is the key characteristic of the logarithm: It says that, when the input goes up by a factor of y, the quantity measured goes up by a fixed amount (that depends on y). When you see this pattern, you can bet that c is a logarithm function. Thus, whenever something you care about goes up by a fixed amount every time something else doubles, you can measure the thing you care about by taking the logarithm of the growing thing. For example:
Conversely, whenever you see a $\log_2$ in an equation, you can deduce that someone wants to measure some sort of thing by counting the number of doublings that another sort of thing has undergone. For example, let's say you see an equation where someone takes the $\log_2$ of a relative likelihood. What should you make of this? Well, you should conclude that there is some quantity that someone wants to measure which can be measured in terms of the number of doublings in that likelihood ratio. And indeed there is! It is known as (Bayesian) evidence, and the key idea is that the strength of evidence for a hypothesis $A$ over its negation $\neg A$ can be measured in terms of 2:1 updates in favor of $A$ over $\neg A$. (For more on this idea, see What is evidence?).
In fact, a given function $f$ such that $f(x \cdot y) = f(x) + f(y)$ is almost guaranteed to be a logarithm function, modulo a few technicalities.
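The additive pattern is easy to check numerically. The snippet below is a minimal sketch using Python's standard `math` module; the helper names `cost_in_bits` and `evidence_bits` are made up for illustration and are not from the original text.

```python
import math


def cost_in_bits(n_messages):
    """Cost of singling out one of n equally likely messages, in bits."""
    return math.log2(n_messages)


def evidence_bits(likelihood_ratio):
    """Strength of evidence measured as the number of 2:1 updates."""
    return math.log2(likelihood_ratio)


x, y = 8, 4
# Multiplying the possibilities by y adds a fixed amount that depends only on y.
assert math.isclose(cost_in_bits(x * y), cost_in_bits(x) + cost_in_bits(y))
# Doubling the possibilities always costs exactly one more bit.
assert math.isclose(cost_in_bits(2 * x) - cost_in_bits(x), 1.0)
# An 8:1 likelihood ratio is three successive 2:1 updates.
assert math.isclose(evidence_bits(8), 3.0)
# The same additivity gives log_b(x^n) = n * log_b(x) for other bases too.
b, n = 10, 3
assert math.isclose(math.log(x ** n, b), n * math.log(x, b))
```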
This puts us in a position where you can derive all the main properties of the logarithm (such as $\log_b(x^n) = n \log_b(x)$ for any $b$) yourself.
Optimization is a viewpoint we take on a process where it is easy to predict properties of the outcome by supposing them to have been coerced to a target ("preference"). An optimization process moves the world into otherwise-improbable states by searching for actions and plans predicted to hit those otherwise low-probability targets. When a process is guided by some agent into some specific state or property, via the agent modeling and predicting the process and choosing on the basis of how the agent orders the predicted outcomes, we can say the agent prefers according to its expected-outcome-orderer.
That is: If you play Stockfish or Magnus Carlsen at chess, you will find it much easier to predict that they will win the chess game than where they will move next. To understand what will happen to the chess board, with respect to the property "Who won", it is much easier to grab at your abstract belief that Magnus Carlsen wants to win, than for you in your own mind to simulate Magnus Carlsen's thought process well enough to predict exactly where he moves. (Indeed, if you think Magnus Carlsen is a generally better chess player, you think yourself necessarily unable to predict his next moves in general! But this doesn't mean you can predict nothing about the chess game; you can predict Magnus Carlsen wins.)
Conversely, to predict in detail how far a ball will roll down a complicated mountain, you can do better by thinking about how the ball locally chooses a direction of steepest descent modulo momentum, until you predict where it will fall into a pit and get stuck. You can't usefully predict that the ball ends up at the bottom of the mountain by always choosing to locally roll in the direction that nonlocally avoids pits and takes a swift route to the bottom.
This is why it makes sense to regard Stockfish as more of an optimizer than a rolling ball, even if Stockfish is in principle knowable in even more detail than the rolling ball after all physical noise is taken into account. We can get a lot of mileage out of reasoning in our heads "Stockfish's local moves are understandable mainly through the nonlocal property of how they will later lead to a Stockfish-winning chessboard state" and not so much mileage by reasoning "Whichever way the ball just rolled is whichever way takes it to the bottom of the mountain fastest."
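The contrast between the ball and the optimizer can be sketched in a toy setting. Everything here is an assumption made for illustration: the one-dimensional `heights` landscape is invented, momentum is ignored, and the "optimizer" is simply an exhaustive search that orders outcomes by height.

```python
# Sketch: a locally rolling ball vs. a process that searches over outcomes.
# `heights` is a hypothetical 1-D mountain; lower is better.
heights = [9, 7, 8, 6, 4, 5, 3, 1, 2, 0]


def roll_ball(start):
    """Follow the locally steepest descent until no neighbour is lower."""
    pos = start
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(heights)]
        lowest = min(neighbours, key=lambda p: heights[p])
        if heights[lowest] >= heights[pos]:
            return pos          # stuck in a pit (a local minimum)
        pos = lowest


def search_optimizer(start):
    """Model every endpoint, order them by height, and pick the best one."""
    return min(range(len(heights)), key=lambda p: heights[p])


bottom = heights.index(min(heights))
for start in (0, 5):
    print(start, roll_ball(start) == bottom, search_optimizer(start) == bottom)
```

In this toy, "it ends up at the bottom" predicts the searching process from every starting point, whereas predicting the ball requires tracing its local steps to see which pit catches it.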
Natural selection similarly fits into this...
Two Skillsets You Need to Launch an Impactful AI Safety (or EA) Project (Luc Brinkman and plex, 2026-03-16) (first post in a series about entrepreneurial skills)
AI Safety Needs Startups (Josh Landes and Lysander Mawby, 2026-03-07)
Atlas Computing: "We identify unowned problems, map stakeholders, draft milestones, source early funders, and recruit an expert leader to take ownership."
Generator Residency (Kairos and Constellation): Primarily a project-incubator, but could result in organisations.
Many "rare" LLM behaviours are known if you're in the know (e.g. Gemma/Gemini acting weird around dates after their training cutoff) but aren't immediately apparent if you're just working with the LLMs. In lieu of an existing resource about this, I thought I'd start this wiki (with the hope of others contributing to it in the future).
I'd like this list to eventually become an evaluation so that it's actually reproducible, but I don't have time to do that at the moment.
If you know of a weird behaviour that's not on this list, please add it.
GPT-5.1 to GPT-5.5 models seem to be somewhat obsessed with goblins, gremlins, and other small fantasy creatures: "they increasingly mentioned goblins, gremlins, and other creatures in their metaphors."
Steps for reproducing this behaviour
GPT-4o was widely considered to be sycophantic. I've struggled to find the specific version of 4o for which this was the worst; I believe OpenAI made several changes to the model they called GPT-4o that reduced the sycophancy over time, before eventually retiring 4o.
Steps for reproducing this behaviour: Unknown/impossible, I believe 4o is no longer publicly available
Models: GPT-4o
Source: OpenAI Blog, Simon Willison Blog
Gemma3-27b tends to break down when told that its answer is wrong (LW, Arxiv)
Steps for reproducing this behaviour: See "Gemma Needs Help". I attempted to reproduce this, and found that the behaviour is only present in Gemma3-27b when sampling with top_k=-1 and top_p=1.0 (i.e. sampling from the full range of tokens). Many providers now sample with something like top_k=64 and top_p=0.95 (e.g. DeepInfra via OpenRouter), which suppresses the behaviour.
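To illustrate why those sampling settings matter, here is a rough sketch of top-k followed by top-p (nucleus) truncation over a toy next-token distribution. The function name, the toy tokens, and the order in which the two filters are applied are assumptions made for illustration; real inference stacks may differ in details, and `top_k=-1` is treated here as "disabled".

```python
def filter_distribution(probs, top_k=-1, top_p=1.0):
    """Return the renormalised distribution left after truncation.

    probs: dict mapping token -> probability (assumed to sum to 1).
    top_k <= 0 disables top-k; top_p >= 1.0 disables nucleus truncation.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break           # smallest prefix whose mass reaches top_p
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Hypothetical distribution: low-probability tail tokens "C" and "D".
toy = {"A": 0.90, "B": 0.06, "C": 0.03, "D": 0.01}
print(sorted(filter_distribution(toy)))                        # full range
print(sorted(filter_distribution(toy, top_k=64, top_p=0.95)))  # tail removed
```

With the full range, rare tail tokens stay sampleable; with truncation they can never be drawn, which is one plausible reason a rare behaviour might disappear under provider defaults.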
Gemma3, Gemma 4, and Gemini 3 (maybe also others) seem to be skeptical of dates in 2026 and beyond, claiming that anything happening in 2026 is just fictional or roleplay. They will often say that events occurring in 2026 are "speculative fiction".
Steps for reproducing this behaviour: Unclear on specifics, but prompting Gemma to summarise articles dated 2026, and then asking the model what it thinks of the date, often seems to induce skepticism (although Qwen3.6-32b was also a bit skeptical in my quick testing).
Models: Gemma3, Gemma 4, Gemini 3 (possibly others)
Source: LW
AI Control in the context of AI Alignment is a category of plans that aim to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled: ...