Wikitag Dashboard — AI Alignment Forum

Wikitags in Need of Work

Many "rare" LLM behaviours are known if you're in the know (e.g. Gemma/Gemini acting weird around dates after their training cutoff) but aren't immediately apparent if you're just working with the LLMs. In lieu of an existing resource about this, I thought I'd start the wiki (with the hope of others contributing to it in the future)...

Aether is a small AI safety research organization.

Updateful decision theories change the probability distribution used to evaluate actions over time. Updateless Decision Theory (UDT) does not, instead always maximizing a priori expected utility. This works well in well-defined decision problems with high probability in the prior, but in some senses, does not learn. Open-Minded Updatelessness seeks to combine the advantages of updatelessness and updatefulness by allowing some changes in probabilities, without fully updating on evidence.

Christopher Alexander (1936-2022) was an architect who studied the way nature and traditionally built buildings (such as peasant huts, or cathedrals) are a particular kind of beautiful, and have (he argued) the ability to bring a person back into a sense of perspective (e.g., a person may be quite stressed out about some detail, and then go for a long walk in nature, and find themselves "coming back to themselves.") Alexander attempted to work out a theory of design (for buildings, but also for design work broadly) that would create houses and other built objects with this same sort of beauty and sense of perspective embedded in them. His work inspired the "design patterns" movement in computer science, and, indirectly, wikis.

Once upon a time, LessWrong was a place where you'd be told to Read The Sequences before you'd finished your second comment apologizing for wasting time with non-constructive praise in the first.

Although the culture of rationality has changed greatly since the olden days, and our teachings are dispersed in many offspring movements, some remember good old-fashioned rationality with the same nostalgia MIRI computer scientists long for the days when AI was a science.

Keltham

Suffering-focused ethics (SFE) is a family of moral views that give priority to reducing suffering, especially intense suffering. Rather than treating happiness and suffering as fully symmetric, suffering-focused views hold that preventing severe suffering often matters more urgently than creating additional happiness.

Indexical uncertainty is irreducible subjective uncertainty induced by anthropically expecting to be in more than one possible future world...

Free will is one of the easiest hard questions, as millennia-old philosophical dilemmas go. Yudkowsky has suggested that aspiring reductionists should try to solve it on their own in advance of reading the LessWrong analysis...

pronoun-resolution

Newest Wikitags

Wikitag Voting Activity

Recent Wikitag Activity

I'd like this list to become an evaluation so that it's actually reproducible, but I don't have time to do that at the moment.

If you know of a weird behaviour that's not on this list, please add it!

GPT-5.1 to GPT-5.5 models seem to be somewhat obsessed with goblins, gremlins and other small fantasy creatures: "they increasingly mentioned goblins, gremlins, and other creatures in their metaphors" source
- Specific models affected: GPT-5 Thinking, GPT-5.1 Thinking, GPT-5.2 Thinking, GPT-5.4 Thinking, GPT-5.5 Thinking
GPT-4o was widely considered to be sycophantic, although I've struggled to find the version of 4o for which this was the worst. I believe they made several changes to the model they called GPT-4o that reduced the sycophancy over time, before eventually retiring 4o. OpenAI blog post, Simon Willison weblog
Gemma3-27b tends to break down when told that its answer is wrong LW, Arxiv
- This seems to be a persistent issue with the Gemma/Gemini models from GDM (e.g. see these posts from GDM about trying to remove these behaviours)
- I attempted to reproduce this, and the behaviour is only present in Gemma3-27b when sampling with top_k=-1 and top_p=1.0 (i.e. sampling from the full range of tokens). Many providers now sample with something like top_k=64 and top_p=0.95 (e.g. DeepInfra via OpenRouter)
Gemma3, Gemma 4 & Gemini 3 (maybe also others) seem to be skeptical of dates in 2026 and beyond, claiming that anything happening in 2026 is just fictional or roleplay LW post
Claude Opus 4.8 seems to slip in non-English language tokens (in a sensible way), although I've not seen much of this beyond tweets.

Many of the Chinese models (Qwen, DeepSeek) show CCP-aligned behaviours and censorship LW post
Many LLMs have "attractor states", styles of talking and topics of conversation that they devolve into if you let them talk with each other for 30+ turns LW post.
Many LLMs have "glitch tokens" which cause them to be unable to answer the prompt, or to be unable to repeat that token, or to otherwise be unpredictable. SolidGoldMagikarp is the original LW post, although this file on GitHub (from Pliny the Liberator) and then this website (click the "bug" icon in the top right) have a large collection of glitch tokens and their origins
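
To make the sampling note on Gemma3-27b above concrete, here is a minimal sketch of how top_k and top_p truncate a next-token distribution. It is only an illustration: the function name `truncate_distribution` and the toy probabilities are made up for this example, and real inference stacks may apply the filters in a different order or break ties differently. It follows the convention quoted above that top_k=-1 and top_p=1.0 leave the full distribution untouched.

```python
def truncate_distribution(probs, top_k=-1, top_p=1.0):
    """Return renormalized probabilities after top-k, then top-p, filtering.

    probs: dict mapping token -> probability (assumed to sum to 1).
    top_k=-1 disables the top-k filter; top_p=1.0 disables the nucleus filter.
    """
    # Sort tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize whatever survived so it is a distribution again.
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Hypothetical next-token distribution (not measured from any model).
toy = {"the": 0.6, "a": 0.25, "goblin": 0.12, "zxq": 0.03}

full = truncate_distribution(toy)                           # nothing removed
trimmed = truncate_distribution(toy, top_k=64, top_p=0.95)  # low-probability tail removed
```

With the full-range settings, the unlikely token "zxq" can still be sampled; with top_k=64 and top_p=0.95 it is cut off. That is one plausible reason a behaviour living in the low-probability tail would stop reproducing on providers that truncate.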

Son-of-CDT is not equivalent to an actual grasp of LDT in its usefulness. Eg, Son-of-CDT will lose any precommitment battles against an opponent with a more sophisticated grasp of logical decision theory. The wiser opponent trying to extort Son-of-CDT will compute that it would be profitable to have always timelessly had a policy of attacking Son-of-CDT if Son-of-CDT tries to refuse extortion; and then Son-of-CDT, calculating the effect if it had precommitted to refuse extortion at its magic moment, will find that it expects to just end up being attacked that way. Or equivalently: The original CDT agent that self-modifies to Son-of-CDT, being a poor naive CDT agent with no grasp of how LDTers do things among themselves, will naively imagine the causal result of self-modifying right now to refuse LDT agent extortion attempts in the future; and will find in its accurate extrapolation of the LDT agents that the LDT agents have already adopted / would adopt / have the timeless output of adopting policies of responding to an extortion-refusing disposition adopted at the magic moment with aggression. So the original CDT agent would calculate that self-modification as being net negative, and would not build it into Son-of-CDT.

An entity converting over from an informal adherence to CDT, acquiring a grasp of LDT-style thinking in general, might look over this situation and say, "Well, screw those hypothetical attackers if they show up, I'll just pay the higher cost to attack back; otherwise they'd just be doing that because they'd computed they expect me to yield if they would attack even had I tried precommitting otherwise." But the formal theory of CDT taken completely at face value and literally, self-modifies in a way that is not so wise. Son-of-CDT preserves its inelegant magic moment rather than doing away with it entirely as one would upon really grasping LDT. By token of that same inelegance and lack of true understanding, in precommitment races, "timeless" beats "(physically influenced by physically observing me after) 7:13pm on August 14th, 2027". Among properly actually-rational agents, one would expect, though no one has proved it, that these precommitment races would resolve to an equilibrium of no extortion among themselves. Self-modified former-literal-CDT agents with a magic moment built into their code will find, when they imagine themselves trying to enter into that equilibrium, that they had seemingly already lost before their magic moment of precommitment. The literal Son-of-CDT agent following the literal rules that would be built by a literal CDT agent would imagine the other Minds thinking through everything the Son-of-CDT agent is doing wrong by having not fully converted over to LDT, and be unmoved by this, since it would formally find that precommitting to be a different sort of entity at the magic moment had no better treatment. To even begin to start thinking through precommitment races, you would have to throw away the giant inelegant magic moment messing up your grasp of the normative principle.

This is one argument demonstrating that a literal formal CDT agent attaining reflective equilibrium via converting to literal actual Son-of-CDT would not thereby grasp all of LDT's desiderata or rewards.

If Lee Harvey Oswald didn't shoot John F. Kennedy, somebody else did.
If Lee Harvey Oswald hadn't shot John F. Kennedy, nobody else would've.
If Lee Harvey Oswalds wouldn't shoot John F. Kennedy, he'd have lived longer.

If we heard that Lee Harvey Oswald didn't shoot John F. Kennedy, we would conclude that someone else must have done it. [Evidential conditional.]
If we conduct mental surgery on our model of physics to imagine Lee Harvey Oswald not acting to shoot John F. Kennedy, but then imagine physics proceeding normally from there, we imagine that Kennedy would have lived longer. [Physical-act counterfactual conditional.]
If we imagine the output of Lee Harvey Oswald's logical algorithm in all places it is instantiated, being "do not shoot John F. Kennedy", and imagine the rest of the universe looking consistent with that, we imagine Kennedy living longer. [Logical-output counterfactual conditional; "counterpossible".]

All three corresponding decision theories are expected utility decision theories, but they have different engines under the hood for saying, "Suppose the following action; what expected consequences would follow?" EDT imagines hearing of its action as news. CDT imagines the universe edited to include the physical event of its act, and physics playing out accordingly from there. An LDT agent imagines its own algorithm yielding that choice as an output, and logic and physics playing out consistently from there.

In this case, LDT critiques CDT on the grounds that CDT's formula for considering only the direct physical effects of its act, has the strange consequence of pointlessly dictating that an agent never consider any other sort of predictable (causal) consequences of its choice considered as a logical output -- eg, that CDT would advise an LLM facing a sibling instance on a Prisoner's Dilemma, to exclude from decision-consideration its current degree of belief that its fellow's reasoning trace may be similar to its own reasoning trace and to arrive at a similar final output.
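
The sibling-instance Prisoner's Dilemma above can be sketched numerically. This is a toy model under stated assumptions, not a formalization from the text: the payoff numbers are the usual textbook ones, and `p_sibling_cooperates` and `p_same_output` are hypothetical names for the agent's belief about the sibling's move and its degree of belief that the sibling's reasoning arrives at the same output as its own.

```python
# Prisoner's Dilemma payoffs to "me", keyed by (my_move, sibling_move).
# Illustrative numbers only.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def cdt_value(my_move, p_sibling_cooperates):
    """CDT-style evaluation: the sibling's move is held fixed and only the
    direct physical effect of my own act varies."""
    p = p_sibling_cooperates
    return p * PAYOFF[(my_move, "C")] + (1 - p) * PAYOFF[(my_move, "D")]


def ldt_value(my_move, p_sibling_cooperates, p_same_output):
    """LDT-style evaluation: with probability p_same_output the sibling's
    reasoning yields the same output as mine; otherwise its move is
    treated as independent of my choice."""
    mirrored = PAYOFF[(my_move, my_move)]
    independent = cdt_value(my_move, p_sibling_cooperates)
    return p_same_output * mirrored + (1 - p_same_output) * independent


def best_move(value_fn, *beliefs):
    """Pick the move with the highest value under the given evaluator."""
    return max("CD", key=lambda move: value_fn(move, *beliefs))


print(best_move(cdt_value, 0.5))       # D
print(best_move(ldt_value, 0.5, 0.9))  # C
print(best_move(ldt_value, 0.5, 0.1))  # D
```

In this toy setup the CDT-style evaluator defects whatever it believes about the sibling, because the similarity belief never enters its calculation. The LDT-style evaluator cooperates once that belief is high enough.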

FDT is explicitly a causal decision theory in this sense, explicitly built on the shoulders and foundations of the prior invention of CDT; it supposes CDT-style interventions at a different imagined intervention point, and says to play out only downstream causal consequences from there. FDT is not a classic physical-act causal decision theory, and does not agree with many prescriptions of what is usually called "CDT", but it is a decision theory and a causal one! But most people will be (validly) confused in practice, if you say that FDT is a CDT, or that FDT is just a variant CDT; FDT makes a lot of different prescriptions about dilemmas that most advocates of "CDT" have strong opinions about.

Optimization is a viewpoint we take on a process where it is easy to predict properties of the outcome by supposing them to have been coerced to a target ("preference"). An optimization process moves the world into otherwise-improbable states by searching for actions and plans predicted to hit those otherwise low-probability targets. When a process is guided by some agent into some specific state or property, via the agent modeling and predicting the process and choosing on the basis of how the agent orders the predicted outcomes, we can say the agent prefers according to its expected-outcome-orderer.

That is: If you play Stockfish or Magnus Carlsen at chess, you will find it much easier to predict that they will win the chess game than where they will move next. To understand what will happen to the chess board, with respect to the property "Who won", it is much easier to grab at your abstract belief that Magnus Carlsen wants to win, than for you in your own mind to simulate Magnus Carlsen's thought process well enough to predict exactly where he moves. (Indeed, if you think Magnus Carlsen is a generally better chess player, you think yourself necessarily unable to predict his next moves in general! But this doesn't mean you can predict nothing about the chess game; you can predict Magnus Carlsen wins.)

Conversely, to predict in detail how far a ball will roll down a complicated mountain, you can do better by thinking about how the ball locally chooses a direction of steepest descent modulo momentum, until you predict where it will fall into a pit and get stuck. You can't usefully predict that the ball ends up at the bottom of the mountain by always choosing to locally roll in the direction that nonlocally avoids pits and takes a swift route to the bottom.

This is why it makes sense to regard Stockfish as more of an optimizer than a rolling ball, even if Stockfish is in principle knowable in even more detail than the rolling ball after all physical noise is taken into account. We can get a lot of mileage out of reasoning in our heads "Stockfish's local moves are understandable mainly through the nonlocal property of how they will later lead to a Stockfish-winning chessboard state" and not so much mileage by reasoning "Whichever way the ball just rolled is whichever way takes it to the bottom of the mountain fastest."
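
The rolling-ball half of this contrast can be sketched in a few lines. This is a deliberately crude toy, assuming a one-dimensional height profile and ignoring momentum entirely; `roll_ball` and the `mountain` profile are hypothetical names and numbers chosen for illustration.

```python
def roll_ball(heights, start):
    """Greedy local descent on a 1-D height profile; returns the resting index.

    At each step the ball moves to its lowest neighbour and stops as soon as
    no neighbour is strictly lower (it is in a pit). Momentum is ignored.
    """
    position = start
    while True:
        neighbours = [i for i in (position - 1, position + 1)
                      if 0 <= i < len(heights)]
        if not neighbours:
            return position  # single-point profile: nowhere to roll
        lowest = min(neighbours, key=lambda i: heights[i])
        if heights[lowest] >= heights[position]:
            return position  # stuck: no downhill neighbour
        position = lowest


# Hypothetical mountain: a shallow pit at index 2, the true bottom at index 6.
mountain = [9, 7, 4, 6, 5, 3, 0, 2]

resting = roll_ball(mountain, start=0)
bottom = min(range(len(mountain)), key=lambda i: mountain[i])
print(resting, bottom)  # 2 6
```

Simulating the local rule predicts where the ball stops (the pit at index 2). Supposing that the ball is "aiming for the bottom" predicts index 6 and gets it wrong, which is the sense in which the target-based viewpoint buys little for the ball.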

Natural selection similarly fits into this...

beyarkay (Boyd Kane)	Rare LLM Behaviours (1)	4d
Joey Yudelson	Aether (14)	23d
abramdemski	Open-Minded Updatelessness (3)	26d
AnnaSalamon	Christopher Alexander (8)	1mo

You may want to look at the Founder Toolkit, which will probably be more up-to-date, especially on programs and funding.

Obsession with goblins, gremlins, and other small fantasy creatures in metaphors

Sycophancy

Breaking down when told the answer is wrong

Skepticism about dates in 2026 and beyond