I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can't keep dropping in probability forever, since evidence is lost. The probability only becomes small - but this means if you run for long enough you do in fact expect the transition.
LLMs are high-order Markov models, meaning they can't really balance two different hypotheses in the way you describe; because evidence eventually drops out of memory, the probability of Waluigi becomes very small instead of dropping to zero. This makes an eventual waluigi transition inevitable, as claimed in the post.
I disagree. The crux of the matter is the limited memory of an LLM. If the LLM had unlimited memory, then every Luigi act would accumulate a little more evidence against Waluigi. But because LLMs can only update on so much context, the probability drops to some small value instead of continuing to drop toward zero. This makes waluigi inevitable in the long run.
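To make the limited-memory argument concrete, here is a toy calculation. It is only a sketch under assumptions that are mine, not the post's: there are just two hypotheses, a waluigi acts Luigi-like with probability `q` at each step, the chance of a visible transition at each step is P(waluigi | remembered context) times `(1 - q)`, and "limited memory" means only the last `window` Luigi acts count as evidence. All names and numbers are hypothetical.

```python
def waluigi_posterior(prior, q, n_evidence):
    """P(waluigi) after n_evidence Luigi-like acts.

    Assumes a true Luigi always acts Luigi-like, and a waluigi
    acts Luigi-like with probability q on each step.
    """
    w = prior * q ** n_evidence
    return w / (w + (1.0 - prior))


def survival(prior, q, steps, window=None):
    """Probability that no waluigi transition happens in `steps` steps.

    window=None models unlimited memory: every past act counts as evidence.
    window=k models a bounded context: at most k past acts count.
    """
    alive = 1.0
    for t in range(steps):
        seen = t if window is None else min(t, window)
        # Per-step chance of a transition, given what is remembered.
        hazard = waluigi_posterior(prior, q, seen) * (1.0 - q)
        alive *= 1.0 - hazard
    return alive


if __name__ == "__main__":
    for steps in (100, 1_000, 20_000):
        unlimited = survival(0.5, 0.9, steps)
        bounded = survival(0.5, 0.9, steps, window=50)
        print(steps, unlimited, bounded)
```

With unlimited memory the per-step risk shrinks geometrically, so the chance of never transitioning levels off near `1 - prior`. With a bounded window the posterior has a floor, the per-step risk is bounded below by a positive constant, and the chance of never transitioning decays toward zero as the run gets longer.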
So to see if I have this right, the difference is I'm trying to point at a larger phenomenon and you mean teleosemantics to point just at the way beliefs get constrained to be useful.
This doesn't sound quite right to me. Teleosemantics is a purported definition of belief. So according to the teleosemantic picture, it isn't a belief if it's not trying to accurately reflect something.
The additional statement I prefaced this with, that accuracy is an instrumentally convergent subgoal, was intended to be an explanation of why this sort of "belief" is a common phenomenon, rather than part of the definition of "belief".
In principle, there could be a process which only optimizes accuracy and doesn't serve any larger goal. This would still be creating and maintaining beliefs according to the definition of teleosemantics, although it would be an oddity. (How did it get there? How did a non-agentic process end up creating it?)
FIAT (by another name) was previously proposed in the book On Intelligence. The version there had a somewhat predictive-processing-like story where the cortex makes plans by prediction alone; so reflective agency (really meaning: agency arising from the cortex) is entirely dependent on building a self-model which predicts agency. Other parts of the brain are responsible for the reflexes which provide the initial data which the self-model gets built on (similar to your story).
The continuing kick toward higher degrees of agency comes from parts of the brain which have reactions to the predictions made by the cortex. (Otherwise, the cortex just learns to predict the raw reflexes, and we're stuck imitating our baby selves or something along those lines.)
It's not clear precisely how all of that works, but basically it means we have a pure predictive system (and much of the time we simply take the predicted actions), plus we have some other stuff (EG reflexes, and an override RLish system which inhibits and/or replaces the predicted action under some circumstances).
The most obvious version of FIAT which someone might write down after reading your post, on the other hand, is more like: run some IRL technique on your own past actions, and then (most of the time) plan based on the inferred goals, again with some overrides (built-in reflexes).
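A minimal sketch of that "obvious version", just to pin down what I mean. Everything here is a hypothetical toy of my own, not anything from the post or from On Intelligence: the world is a line of integer positions, the "IRL technique" is maximum likelihood over candidate goal positions under a softmax (Boltzmann-rational) action model, and the reflex is an arbitrary function that can override the plan.

```python
import math

ACTIONS = (-1, +1)  # step left or step right on a line of positions


def action_loglik(state, action, goal, beta=2.0):
    """Log-probability of `action` if the agent noisily moves toward `goal`."""
    scores = {a: -beta * abs(state + a - goal) for a in ACTIONS}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[action] - log_z


def infer_goal(history, candidate_goals):
    """Toy IRL: pick the goal that best explains past (state, action) pairs."""
    return max(
        candidate_goals,
        key=lambda g: sum(action_loglik(s, a, g) for s, a in history),
    )


def act(state, history, candidate_goals, reflex=lambda s: None):
    """Plan toward the inferred goal unless a built-in reflex overrides."""
    override = reflex(state)
    if override is not None:
        return override
    goal = infer_goal(history, candidate_goals)
    return min(ACTIONS, key=lambda a: abs(state + a - goal))


if __name__ == "__main__":
    # Past behavior that happens to look like "heading for position 5".
    history = [(2, +1), (3, +1), (4, +1), (6, -1), (7, -1)]
    goals = range(10)
    print(infer_goal(history, goals))
    print(act(2, history, goals))
    # Hypothetical reflex: always flinch left at position 4.
    print(act(4, history, goals, reflex=lambda s: -1 if s == 4 else None))
```

The point of the sketch is the division of labor: the goal is never given directly, only read off past actions, and the reflex acts before any planning happens.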
Here's my attempt to make a probably-false prediction from FIAT, as best I can.
It seems like the thing to do is to look for cases where people pursue their own goals, rather than the goals they would predict they have based on past actions.
It needs to be complex enough to not plausibly be a reflex/instinct.
A sort of plausible example is courtship. It's complex, it can't easily be inferred from previous things you did (not the first time you do it, that is), and it agentically orients toward a goal. The problem is, I think it's well-explained as imitation - "I'm a person; the people around me do this and seem really into it; so I infer that I'm really into it too".
So it's got to be a case where someone does something unexpected, even to themselves, which they don't see people do, but which achieves goals-they-plausibly-had-in-hindsight.
Homosexual intercourse in the 1800s??
Christopher Thomas Knight heading off into the woods??
One thing I see as different between your perspective and (my understanding of) teleosemantics, so far:
You make a general case that values underlie beliefs.
Teleosemantics makes a specific claim that the meaning of semantic constructs (such as beliefs and messages) is pinned down by what they are trying to correspond to.
Your picture seems very compatible with, EG, the old LW claim that UDT's probabilities are really a measure of caring - how much you care about doing well in a variety of scenarios.
Teleosemantics might fail to analyze such probabilities as beliefs at all; certainly not beliefs about the world. (Perhaps beliefs about how important different scenarios are, where "importance" gets some further analysis...)
The teleosemantic picture is that epistemic accuracy is a common, instrumentally convergent subgoal; and "meaning" (in the sense of semantic content) arises precisely where this subgoal is being optimized.
That's my guess at the biggest difference between our two pictures, anyway.
OK. So far it seems to me like we share a similar overall take, but I disagree with some of your specific framings and such. I guess I'll try and comment on the relevant posts, even though this might imply commenting on some old stuff that you'll end up disclaiming.
(Following some links...) What's the deal with Holons?
Your linked article on epistemic circularity doesn't really try to explain itself, but rather links to this article, which LOUDLY doesn't explain itself.
I haven't read much else yet, but here is what I think I get:
Not something you wrote, but Viliam trying to explain you:
There is an "everything of everythings", exceeding all systems, something like the highest level Tegmark multiverse only much more awesome, which is called "holon", or God, or Buddha. We cannot approach it in far mode, but we can... somehow... fruitfully interact with it in near mode. Rationalists deny it because their preferred far-mode approach is fruitless here. But you can still "get it" without necessarily being able to explain it by words. Maybe it is actually inexplicable by words in principle, because the only sufficiently good explanation for holon/God/Buddha is the holon/God/Buddha itself. If you "get it", you become the Kegan-level-5 meta-rationalist, and everything will start making sense. If you don't "get it", you will probably construct some Kegan-level-4 rationalist verbal argument for why it doesn't make sense at all.
I'm curious whether you see any similarity between holons and object-oriented ontology (if you're at all familiar with that).
I was vibing with object-oriented ontology when I wrote this, particularly the "nontrivial implication" at the end.
Here's my terrible summary of OOO:
I find OOO to be an odd mix of interesting ideas and very weird ideas.
Feel free to ignore the OOO comparison if it's not a terribly useful comparison for holons.