There was a recent in-depth post on reward hacking by @Kei (e.g. referencing this), who might have more to say about this question.
Though I also wanted to just add a quick comment about this part:
Now, it’s possible that, during o3’s RL CoT post-training, it got certain questions correct by lying and cheating. If so, that would indeed be reward hacking. But we don’t know if that happened at all.
It is not quite the same, but something that could partly explain lying is if models get the same amount of reward during training, e.g. 0, for a "wrong" solution as they get for saying something like "I don't know". That would encourage wrong solutions over admissions of ignorance, since a confident attempt at least has some chance of occasionally earning reward when the model arrives at the expected answer "by accident" (i.e. for the wrong reasons); see the sketch below the quote. At least something like that seems to be suggested by this:
The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated [reasoning] LLMs consistently claimed to have solved the problems [even if they didn't]. This discrepancy poses a significant challenge for mathematical applications of LLMs as mathematical results derived using these models cannot be trusted without rigorous human validation.
Source: Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
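To spell out the incentive (a minimal sketch under my own assumption of a sparse binary reward on the final answer, which may not match the actual training setup): if a confident attempt is correct with probability \(p > 0\), while "I don't know" is always scored 0, then

\[
\mathbb{E}[R \mid \text{confident attempt}] \;=\; p \cdot 1 + (1 - p) \cdot 0 \;=\; p \;>\; 0 \;=\; \mathbb{E}[R \mid \text{``I don't know''}],
\]

so training pressure favors always claiming a solution, even when the model has no reliable way of telling whether it is correct.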
Disclaimer: I haven't read the Logical Induction paper, which may explain my lack of intuition.
Is there maybe a way to explain your theory of Judgement more directly, just in epistemic terms of belief and probability theory, without falling back on instrumental-sounding trading analogies about buying and selling things at some price? Similar to the Radical Probabilism post perhaps? (Though that one also mentioned some instrumental arguments like money pumps / Dutch book arguments.)
Isn't honesty a part of the "truth machine" rather than the "good machine"? Confabulation seems to be a case of the model generating text which it doesn't "believe", in some sense.
What does it mean to optimize for the map to fit the territory, but not the other way around? (After all: we can improve fit between map and territory by changing either map or territory.) Maybe it's complicated, but primarily what it means is that the map is the part that's being selected in the optimization. When communicating, I'm not using my full agency to make my claims true; rather, I'm specifically selecting the claims to be true.
I don't know whether you are familiar with it, but most speech acts or writing acts are considered to have either a "word-to-world" direction of fit, e.g. statements, or a "world-to-word" direction of fit, e.g. commands. Only in the former case does the agent optimize the speech act ("word") to fit the world; in the latter case the agent optimizes the world to fit the speech act. The fit would be truth in the case of a statement, and execution in the case of a command.
There is an analogous but more basic distinction for intentional states ("propositional attitudes"), where the "intentionality" of a mental state is its aboutness. Some have a mind-to-world direction of fit, e.g. beliefs, while others have a world-to-mind direction of fit, e.g. desires or intentions. The former are satisfied when the mind is optimized to fit the world, the latter when the world is optimized to fit the mind.
(Speech acts seem to be honest only insofar as the speaker/writer holds an analogous intentional state. So someone who states that snow is white is only honest if they believe that snow is white. For lying, the speaker would, apart from being dishonest, also need a deceptive intention with the speech act, i.e. intending the listener to believe that the speaker believes that snow is white.)
So it seems in the above paragraph you are only considering the word-to-world / mind-to-world direction of fit?
It's interesting to note that we can still get Aumann's Agreement Theorem while abandoning the partition assumption (see Ignoring ignorance and agreeing to disagree, by Dov Samet). However, we still need Reflexivity and Transitivity for that result. Still, this gives some hope that we can do without the partition assumption without things getting too crazy.
I don't quite get this paragraph. Do you suggest that the failure of Aumann's agreement theorem would be "crazy"? I know his result has become widely accepted in some circles (including, I think, LessWrong), but
a) the conclusion of the theorem is highly counterintuitive, which should make us suspicious, and
b) it relies on Aumann's own specific formalization of "common knowledge" (mentioned under "alternative accounts" in the SEP), which may very well be fatally flawed and not instantiated in rational agents, let alone in actual ones.
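For reference, the standard statement (reconstructed from memory, so treat it as a sketch rather than an exact citation): if agents 1 and 2 share a common prior \(P\) with partitional information \(\mathcal{I}_1, \mathcal{I}_2\), and at the actual state \(\omega\) their posteriors for an event \(E\) are common knowledge, then those posteriors coincide:

\[
q_1 = P(E \mid \mathcal{I}_1(\omega)), \quad q_2 = P(E \mid \mathcal{I}_2(\omega)) \ \text{common knowledge at } \omega \;\Longrightarrow\; q_1 = q_2 .
\]

Both the counterintuitive "no rational disagreement" conclusion (a) and the load-bearing common-knowledge operator (b) are visible in this formulation.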
It has always baffled me that some people (including economists and LW-style rationalists) celebrate a result which relies on the (as you argued) highly questionable concept of common knowledge, or at least on one specific formalization of it.
To be clear, rejecting Aumann's account of common knowledge would make his proof unsound (albeit still valid), but it would not solve the general "disagreement paradox", the counterintuitive conclusion that rational disagreements seem to be impossible: there are several other arguments which lead to this conclusion and which do not rely on any notion of common knowledge. (Such as this essay by Richard Feldman, which is quite well-known in philosophy and which makes only very weak assumptions.)
I think the most dangerous version of 3 is a sort of Chesterton's fence situation, where people get rid of seemingly unjustified social norms without realizing that those norms were socially beneficial. (The decline in high-g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and even when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn't require holding on to the mistaken belief.
Do you have an example for 4? It seems rather abstract and contrived.
Generally, I think the value of believing true things is almost always positive. Examples to the contrary seem mostly contrived (basilisk-like infohazards) or occur relatively rarely. (E.g. believing a falsehood makes you more convincing when asserting it, since you then don't technically have to lie, but lying is mostly bad or not very useful anyway.)
Overall, I think the risks from philosophical progress aren't very serious, while the opportunities are quite large, so the expected value looks comfortably positive.