I think my argument only gets stronger if you assume that the mesaobjective is a large pile of heuristics built into the mesaoptimization algorithm, since that takes up much more space.
In the traditional deceptive mesaoptimization story, the model needs to at some point switch from "pursuing objective X directly" to "pursuing objective Y indirectly by deceptively pursuing objective X". I agree that, if there isn't really a core "mesaoptimizer" that can have goals swapped out, the idea of seamlessly transitioning between the two is very unlikely, since you initially lack the heuristics for "pursuing objective Y".
I'm not sure whether you're arguing that my post fails to imply the speed prior disincentivizes deceptive mesaoptimization, or whether you're arguing that deceptive mesaoptimization isn't likely in the first place.
What would it want to optimize, then, according to you? I’m claiming that “I want to figure out what the environment wants and then do it” is a simpler goal than “I want X, and to get that I’m going to figure out what the environment wants and then do it”.
Re using both, you’re right, if you make your other assumptions then using both could work. (Though it does still have higher simplicity cost.)
I think you might be conflating two different scenarios? t(is_deployed_in_world) and t(find(get_base_obj)) are not simultaneously present in either case.
That said, I'll still do my best to respond. By definition, t(find(get_base_obj(env))) is whatever part of t(mesaoptimize_env) that Hfh doesn't have to do. Is your claim that that's basically nothing? If it's always super easy to identify the base objective and then optimize it, then Hfh shouldn't need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective. In that case, both Hfd and Hsd still lose due to needing to pay for c(simple_objective) and, respectively, c(is_deployed_in_world) or t(mesaoptimize_outer_env).
I'm not saying these costs are exorbitant, but they are strictly greater than not paying them - so the prior would never incentivize paying them, and always slightly disincentivize doing so. (I.e. they'd never arise naturally from optimization pressure alone.)
I want to push back on your "can't make an unbreakable wall" metaphor. We have an unbreakable wall like that today where two super-powerful beings are just hanging out sharing earth; it's called the survivable nuclear second-strike capability.
(For clarity, here I'll assume that aligned AGI-cohort A and unaligned AGI-cohort B have both FOOMed and have nanotech.) There isn't obviously an infinite amount of energy available for B to destroy every last trace of A. This is just like how in our world, neither the US nor Russia has enough resources to be certain that it could destroy all of its opponent's nuclear capabilities in a first strike. If any of the Americans' nuclear capabilities survive a Russian first strike, those remaining American forces' objective switches from "uphold the constitution" to "destroy the enemy no matter the cost, to follow through on tit-for-tat". Humans are notoriously bad at this kind of precommitment-to-revenge-amid-the-ashes-of-civilization, but AGIs/their nanotech can probably be much more credible.
Note the key thing here: once B attempts to destroy A, A is no longer "bound" by the constraints of being an aligned agent. Its objective function switches to being just as ruthless as B's (or more so), and so raw post-first-strike power/intelligence on each side becomes a much more reasonable predictor of who will win.
If B knows A is playing tit-for-tat, and A has done the rational thing of creating a trillion redundant copies of itself (each of which will also play tit-for-tat) so they couldn't all be eliminated in one strike without prior detection, then B has a clear incentive not to pick a fight it is highly uncertain it can win.
One counterargument you might have: maybe offensive/undetectable nanotech is strictly favored over defensive/detection nanotech. If you assign nontrivial probability to the statement: "it is possible to destroy 100% of a nanotech-wielding defender with absolutely no previously-detectable traces of offensive build-up, even though the defender had huge incentives to invest in detection", then my argument doesn't hold. I'd definitely be interested in your (or others') justification as to why.
Consequences of this line of argument:
Very interested to hear feedback! (/whether I should also put this somewhere else.)
One datapoint I really liked about this: https://arxiv.org/abs/2104.03113 (Scaling Laws for Board Games). They train AlphaZero agents of different sizes to compete at the board game Hex.
The approximate takeaway, quoting the author: “if you are in the linearly-increasing regime [where return on compute is nontrivial], then you will need about 2× as much compute as your opponent to beat them 2/3 of the time.”
This might suggest that, absent additional asymmetries (like constraints on the aligned AIs massively hampering them), the win ratio may be roughly proportional to the compute ratio. If you assume we can get global data center governance, I’d consider that a sign in favor of the world’s governments. (Whether you think that’s good is a political stance that I believe folks here may disagree on.)
Bonus quote: “This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins. In this toy model, doubling your compute doubles how many random numbers you draw, and the probability that you possess the largest number is 2/3. This suggests that the complex game play of Hex might actually reduce to each agent having a ‘pool’ of strategies proportional to its compute, and whoever picks the better strategy wins. While on the basis of the evidence presented herein we can only consider this to be serendipity, we are keen to see whether the same behaviour holds in other games.”
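For what it's worth, the toy model in that quote is easy to check numerically. Below is a minimal sketch, not anything from the paper: the function names, the choice of uniform draws, and the "one draw per unit of compute" convention are all my own assumptions. With i.i.d. continuous draws, each of the a + b numbers is equally likely to be the largest, so the side holding a of them wins with probability a / (a + b), which gives 2/3 at a 2:1 compute ratio.

```python
import random


def exact_win_probability(compute_a: int, compute_b: int) -> float:
    """Closed form for the toy model: every draw is equally likely to be
    the maximum, so A wins with probability a / (a + b)."""
    return compute_a / (compute_a + compute_b)


def simulated_win_probability(
    compute_a: int, compute_b: int, trials: int = 20_000, seed: int = 0
) -> float:
    """Monte Carlo estimate of the same quantity.

    Assumption (mine, for illustration): one uniform(0, 1) draw per unit
    of compute, and the player holding the single highest number wins.
    """
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(trials):
        best_a = max(rng.random() for _ in range(compute_a))
        best_b = max(rng.random() for _ in range(compute_b))
        if best_a > best_b:
            wins_a += 1
    return wins_a / trials


if __name__ == "__main__":
    for a, b in [(1, 1), (2, 1), (4, 1), (10, 1)]:
        exact = exact_win_probability(a, b)
        estimate = simulated_win_probability(a, b)
        print(f"compute {a}:{b}  exact={exact:.3f}  simulated={estimate:.3f}")
```

Note that under this toy model the win probability is a / (a + b) rather than literally proportional to the compute ratio, so a 10:1 compute advantage would still lose roughly one game in eleven. Whether that carries over to anything beyond Hex is exactly the open question the author flags.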
To clarify, this is intended to be a test-time objective; I'm assuming the system was trained in simulation and/or by observing the environment. In general, this reward wouldn't need to be "trained" – it could just be hardcoded into the system. If you're asking how the system would understand its reward without having experienced it already, I'm assuming that sufficiently-advanced AIs have the ability to "understand" their reward function and optimize on that basis. For example, "create two identical strawberries on the cellular level" can only be plausibly achieved via understanding, rather than encountering the reward often enough in simulation to learn from it, since it'd be so rare even in simulation.
Modern reinforcement learning systems receive a large positive reward (or, more commonly, an end to negative rewards) when ending the episode, and this incentivizes them to end the episode quickly (sometimes suicidally). If you only provide this "shutdown reward", I'd expect to see the same behavior, but only after a certain time period.