Ok this is going to be messy but let me try to convey my hunch for why randomization doesn't seem very useful.
- Say I have an intervention that's helpful, and has a baseline probability of 1/4. If I condition on it happening, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.
- Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutually exclusive. Which means that if I'd done no simulation at all, I would've gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
- Ok, well, what if my four interventions have 1/8 baseline probability each, so only 50% total? Then I pick one at random: p(O | natural) = 1/8, p(O | manipulative) = 1/4, so I get a 2x update towards manipulative AGI. This is the same as if I'd just conditioned on the statement "one of my four interventions happens", and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update.
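To sanity-check the arithmetic in the bullets above, here's a rough sketch. It assumes a toy model that the bullets only imply: a manipulative world always produces the outcome we asked for when we condition directly, and always produces exactly one of the k candidate outcomes (each equally likely from our point of view) when we randomize. The function names are made up for illustration.

```python
from fractions import Fraction


def update_direct(baseline):
    """Likelihood ratio towards 'manipulative' when conditioning on one outcome.

    Assumption: p(O | manipulative) = 1, p(O | natural) = baseline.
    """
    return Fraction(1) / baseline


def update_randomized(k, baseline_each):
    """Likelihood ratio when we pick one of k mutually exclusive outcomes at random.

    Assumption: p(O | manipulative) = 1/k, p(O | natural) = baseline_each.
    """
    return Fraction(1, k) / baseline_each


def update_union(k, baseline_each):
    """Likelihood ratio when conditioning on 'one of the k outcomes happens'.

    Assumption: p(union | manipulative) = 1, p(union | natural) = k * baseline_each.
    """
    return Fraction(1) / (k * baseline_each)


print(update_direct(Fraction(1, 4)))         # single 1/4 intervention
print(update_randomized(4, Fraction(1, 4)))  # four 1/4 interventions, pick one
print(update_randomized(4, Fraction(1, 8)))  # four 1/8 interventions, pick one
print(update_union(4, Fraction(1, 8)))       # condition on the union instead
```

Under those assumptions, the randomized and union versions give the same ratio whenever the outcomes are mutually exclusive, which is the point of the last bullet.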
Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn't doing anything different from just using a weaker search condition - it gives up bits of search, and so it has to pay less.
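One way to make the "bits of search" framing concrete, as a sketch only: treat a condition with natural probability p as buying log2(1/p) bits of search, and assume (my toy assumption, not something established here) that a manipulative world satisfies the condition with probability 1, so the update towards manipulation costs the same number of bits. `bits_of_search` is a hypothetical helper name.

```python
import math


def bits_of_search(p_natural):
    """Bits of selection bought by conditioning on an event of probability p_natural.

    Under the toy assumption p(event | manipulative) = 1, this is also the
    log2 likelihood ratio towards 'manipulative'.
    """
    return math.log2(1.0 / p_natural)


strong = bits_of_search(0.25)  # condition on a single 1/4 outcome
weak = bits_of_search(0.5)     # condition on "one of four 1/8 outcomes"

print(strong, 2 ** strong)  # bits spent, and the matching update factor
print(weak, 2 ** weak)
```

In this picture the dice roll doesn't change the exchange rate: the weaker condition spends fewer bits and pays a proportionally smaller update.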
I'm pretty nervous about simulating unlikely counterfactuals because the Solomonoff prior is malign. The worry is that the most likely world conditional on "no sims" isn't "weird Butlerian religion that still studies AI alignment", it's something more like "deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world".
In general, we know (assume) that our current world is safe. When we consider futures which only receive a small sliver of probability from our current world, those futures will tend to have bigger chunks of their probability coming from other pasts. Some of these are safe, like the Butlerian one, but I wouldn't be surprised if they were almost always dangerous.
Making a worst-case assumption, I want to only simulate worlds that are decently probable given today's state, which makes me lean more towards trying to implement HCH.
Proposed toy examples for G:
I don't think I understand how the scorecard works. From:
[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.
And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?
If the scorecard is learned, then it needs a training signal from Steering. But if it's useless at the start, it can't provide a training signal. On the other hand, since the "ontology" of the Learning subsystem is learned from scratch, it seems difficult for a hard-coded scorecard to do this translation task.
I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.