james.lucassen

Sorted by New

# Wiki Contributions

I think so. But I'd want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.

Ok this is going to be messy but let me try to convey my hunch for why randomization doesn't seem very useful.

- Say I have an intervention that's helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 "unit of helpfulness", and a 4x update towards manipulative AGI.
- Now let's say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutually exclusive. Which means that if I'd done no simulation at all, I would've gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
- Ok, well, what if my four interventions 1/8 baseline probability each, so only 50% total. Then I pick one at random, p(O | natural) = 1/8, p(O | manipulative) = 1/4, so I get a 2x update towards manipulative AGI. This is the same as if I'd just conditioned on the statement "one of my four interventions happens", and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update.

Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn't doing anything different from just using a weaker search condition - it gives up bits of search, and so it has to pay less.

I'm pretty nervous about simulating unlikely counterfactuals because the solomonoff prior is malign. The worry is that the most likely world conditional on "no sims" isn't "weird Butlerian religion that still studies AI alignment", it's something more like "deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world".

In general, we know (assume) that our current world is safe. When we consider futures which only recieve a small sliver of probability from our current world, those futures will tend to have bigger chunks of their probability coming from other pasts. Some of these are safe, like the Butlerian one, but I wouldn't be surprised if they were almost always dangerous.

Making a worst-case assumption, I want to only simulate worlds that are decently probable given today's state, which makes me lean more towards trying to implement HCH.

• Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous - we can't inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
• The patient research strategy is a bit weird, because the people we're simulating to do our research for us are counterfactual copies of us - they're probably going to want to run simulations too. Disallowing simulation as a whole seems likely to lead to weird behavior, but maybe we just disallow using simulated research to answer the whole alignment problem? Then our simulated researchers can only make more simulated researchers to investigate sub-questions, and eventually it bottoms out. But wait, now we're just running Earth-scale HCH (ECE?)
• Actually, that could help deal with the robustness-to-arbitrary-conditions and long-time-spans problems. Why don't we just use our generative model to run HCH?

Proposed toy examples for G:

• G is "the door opens", a- is "push door", a+ is "some weird complicated doorknob with a lock". Pretty much any b- can open a-, but only a very specific key+manipulator combo opens a+. a+ is much more informative about successful b than a- is.
• G is "I make a million dollars", a- is "straightforward boring investing", a+ is "buy a lottery ticket". A wide variety of different world-histories b can satisfy a-, as long as the markets are favorable - but a very narrow slice can satisfy a+. a+ is a more fragile strategy (relative to noise in b) than a- is.

I don't think I understand how the scorecard works. From:

[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.

And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?

If the scorecard is learned, then it needs a training signal from Steering. But if it's useless at the start, it can't provide a training signal. On the other hand, since the "ontology" of the Learning subsystem is learned-from-scratch, then it seems difficult for a hard-coded scorecard to do this translation task.