Comments

The Speed + Simplicity Prior is probably anti-deceptive

I think you might be conflating two different scenarios?

I present an alternative strategy for a mesaoptimizer that, yes, wasn't in the post, but I don't see why that matters?

Is your claim that that’s basically nothing?

Yes, or at least it approaches relatively nothing as the optimizer becomes more competent.

If it’s always super easy to identify the base objective and then optimize it, then it shouldn’t need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective.

If it doesn't store it, it isn't aligned - it would be able to derive that humans want the base objective, but wouldn't want to optimize it itself.
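(For concreteness, here is a structural sketch of the alternative strategy I mean. The function names follow the post's notation, but the stub bodies are made up purely for illustration, not a real implementation.)

```python
# Structural sketch only: names like is_deployed_in_world / get_base_obj follow
# the post's notation; the stub bodies are placeholders I made up.

MESA_OBJECTIVE = "mesa-objective"  # the model's own terminal goal

def is_deployed_in_world(env) -> bool:
    return env.get("deployed", False)        # stub: hardcoded deployment check

def get_base_obj(env) -> str:
    return env["base_objective"]             # stub: what humans want here

def find(objective: str) -> str:
    return objective                         # stub: runtime derivation step

def mesaoptimize(env, objective: str) -> str:
    return f"optimizing {objective!r} in {env['name']}"   # stub optimizer

def alternative_mesaoptimizer(env) -> str:
    """Never stores the base objective, so it pays no c(get_base_obj) complexity."""
    if is_deployed_in_world(env):
        # Once deployed, pursue its own objective.
        return mesaoptimize(env, objective=MESA_OBJECTIVE)
    # In training: derive what humans want (without terminally wanting it)
    # and optimize that, to look aligned.
    return mesaoptimize(env, objective=find(get_base_obj(env)))

print(alternative_mesaoptimizer({"name": "training", "base_objective": "base objective"}))
print(alternative_mesaoptimizer({"name": "deployment", "deployed": True, "base_objective": "base objective"}))
```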

The Speed + Simplicity Prior is probably anti-deceptive

I think speed and complexity trade off against the model's precision? Like, a simple objective you can hardcode, but for a complex one, any space spent storing the base objective is space you can't spend on a table of chemical elements or something. So I would expect t(mesaoptimize_env) for the base objective to be only slightly greater than t(is_deployed_in_world) + t(find(get_base_obj(env))) + t(mesaoptimize_env) for a mesaoptimizer that hardcodes the deployment check but computes the base objective, because I expect t(find(get_base_obj(env))) to be mostly contained in t(mesaoptimize_env). Hmm, except maybe if you somehow specifically penalize the difference in speed between instrumental and terminal objectives? But then the mesaoptimizer would just find other instrumental ways.
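(As a toy version of that arithmetic, with numbers that are entirely made up: the t(...) terms follow the post's notation, and the overlap parameter encodes my guess that finding the base objective is mostly work the optimizer does anyway.)

```python
# Toy arithmetic (hypothetical numbers): runtime of an aligned optimizer vs a
# mesaoptimizer that hardcodes a deployment check but computes the base
# objective at runtime. "overlap" is the fraction of t(find(get_base_obj(env)))
# that is already part of t(mesaoptimize_env) - my guess, not a measured value.

t_mesaoptimize_env = 1_000_000   # bulk of the work: optimizing the environment
t_is_deployed_in_world = 10      # cheap hardcoded check
t_find_base_obj = 5_000          # deriving what the base objective is
overlap = 0.9                    # how much of that derivation happens anyway

t_aligned = t_mesaoptimize_env
t_deceptive = (t_is_deployed_in_world
               + (1 - overlap) * t_find_base_obj   # only the non-shared part is extra
               + t_mesaoptimize_env)

print(f"aligned:   {t_aligned:,.0f}")
print(f"deceptive: {t_deceptive:,.0f}")
print(f"relative speed penalty: {(t_deceptive - t_aligned) / t_aligned:.3%}")
# With these made-up numbers the penalty is a small fraction of a percent, which
# is the sense in which I expect the speed difference to be basically nothing.
```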

Late 2021 MIRI Conversations: AMA / Discussion

It was all very interesting, but what was the goal of these discussions? I mean, I had the impression that pretty much everyone assigned >5% probability to "if we scale, we all die", which is already enough reason to work on global coordination on safety. Is the reasoning that the same mental process that assigned too low a probability would not be able to come up with an actual solution? Or something like "by the time they think their solution has reduced the probability of failure from 5% to 0.1%, it would still be much higher"? That seems possible only if people don't understand the arguments about inner optimizers or whatnot, as opposed to disagreeing with them.

Response to "What does the universal prior actually look like?"

To clarify: sufficient observations would still falsify all "simulate simple physics, start reading from a simple location" programs and eventually promote "simulate true physics, start reading from the camera's location" ones?
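(To make the question concrete, the toy picture I have in mind is something like the sketch below. It's a made-up miniature, not the actual universal prior: a few hand-picked "programs" weighted by 2^-length, where observations falsify the ones whose output streams disagree.)

```python
# Miniature, made-up model of the question: a handful of "programs", each with a
# description length and a deterministic output stream. Observations falsify the
# programs that disagree; the remaining prior mass (2^-length) is what is left to
# "promote" the survivors. Not the real universal prior, just the toy I imagine.

programs = {
    # name: (description_length_in_bits, predicted_observation_stream)
    "simple physics, simple read location":  (10, "AAAAAAAA"),
    "simple physics, other simple location": (12, "BBBBBBBB"),
    "true physics, camera location":         (40, "ABBABBAB"),
}

observed = "ABBABBAB"   # what the camera actually reports, in this toy

def posterior(observed_prefix: str):
    weights = {name: 2.0 ** -length
               for name, (length, stream) in programs.items()
               if stream.startswith(observed_prefix)}   # falsify mismatches
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()} if total else {}

for n in [0, 1, 2, 8]:
    print(f"after {n} observations: {posterior(observed[:n])}")
# With no data the short programs dominate; once the observations disagree with
# them, all the remaining mass sits on "true physics, camera location".
```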