I'm confused by your Fast Deceptive Mesaoptimiser example. How does the speed prior come in here? It isn't mentioned in the description.

I think a better plan looks something like "You can't open source a system until you've determined and disclosed the sorts of threat models your system will enable, and society has implemented measures to become robust to these threat models. Once any necessary measures have been implemented, you are free to open-source." 

The problem with this plan is that it assumes that there are easy ways to robustify the world. What if the only proper defense against bioweapons is complete monitoring of the entire internet? Perhaps this is something that we'd like to avoid. In this scenario, your plan would likely lead to someone coming up with a fake plan to robustify the world and then claiming that it'd be fine for them to release their model as open-source, because people really want to do open-source.

For example, in your plan you write:

Then you set a reasonable time-frame for the vulnerability to be patched: In the case of SHA-1, the patch was "stop using SHA-1" and the time-frame for implementing this was 90 days.

This is exactly the kind of plan that I'm worried about. People will be tempted to argue that surely four years is enough time for the biodefense plan to be implemented. Then four years roll around, the plan clearly isn't in place, and they push for release anyway.

I'll go into more detail later, but as an intuition pump imagine that: the best open source model is always 2 years behind the best proprietary model

You seem to have hypothesised what is to me an obviously unsafe scenario. Let's suppose our best proprietary models hit upon a dangerous bioweapon capability. Well, now we only have two years to prepare for it, regardless of whether this is completely wildly unrealistic. Worse, this occurs for each and every dangerous capability.

Will evaluators be able to anticipate and measure all of the novel harms from open source AI systems? Sadly, I’m not confident the answer is “yes,” and this is the main reason I only ~50% endorse this post.

When we're talking about risk management, a 50% chance that a key assumption will work out, when there isn't a good way to significantly reduce this uncertainty, often doesn't translate into a 50% chance of it being a good plan, but rather a near 0% chance.

I'm still pretty skeptical of what would happen without explicit focus. The Bletchley Park declaration was a super vague and applause-lighty declaration, which fortunately mentions issues of control, but just barely. It's not clear to me yet that this will end up receiving much dedicated focus.

Regarding biosecurity and cyber, my big worry here is open-source and it seems totally plausible that a government will pass mostly sensible regulation, then create a massive gaping hole where open-source regulation should be.

  • If I’d tried to assume that models are “trying” to get a low value for the training loss, I might have ended up relying on our ability to incentivize the model to make very long-term predictions. But I think that approach is basically a dead end.

Why do you believe that this is a dead-end?

Do phase transitions actually show up? So far, the places where theoretically predicted phase transitions are easiest to confirm are simplified settings like deep linear networks and toy models of superposition. For larger models, we expect phase transitions to be common but "hidden." Among our immediate priorities are testing just how common these transitions are and whether we can detect hidden transitions.


What do you mean by "hidden"?

Thinking this through.

There are a lot of ways in which speedrunning is like paperclip maximisation: speedrunning doesn't contribute to society, and further paperclips become useless once we've produced a certain amount.

I'm still confused by the analogy though. It seems like a lot of people do speedrunning for fun (though maybe you see it as more about status), while paperclip production isn't fun. I think this makes a difference: even though we don't want our society to produce absurd amounts of paperclips, we probably do want lots of niche ways to have fun.

Your comment focuses on GPT4 being "pretty good at extracting preferences from human data" when the stronger part of the argument seems to be that "it will also generally follow your intended directions, rather than what you literally said".

I agree with you that it was obvious in advance that a superintelligence would understand human value.

However, it sure sounded like you thought we'd have to specify each little detail of the value function. GPT4 seems to suggest that the biggest issue will be a situation where:

1) The AI has an option that would produce a lot of utility if you take one position on an exotic philosophical thought experiment and very little if you take the other side.
2) The existence of powerful AI means that the thought experiment is no longer exotic.

An alternative framing that might be useful: What do you see as the main bottleneck for people having better predictions of timelines?

Do you in fact think that having such a list is it?

What we need to find, for a given agent to be constrained by being a 'utility maximiser' is to consider it as having a member of a class of utility functions where the actions that are available to it systematically alter the expected utility available to it - for all utility functions within this class.

This sentence is extremely difficult for me to parse. Any chance you could clarify it?

In most situations, were these preferences over my store of dollars for example, this would seem to be outside the class of utility functions that would meaningfully constrain my action, since this function is not at all smooth over the resource in question.

Could you explain why smoothness is typically required for meaningfully constraining our actions?
