Lukas Finnveden

Comments

EDT with updating double counts

Interesting! Here's one way to look at this:

  • EDT+SSA-with-a-minimal-reference-class behaves like UDT in anthropic dilemmas where updatelessness doesn't matter.
  • I think SSA with a minimal reference class is roughly equivalent to "notice that you exist; exclude all possible worlds where you don't exist; renormalize".
  • In large worlds where your observations have sufficient randomness that observers of all kinds exist in all worlds, the SSA update step cannot exclude any world. You're updateless by default. (This is the case in the 99% example above.)
  • In small or sufficiently deterministic worlds, the SSA update step can exclude some possible worlds.
    • In "normal" situations, the fact that it excludes worlds where you don't exist doesn't have any implications for your decisions — because your actions will normally not have any effects in worlds where you don't exist.
    • But in situations like transparent Newcomb problems, this means that you will now not care about non-existent copies of yourself.

Basically, EDT behaves fine without updating. Excluding worlds where you don't exist is one kind of updating that you can do that doesn't change your behavior in normal situations. Whether you do this or not will determine whether you act updateless in situations like transparent Newcomb that happen in small or sufficiently deterministic worlds. (In large and sufficiently random worlds, you'll act updateless regardless.)
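
As a toy illustration of that exclude-and-renormalize step (a minimal sketch with made-up world names and priors, not anything from the post):

```python
# Minimal sketch of SSA with a minimal reference class, viewed as:
# "notice that you exist; exclude all possible worlds where you don't exist; renormalize".

def ssa_minimal_reference_class_update(prior, you_exist):
    """Drop worlds where you don't exist, then renormalize the rest."""
    kept = {w: p for w, p in prior.items() if you_exist[w]}
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

# Small/deterministic world: some hypotheses imply you don't exist at all,
# so the update genuinely excludes them.
print(ssa_minimal_reference_class_update(
    prior={"A": 0.5, "B": 0.5},
    you_exist={"A": True, "B": False},
))  # {'A': 1.0}

# Large/random world: observers like you exist under every hypothesis,
# so nothing is excluded and you're "updateless by default".
print(ssa_minimal_reference_class_update(
    prior={"A": 0.5, "B": 0.5},
    you_exist={"A": True, "B": True},
))  # {'A': 0.5, 'B': 0.5}
```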

Viewed like this, the SSA part of EDT+SSA looks unnecessary and strange. Especially since I think you do want to act updateless in situations like transparent Newcomb.

The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument

Re your edit: That bit seems roughly correct to me.

If we are in a simulation, SIA doesn't have strong views on late filters for unsimulated reality. (This is my question (B) above.) And since SIA thinks we're almost certainly in a simulation, it's not crazy to say that SIA doesn't have strong views on late filters for unsimulated reality. SIA is very ok with small late filters, as long as we live in a simulation, which SIA says we probably do.

But yeah, it is a little bit confusing, in that we care more about late-filters-in-unsimulated-reality if we live in unsimulated reality. And in the (unlikely) case that we do, then we should ask my question (C) above, in which case SIA does have strong views on late filters.

The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument

I think it's important to be clear about what SIA says in different situations, here. Consider the following 4 questions:

A) Do we live in a simulation?

B) If we live in a simulation, should we expect basement reality to have a large late filter?

C) If we live in basement reality, should we expect basement reality (ie our world) to have a large late filter?

D) If we live in a simulation, should we expect the simulation (ie our world) to have a large late filter?

In this post, you persuasively argue that SIA answers "yes" to (A) and "not necessarily" to (B). However, (B) is almost never decision-relevant, since it's not about our own world. What about (C) and (D)? (It's easier to see how those could be decision-relevant for someone who buys SIA. I personally agree with you that something like Anthropic Decision Theory is the best way to reason about decisions, but responsible usage of SIA+CDT is one way to get there in anthropic dilemmas.)

To answer (C): If we condition on living in basement reality, then SIA favors hypotheses that imply many observers in basement reality. The simulated copies are entirely irrelevant, since we have conditioned them away. (You can verify this with Bayes' theorem.) So we are back with the SIA doomsday argument again, and we face large late filters.
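
One way to write out that Bayes'-theorem check, in notation of my own rather than anything from the post: under SIA, the evidence $E$ that an observer in my exact epistemic situation exists, and that I am one, gives

$$P(w \mid E) \propto P(w)\,N(w),$$

where $N(w)$ counts observers in my situation in world $w$. Conditioning further on being in basement reality multiplies each world's weight by the fraction of those observers who are unsimulated, $N_{\mathrm{base}}(w)/N(w)$, so

$$P(w \mid E, \text{basement}) \propto P(w)\,N(w)\cdot\frac{N_{\mathrm{base}}(w)}{N(w)} = P(w)\,N_{\mathrm{base}}(w).$$

The simulated copies cancel out, which is why the SIA doomsday reasoning reappears once we condition on (C).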

To answer (D): Detailed simulations of civilisations that spread to the stars are vastly more expensive than detailed simulations of early civilizations. This means that the latter are likely to be far more common, and we're almost certainly living in a simulation where we'll never spread to the (simulated) stars. (This is plausibly because the simulation will be turned off before we get the chance.) You could discuss what terminology to use for this, but I'd be inclined to call this a large late filter, too.

So my preferred framing isn't really that the simulation hypothesis "undercuts" the SIA doomsday argument. It's rather that the simulation hypothesis provides one plausible mechanism for it: that we're in a simulation that will end soon. But that's just a question of framing/terminology. The main point of this comment is to provide answers to questions (C) and (D).

[AN #156]: The scaling hypothesis: a plan for building AGI
(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)

I'd like to know what this figure is based on. In the linked post, Gwern writes:

The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character.

But in that linked post, there's no mention of "0.7" bits in particular, as far as I or cmd-f can see. The most relevant passage I've read is:

Claude Shannon found that each character was carrying more like 1 (0.6-1.3) bit of unguessable information (differing from genre to genre); Hamid Moradi found 1.62-2.28 bits on various books; Brown et al 1992 found <1.72 bits; Teahan & Cleary 1996 got 1.46; Cover & King 1978 came up with 1.3 bits; and Behr et al 2002 found 1.6 bits for English and that compressibility was similar to this when using translations in Arabic/Chinese/French/Greek/Japanese/Korean/Russian/Spanish (with Japanese as an outlier). In practice, existing algorithms can make it down to just 2 bits to represent a character, and theory suggests the true entropy was around 0.8 bits per character.

I'm not sure what the relationship is between supposedly unguessable information and human performance, but assuming that all these sources were actually just estimating human performance, and without looking into the sources more... this isn't just lots of uncertainty, but vast amounts of uncertainty, where it's very plausible that GPT-3 has already beaten humans. This wouldn't be that surprising, given that GPT-3 must have memorised a lot of statistical information about how common various words are, which humans certainly don't know by default.
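
(For reference, the quantity being compared here, in notation of my own: a predictor $Q$'s per-character loss on text drawn from the true distribution $P$ is the cross-entropy

$$H(P, Q) = -\,\mathbb{E}_{x \sim P}\!\left[\log_2 Q(x \mid \text{context})\right] \text{ bits per character},$$

which is lower-bounded by the entropy rate of the text itself; the guessing-game experiments quoted above estimate bounds on that entropy rate, rather than directly measuring any particular predictor's loss.)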

I have a lot of respect for people looking into a literature like this and forming their own subjective guess, but it'd be good to know if that's what happened here, or if there is some source that pinpoints 0.7 in particular as a good estimate.

Decoupling deliberation from competition

Thanks, computer-speed deliberation being a lot faster than space-colonisation makes sense. I think any deliberation process that uses biological humans as a crucial input would be a lot slower, though; slow enough that it could well be faster to get started with maximally fast space colonisation. Do you agree with that? (I'm a bit surprised at the claim that colonization takes place over "millennia" at technological maturity; even if the travelling takes millennia, it's not clear to me why launching something maximally-fast – that you presumably already know how to build, at technological maturity – would take millennia. Though maybe you could argue that millennia-scale travelling time implies millennia-scale variance in your arrival-time, in which case launching decades or centuries after your competitors doesn't cost you too much expected space?)

If you do agree, I'd infer that your mainline expectation is that we successfully enforce a worldwide pause before mature space-colonisation, since the OP suggests that biological humans are likely to be a significant input into the deliberation process, and since you think that the beaming-out-info schemes are pretty unlikely.

(I take your point that, as far as space-colonisation is concerned, such a pause probably isn't strictly necessary.)

Decoupling deliberation from competition

I'm curious about how this interacts with space colonisation. The default path of efficient competition would likely lead to maximally fast space-colonisation, to prevent others from grabbing it first. But this would make deliberating together with other humans a lot trickier, since some space ships would go to places where they could never again communicate with each other. For things to turn out ok, I think you need one of the following:

  • to pause before space colonisation.
  • to finish deliberating and bargaining before space colonisation.
  • to equip each space ship with the information necessary for deciding what to do with the space they grab. In order of increasing ambitiousness:
    • You could upload a few leaders' or owners' brains (or excellent predictive models thereof) and send them along with their respective colonisation ships, hoping that they will individually reach good decisions without discussing with the rest of humanity.
    • You could also equip each colonisation ship with the uploads of all other human brains that they might want to deliberate with (or excellent predictive models thereof), so that they can use those other humans as discussion partners and data for their deliberation-efforts.
    • You could also set up these uploads in a way that makes them figure out what bargain would have been struck on Earth, and then have each space ship individually implement this. Maybe this happens by default with acausal trade; or maybe everyone in some reasonably big coalition could decide to follow the decision of some specified deliberative process that they don't have time to run on Earth.
  • to use some communication scheme that lets you send your space ships ahead to compete in space, and then lets you send instructions to your own ships once you've finished deliberating on Earth.
    • E.g. maybe you could use cryptography to ensure that your space ships will follow instructions signed with the right code, which you only send out once you've finished bargaining; see the sketch just after this list. (Though I'm not sure if your bargaining-partners would be able to verify how your space ships would react to any particular message, so maybe this wouldn't work without significant prior coordination.)
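
A minimal sketch of that signing idea (my own illustration using the Python `cryptography` package, not something from the post), just to show the basic mechanism of ships accepting only late-arriving, correctly signed instructions:

```python
# Sketch: ships carry a public key from before launch; Earth signs the final
# instructions after bargaining finishes; ships verify before acting.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Before launch: generate a keypair; the private key stays on Earth,
# the public key is baked into every ship.
private_key = Ed25519PrivateKey.generate()
ship_public_key = private_key.public_key()

# After deliberation/bargaining on Earth: sign the final instructions.
instructions = b"Final bargain: use the grabbed resources as follows ..."
signature = private_key.sign(instructions)

# On the ship, possibly much later: accept only correctly signed messages.
def ship_accepts(message: bytes, sig: bytes) -> bool:
    try:
        ship_public_key.verify(sig, message)
        return True
    except InvalidSignature:
        return False

assert ship_accepts(instructions, signature)
assert not ship_accepts(b"tampered instructions", signature)
```

(This only covers the basic mechanism; it doesn't address the problem noted in the parenthetical above – bargaining partners still can't verify how a ship would respond to any given signed message without more prior coordination.)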

I'm curious whether you're optimistic about any of these options, or if you have something else in mind.

(Also, all of this assumes that defensive capabilities are a lot stronger than offensive capabilities in space. If offense is comparably strong, then we also have the problem that the cosmic commons might be burned in wars if we don't pause or reach some other agreement before space colonisation.)

The strategy-stealing assumption

Categorising the ways that the strategy-stealing assumption can fail:

  • Humans don't just care about acquiring flexible long-term influence, because
    • 4. They also want to stay alive.
    • 5 and 6. They want to stay in touch with the rest of the world without going insane.
    • 11. and also they just have a lot of other preferences.
    • (maybe Wei Dai's point about logical time also goes here)
  • It is intrinsically easier to gather flexible influence in pursuit of some goals, because
    • 1. It's easier to build AIs to pursue goals that are easy to check.
    • 3. It's easier to build institutions to pursue goals that are easy to check.
    • 9. It's easier to coordinate around simpler goals.
    • plus 4 and 5 insofar as some values require continuously surviving humans to know what to eventually spend resources on, and some don't.
    • plus 6 insofar as humans are otherwise an important part of the strategic environment, such that it's beneficial to have values that are easy-to-argue.
  • Jessica Taylor's argument requires that the relevant games are zero-sum. Since this isn't true in the real world:
    • 7. A threat of destroying value (e.g. by threatening extinction) could be used as a bargaining tool, with unpredictable outcomes.
    • ~8. Some groups actively want other groups to have fewer resources, in which case they can try to reduce the total amount of resources more or less actively.
    • ~8. Smaller groups have less incentive to contribute to public goods (such as not increasing the probability of extinction), but benefit equally from larger groups' contributions, which may lead to them getting a disproportionate fraction of resources by defecting in public-goods games.

Imitative Generalisation (AKA 'Learning the Prior')

Starting with amplification as a baseline, am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties?

My understanding: After going through the process of finding z, you'll have a z that's probably too large for the human to fully utilise on their own, so you'll want to use amplification or debate to access it (as well as to generally help the human reason). If we didn't have z, we could train an amplification/debate system on D' anyway, while allowing the human and AIs to browse through D for any information that they need. I don't see how the existence of z makes amplification or debate any more aligned, but it seems plausible that it could improve competitiveness a lot. Is that the intention?

Bonus question: Is the intention only to boost efficiency, or do you think that IA will fundamentally allow amplification to solve more problems? (I.e., solve more problems with non-ridiculous amounts of compute – I'd be happy to count an exponential speedup as the latter.)

Prediction can be Outer Aligned at Optimum

Cool, seems reasonable. Here are some minor responses (perhaps unwisely, given that we're in a semantics labyrinth):

Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation

Idk, if the real world is a simulation made by malign simulators, I wouldn't say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I'm in even if it's simulated. The simulators control everything that happens anyway, so if they want our AIs to behave in some particular way, they can always just make them do that no matter what we do.

you are changing the definition of outer alignment if you think it assumes we aren't in a simulation

Fwiw, I think this is true for a definition that always assumes that we're outside a simulation, but I think it's in line with previous definitions to say that the AI should think we're not in a simulation iff we're not in a simulation. That's just stipulating unrealistically competent prediction. Another way to look at it is that in the limit of infinite in-distribution data, an AI may well never be able to tell whether we're in the real world or in a simulation that's identical to the real world; but it would be able to tell whether we're in a simulation with simulators who actually intervene, because it would see them intervening somewhere in its infinite dataset. And that's the type of simulators that we care about. So definitions of outer alignment that appeal to infinite data automatically assume that AIs would be able to tell the difference between worlds that are functionally like the real world, and worlds with intervening simulators.

And then, yeah, in practice I agree we won't be able to learn whether we're in a simulation or not, because we can't guarantee in-distribution data. So this is largely semantics. But I do think definitions like this end up being practically useful, because convincing the agent that it's not individually being simulated is already an inner alignment issue, for malign-prior-reasons, and this is very similar.

Prediction can be Outer Aligned at Optimum
Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous.

Yup.

I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve

I mean, it's true that I'm mostly just trying to clarify terminology. But I'm not necessarily trying to propose a new definition – I'm saying that the existing definition already implies that malign priors are an inner alignment problem, rather than an issue with outer alignment. Evan's footnote requires the model to perform optimally on everything it actually encounters in the real world (rather than asking it to do as well as it can across the multiverse, given its training data); so that definition doesn't have a problem with malign priors. And as Richard notes here, common usage of "inner alignment" refers to any case where the model performs well on the training data but is misaligned during deployment, which definitely includes problems with malign priors. And per Rohin's comment on this post, apparently he already agrees that malign priors are an inner alignment problem.

Basically, the main point of the post is just that the 11 proposals post is wrong about mentioning malign priors as a problem with outer alignment. And then I attached 3 sections of musings that came up when trying to write that :)
