(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)
I'd like to know what this figure is based on. In the linked post, Gwern writes:
The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character.
But in that linked post, there's no mention of "0.7" bits in particular, as far as I or cmd-f can see. The most relevant passage I've read is:
Claude Shannon found that each character was carrying more like 1 (0.6-1.3) bit of unguessable information (differing from genre to genre8); Hamid Moradi found 1.62-2.28 bits on various books9; Brown et al 1992 found <1.72 bits; Teahan & Cleary 1996 got 1.46; Cover & King 1978 came up with 1.3 bits10; and Behr et al 2002 found 1.6 bits for English and that compressibility was similar to this when using translations in Arabic/Chinese/French/Greek/Japanese/Korean/Russian/Spanish (with Japanese as an outlier). In practice, existing algorithms can make it down to just 2 bits to represent a character, and theory suggests the true entropy was around 0.8 bits per character.11
I'm not sure what the relationship is between supposedly unguessable information and human performance, but assuming that all these sources were actually just estimating human performance, and without looking into the sources more... this isn't just lots of uncertainty, but vast amounts of uncertainty, where it's very plausible that GPT-3 has already beaten humans. This wouldn't be that surprising, given that GPT-3 must have memorised a lot of statistical information about how common various words are, which humans certainly don't know by default.
I have a lot of respect for people looking into a literature like this and forming their own subjective guess, but it'd be good to know if that's what happened here, or if there is some source that pinpoints 0.7 in particular as a good estimate.
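For concreteness, here is a minimal sketch of what "bits per character" measures, so that the figures quoted above are at least comparable in principle. The function names are hypothetical, and `chars_per_token` is an assumed parameter that depends on the tokeniser and corpus; nothing here reproduces any of the cited estimates.

```python
import math
from collections import Counter


def bits_per_character(char_probs):
    """Average number of bits of surprise per character.

    char_probs: the probability a predictor (human or model) assigned to
    each character that actually occurred, one entry per character.
    """
    if not char_probs:
        raise ValueError("need at least one probability")
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


def unigram_entropy_bits(text):
    """Order-0 entropy of a text: an upper-bound style estimate that
    ignores all context, so it is much higher than context-aware guesses."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def token_loss_to_bits_per_char(loss_nats_per_token, chars_per_token):
    """Convert a per-token cross-entropy in nats to bits per character.

    chars_per_token is an assumption supplied by the caller, not a
    property of any particular model.
    """
    return loss_nats_per_token / math.log(2) / chars_per_token
```

A predictor that assigns probability 0.5 to every character that occurs scores exactly 1 bit per character. Comparing a language model to humans on this scale requires converting its per-token loss with some assumed characters-per-token ratio, which is one more place where uncertainty enters.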
Thanks, computer-speed deliberation being a lot faster than space-colonisation makes sense. I think any deliberation process that uses biological humans as a crucial input would be a lot slower, though; slow enough that it could well be faster to get started with maximally fast space colonisation. Do you agree with that? (I'm a bit surprised at the claim that colonization takes place over "millennia" at technological maturity; even if the travelling takes millennia, it's not clear to me why launching something maximally-fast – that you presumably already know how to build, at technological maturity – would take millennia. Though maybe you could argue that millennia-scale travelling time implies millennia-scale variance in your arrival-time, in which case launching decades or centuries after your competitors doesn't cost you too much expected space?)
If you do agree, I'd infer that your mainline expectation is that we successfully enforce a worldwide pause before mature space-colonisation, since the OP suggests that biological humans are likely to be a significant input into the deliberation process, and since you think that the beaming-out-info schemes are pretty unlikely.
(I take your point that as far as space-colonisation is concerned, such a pause probably isn't strictly necessary.)
I'm curious about how this interacts with space colonisation. The default path of efficient competition would likely lead to maximally fast space-colonisation, to prevent others from grabbing it first. But this would make deliberating together with other humans a lot trickier, since some space ships would go to places where they could never again communicate with each other. For things to turn out ok, I think you either need:
I'm curious whether you're optimistic about any of these options, or if you have something else in mind.
(Also, all of this assumes that defensive capabilities are a lot stronger than offensive capabilities in space. If offense is comparably strong, then we also have the problem that the cosmic commons might be burned in wars if we don't pause or reach some other agreement before space colonisation.)
Categorising the ways that the strategy-stealing assumption can fail:
Starting with amplification as a baseline; am I correct to infer that imitative generalisation only boosts capabilities, and doesn't give you any additional safety properties? My understanding: After going through the process of finding z, you'll have a z that's probably too large for the human to fully utilise on their own, so you'll want to use amplification or debate to access it (as well as to generally help the human reason). If we didn't have z, we could train an amplification/debate system on D' anyway, while allowing the human and AIs to browse through D for any information that they need. I don't see how the existence of z makes amplification or debate any more aligned, but it seems plausible that it could improve competitiveness a lot. Is that the intention?
Bonus question: Is the intention only to boost efficiency, or do you think that IA will fundamentally allow amplification to solve more problems? (I.e., solve more problems with non-ridiculous amounts of compute – I'd be happy to count an exponential speedup as the latter.)
Cool, seems reasonable. Here are some minor responses: (perhaps unwisely, given that we're in a semantics labyrinth)
Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation
Idk, if the real world is a simulation made by malign simulators, I wouldn't say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I'm in even if it's simulated. The simulators control everything that happens anyway, so if they want our AIs to behave in some particular way, they can always just make them do that no matter what we do.
you are changing the definition of outer alignment if you think it assumes we aren't in a simulation
Fwiw, I think this is true for a definition that always assumes that we're outside a simulation, but I think it's in line with previous definitions to say that the AI should think we're not in a simulation iff we're not in a simulation. That's just stipulating unrealistically competent prediction. Another way to look at it is that in the limit of infinite in-distribution data, an AI may well never be able to tell whether we're in the real world or in a simulation that's identical to the real world; but it would be able to tell whether we're in a simulation with simulators who actually intervene, because it would see them intervening somewhere in its infinite dataset. And that's the type of simulators that we care about. So definitions of outer alignment that appeal to infinite data automatically assume that AIs would be able to tell the difference between worlds that are functionally like the real world, and worlds with intervening simulators.
And then, yeah, in practice I agree we won't be able to learn whether we're in a simulation or not, because we can't guarantee in-distribution data. So this is largely semantics. But I do think definitions like this end up being practically useful, because convincing the agent that it's not individually being simulated is already an inner alignment issue, for malign-prior-reasons, and this is very similar.
Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous.
I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve
I mean, it's true that I'm mostly just trying to clarify terminology. But I'm not necessarily trying to propose a new definition – I'm saying that the existing definition already implies that malign priors are an inner alignment problem, rather than an issue with outer alignment. Evan's footnote requires the model to perform optimally on everything it actually encounters in the real world (rather than asking it to do as well as it can across the multiverse, given its training data); so that definition doesn't have a problem with malign priors. And as Richard notes here, common usage of "inner alignment" refers to any case where the model performs well on the training data but is misaligned during deployment, which definitely includes problems with malign priors. And per Rohin's comment on this post, apparently he already agrees that malign priors are an inner alignment problem.
Basically, the main point of the post is just that the 11 proposals post is wrong about mentioning malign priors as a problem with outer alignment. And then I attached 3 sections of musings that came up when trying to write that :)
Things I believe about what sort of AI we want to build:
Things I believe about how to choose definitions:
Things I believe about what these candidate definitions would imply:
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the future
I don't think this is right. I've put my proposed modifications in cursive:
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past [we don't have ground-truth for the future, so we can't test how well humans can reason about it] and how well humans think it would generalise to the future. Then, we train a separate network to predict what humans with access to the previous network would predict about the future.
(It might be a good idea to share some parameters between the second and first network.)
Oops, I actually wasn't trying to discuss whether the action-space was wide enough to take over the world. Turns out concrete examples can be ambiguous too. I was trying to highlight whether the loss function and training method incentivised taking over the world or not.
Instead of an image-classifier, let's take GPT-3, which has a wide enough action-space to take over the world. Let's assume that:
1. GPT-3 is currently being tested on a validation set that has some correct answers. (I'm fine with "optimal performance" either requiring that GPT-3 magically returns these correct answers; or requiring that it returns some distribution along the lines that I defined in my post.)
2. The researchers will read what GPT-3 outputs, and there exists some string that causes them to go mad and give control over the world to GPT-3.
In this case, if we define optimal performance as "correctly predicting as many words as possible" or "achieve minimum total loss over the entire history of the world", I agree that optimal performance would plausibly involve taking over the world to feed itself the right questions. However, I think a better definition is "Correctly predicts every word it's asked to predict", because I think this tracks what (un)supervised learning actually selects for (absent inner alignment concerns).
To make that last point clearer: I'm claiming that in a supervised learning scenario where predicting label X always gives the neural network a small punishment, but then causes the training procedure to give the model 10 easy questions, the model will learn to not choose label X. (At least in the long run. In the short run, I could see the model being a little bit more likely to predict label X than expected, because easy questions could give the model a less steep gradient, so a neural network that predicts X would stick around a bit longer than it would have if X hadn't yielded easy questions. However, every time X is chosen, the nn is directionally pushed away from choosing it again; so in the infinite data limit, I think it would learn to not do that.) Conversely, if predicting label Y is correct on some inputs, but predicting Y immediately led to 10 hard questions, I don't think the model would learn to avoid label Y (though I'm unsure if the learning process would converge to choosing Y or just be unstable and never converge). I'm actually very curious if you agree with this; it seems like an important question.
(Conversely, I think that a robot RL-agent would be incentivised to control its future challenges, because of how RL loss functions are shaped. A definition of RL optimal behavior should naturally take this into account.)
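To illustrate the contrast I have in mind, here is a toy sketch with a single parameter theta, where the model picks label X with probability sigmoid(theta). It is only a sketch under strong assumptions: the numbers (penalty, easy-question bonus, learning rate) are made up, the updates use expected gradients rather than sampled ones, and I assume the losses on the follow-up easy questions don't depend on theta at all, which wouldn't hold exactly in a real network with shared parameters.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def supervised_step(theta, lr):
    """Gradient descent on -log(1 - p_X), the cross-entropy when X is wrong.

    d/dtheta [-log(1 - sigmoid(theta))] = sigmoid(theta), so the update
    never sees how easy the questions that follow X are.
    """
    return theta - lr * sigmoid(theta)


def policy_gradient_step(theta, lr, return_x, return_other):
    """Gradient ascent on expected return J = p*return_x + (1-p)*return_other.

    dJ/dtheta = p * (1 - p) * (return_x - return_other), so rewards from
    the later easy questions are credited to the choice of X.
    """
    p = sigmoid(theta)
    return theta + lr * p * (1.0 - p) * (return_x - return_other)


def run(steps=200, lr=0.5):
    penalty = 0.1      # hypothetical small punishment for predicting X
    easy_bonus = 0.05  # hypothetical reward per easy follow-up question
    n_easy = 10
    return_x = -penalty + n_easy * easy_bonus  # RL return after choosing X
    theta_sl = theta_rl = 0.0
    for _ in range(steps):
        theta_sl = supervised_step(theta_sl, lr)
        theta_rl = policy_gradient_step(theta_rl, lr, return_x, 0.0)
    # Final probability of choosing X under each training signal.
    return sigmoid(theta_sl), sigmoid(theta_rl)
```

Under these assumptions, the supervised learner's probability of choosing X shrinks on every step, while the return-maximising learner's probability grows, because only the latter's objective includes what happens after the choice.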
Now, by defining optimal behavior as "Correctly predicts every word it's asked to predict", I'm saying that the model is outer aligned at optimum iff the model cannot simultaneously answer every question optimally and adversarially optimise against humans. And if we only consider the normal output channel, I'm saying it couldn't, because it would have to make incorrect predictions in order to broadcast its message; in which case it wouldn't be optimal.
If we also consider side-channels, this gets messier, because my chosen definition doesn't imply any constraints on what the AI could do with side channels. Thus, if it could use side channels to take over the world in any way, my definition would say it wasn't outer aligned. My preferred solution to this is just to intentionally ignore side channels when defining outer alignment at optimum (as you correctly inferred). Three reasons for this:
(Btw I'd say the bad image classifier is aligned but less performance-competitive than a good image classifier, though I'm sympathetic to the view that it doesn't make sense to talk about its alignment properties at all.)