Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I think and write independently about value learning.


Reducing Goodhart



My take on higher-order game theory

I have a question about this entirely divorced from practical considerations. Can we play silly ordinal games here?

If you assume that the other agent will take the infinite-order policy, but then naively maximize your expected value rather than unrolling the whole game-playing procedure, this is sort of like . So I guess my question is, if you take this kind of dumb agent (that still has to compute the infinite agent) as your baseline and then re-build an infinite tower of agents (playing other agents of the same level) on top of it, does it reconverge to  or does it converge to some weird ?

Corrigibility Can Be VNM-Incoherent

So we have a switch with two positions, "R" and "L."

When the switch is "R," the agent is supposed to want to go to the right end of the hallway, and vice versa for "L" and left. It's not that you want this agent to be uncertain about the "correct" value of the switch and so it's learning more about the world as you send it signals - you just want the agent to want to go to the left when the switch is "L," and to the right when the switch is "R."

If you start with the agent going to the right along this hallway, and you change the switch to "L," and then a minute later change your mind and switch back to "R," it will have turned around and passed through the same spot in the hallway multiple times.

The point is that if you try to define a utility as a function of the state for this agent, you run into an issue with cycles: if every move has to strictly improve the utility, you can never get back to where you were before.
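The cycle argument above can be checked mechanically in a toy setting. This is a minimal sketch with made-up names: the corrigible trajectory walks the agent right, then left, then right again between the same two hallway positions, so a utility over positions would need both U(1) > U(0) and U(0) > U(1).

```python
# Each move is (from_pos, to_pos). The switch flips mid-run, so the
# agent crosses the same two positions in both directions.
moves = [(0, 1), (1, 0), (0, 1)]

def consistent_with_some_utility(moves, candidate_utilities):
    """True if any candidate utility strictly increases on every move."""
    return any(
        all(U[b] > U[a] for a, b in moves) for U in candidate_utilities
    )

# With only two positions there are only two strict orderings to try:
candidates = [{0: 0, 1: 1}, {0: 1, 1: 0}]
print(consistent_with_some_utility(moves, candidates))  # False
```

No assignment of utilities to positions makes every step of the turn-around trajectory "uphill", which is the cycle obstruction in miniature.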

Corrigibility Can Be VNM-Incoherent

I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules. But I don't know how to formalize this or if anyone else has.

Corrigibility Can Be VNM-Incoherent

Someone at the coffee hour (Viktoriya? Apologies if I've forgotten a name) gave a short explanation of this using cycles. If you imagine an agent moving either to the left or the right along a hallway, you can change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.

This basically rules out expected utility maximization (with utility as a discounted sum of state utilities) as the source of this behavior. But you can still imagine selecting a policy such that it takes the right actions in response to you sending it signals. I think a sensible way to do this is like in tailcalled's recent post, with causal counterfactuals for sending one signal or another.
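To illustrate the policy-selection view, here's a hedged sketch (names and setup are illustrative, not from the post): rather than scoring states, we directly pick a policy that conditions on the switch signal, and the turn-around behavior falls out trivially.

```python
def corrigible_policy(position, switch):
    """Step toward whichever end of the hallway the switch currently names."""
    return +1 if switch == "R" else -1

# Simulate flipping the switch mid-run: the agent turns around and
# revisits positions -- behavior no fixed utility-over-positions
# reproduces, but a two-line policy does.
pos, trace = 0, []
for switch in ["R", "R", "L", "L", "R"]:
    pos += corrigible_policy(pos, switch)
    trace.append(pos)
print(trace)  # [1, 2, 1, 0, 1]
```

The point of the contrast: the object being selected is a mapping from signals to actions, so cycles through the same position pose no consistency problem.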

Goodhart: Endgame
  • One thing we can do to help is set up our AI to avoid taking us into weird out-of-distribution situations where my preferences are ill-defined.
  • Another thing we can do to help is have meta-preferences about how to deal with situations where my preferences are ill-defined, and have the AI learn those meta-preferences.

And in fact, "We don't actually want to go to the sort of extreme value that you can coax a model of us into outputting in weird out-of-distribution situations" is itself a meta-preference, and so we might expect something that does a good job of learning about my meta-preferences to either learn about that, or to find it consistent with its starting meta-preferences.

Another thing is, we implicitly trust our own future preferences in weird out-of-distribution situations, because what else can we do? So we can build an AI that we trust for a similar reason: either (A) it's transparent, and we train it to do human-like things for human-like reasons, or (B) it's trained to imitate human cognition.

This is the only bit I'd disagree with.

I wouldn't trust my own evaluations in weird out-of-distribution situations, certainly not if "weird" means "chosen specifically so that Charlie's evaluations are really weird." If we build an AI that we trust, I'm going to trust it to take a look at those weird OOD situations and then not go there.

If it's supervised by humans, humans need to notice that it's e.g. trying to change the environment in ways that break down the concept of human agency and stop it. If it's imitating human reasoning, it needs to imitate the same sort of reasoning I've used just now.

I'd also be interested in a compare/contrast with, say, this Stuart Armstrong post.

This is super similar to a lot of Stuart Armstrong's stuff. Human preferences are under-defined, there's a "non-obvious" part of what we think of as Goodhart's law that's related to this under-definition, but it's okay, we can just pick something that seems good to us - these are all Stuart Armstrong ideas more than Charlie Steiner ideas.

The biggest contrast is pointed to by the fact that I didn't use the word "utility" all sequence long (iirc). In general, I think I'm less interested than him in trying to jump into constructing imperfect models of humans with the tools at hand, and more interested in (or at least more focused on) new technologies and insights that would enable learning the entire structure of those models. I think we also have different ideas about how to do more a priori thinking to get better at evaluating proposals for value learning, but it's hard to articulate.

Models Modeling Models

This was a whole 2 weeks ago, so all I can say for sure is that I was at least unclear about your point.

But I feel like I kind of gave a reply anyway - I don't think the parallel with subagents is very deep. But there's a very strong parallel (or maybe not even a parallel, maybe this is just the thing I'm talking about) with generative modeling.

Ngo and Yudkowsky on AI capability gains

Parts of this remind me of flaming my team in a cooperative game.

A key rule to remember about team chat in videogames is that chat actions are moves in the game. It might feel satisfying to verbally dunk on my teammate for a̶s̶k̶i̶n̶g̶ ̶b̶i̶a̶s̶e̶d̶ ̶̶q̶u̶e̶s̶t̶i̶o̶n̶s̶ not ganking my lane, and I definitely do it sometimes, but I do it less if I occasionally think "what chat actions can help me win the game from this state?"

This is less than maximally helpful advice in a conversation where you're not sure what "winning" looks like. And some of the more obvious implications might look like the dreaded social obeisance.

Ngo and Yudkowsky on alignment difficulty

Ngo is very patient and understanding.

Perhaps... too patient and understanding. Richard! Blink twice if you're being held against your will!


(I too would like you to write more about agency :P)

"Summarizing Books with Human Feedback" (recursive GPT-3)

Ah, yeah. I guess this connection makes perfect sense if we're imagining supervising black-box-esque AIs that are passing around natural language plans.

Although that supervision problem is more like... summarizing Alice in Wonderland if all the pages had gotten ripped out and put back in random order. Or something. But sure, baby steps.

"Summarizing Books with Human Feedback" (recursive GPT-3)

I'd heard about this before, but not the alignment spin on it. This is more interesting to me from a capabilities standpoint than an alignment standpoint, so I had assumed that this was motivated by the normal incentives for capabilities research. I'd be interested if I'm in fact wrong, or if it seems more alignment-y to other people.
