I've been looking into the AI alignment problem for the last couple of days and came up with the following summary of what the problems are and why they arise. Also, I'd prefer the umbrella name "human alignment problem", as AI alignment is just a subset of it.

  • The problem is that we don't know what we want.
  • And even if we individually knew, we couldn't agree with others. (opinion aggregation)
  • Even if we agreed with others on what we want, it would be hard to implement it. (coordination)
    • Because of game theory, momentum, and because of disagreements about how to do it.
  • Maybe we can create something smarter than us that solves these problems.
  • But we don't know how to create something smarter than us.
  • Maybe we can create something that will start out dumber, but can learn and will eventually become smarter.
  • We are afraid that something like this could become very powerful very quickly, and it’s likely to kill us - either as a mere side-effect or because of conflicting goals. (AI alignment problem)
  • But we don't know how to describe what it should learn. (outer alignment)
  • So maybe we can just give examples of what we know we want it to learn. (ML training)
    • We could make it learn from humans, but data from humans are inconsistent.
  • But it's impossible to describe all the cases, so in practice the situations facing the ML model will be quite different. (distributional shift)
  • And if what we want the ML model to learn is very specific and complicated, it's quite likely that what the model learns will behave very differently outside of our examples than how we'd want it to. (inner alignment)
  • It will also be hard to distinguish the cases where it does and where it doesn't do what we want. (eliciting latent knowledge)
    • Generally, sufficiently capable ML models are hard to understand. (interpretability)
  • Especially if the model knows it can do more of the stuff from training when we are not looking. (deceptive mesa optimizers)
  • Also, if we realize it’s doing something other than what we wanted it to do, it might be hard to change it, because we’d be interfering with its learned goals. (corrigibility)
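
The distributional shift bullet above can be made concrete with a toy sketch. It assumes a deliberately simple setup that is not from the alignment literature: the "model" is a straight line fitted by least squares, the intended behaviour is y = x², and all function and variable names are hypothetical. It only shows how a model that looks fine on its training examples can drift far from the intended behaviour outside them; it says nothing about learned goals or deception.

```python
# Toy illustration only: a straight line fitted to y = x**2 on a narrow range.
# All names are hypothetical and chosen for this example.

def intended_behaviour(x):
    """What we actually want the model to compute."""
    return x * x

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def worst_error(slope, intercept, xs):
    """Largest gap between the fitted line and the intended behaviour on xs."""
    return max(abs((slope * x + intercept) - intended_behaviour(x)) for x in xs)

# Training examples cover only the range 0.0 .. 1.0.
train_xs = [i / 20 for i in range(21)]
train_ys = [intended_behaviour(x) for x in train_xs]
slope, intercept = fit_line(train_xs, train_ys)

# "Deployment" inputs come from a different range, 2.0 .. 3.0.
shifted_xs = [2 + i / 20 for i in range(21)]

in_distribution_error = worst_error(slope, intercept, train_xs)
shifted_error = worst_error(slope, intercept, shifted_xs)

print("worst error on training range:", in_distribution_error)
print("worst error on shifted range: ", shifted_error)
```

The line is a reasonable approximation on the examples it was given, yet its error grows by more than an order of magnitude on the shifted inputs, and nothing in the training data alone would reveal that.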

This is just a summary of my current understanding of the problem landscape. I don't subscribe to the stated motivations and conclusions, but more about that some other time.

Please let me know if I omitted or misrepresented some important aspect of the problem (given how simplified a version this is intended to be).


Not an expert, but will try to comment:

And even if we individually knew, we couldn't agree with others. (opinion aggregation)

I think this does not belong on the list. Yes, it is an important problem, but unlike the rest of the list, it does not have the "I don't even know where I would start" quality. If you could extract one person's preferences and turn them into an equation, then you could e.g. just repeat the process several billion times (expensive, but simple in theory) and then take an average of the obtained equations or something.
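
As a rough sketch of the "average the equations" idea, assume (and this is a big assumption) that each person's preferences have already been extracted as a utility number per outcome on a shared, comparable scale. The people, outcomes, and numbers below are all made up for illustration.

```python
# Minimal sketch of naive preference aggregation by averaging.
# Assumes every person's utilities are on the same comparable scale,
# which is itself a strong and contested assumption.

def average_utilities(people):
    """Average each outcome's utility across all people."""
    outcomes = people[0].keys()
    return {
        outcome: sum(person[outcome] for person in people) / len(people)
        for outcome in outcomes
    }

# Hypothetical extracted preferences (utility per outcome).
alice = {"park": 1.0, "library": 0.2, "mall": 0.0}
bob = {"park": 0.0, "library": 0.9, "mall": 0.3}
carol = {"park": 0.1, "library": 0.8, "mall": 1.0}

aggregate = average_utilities([alice, bob, carol])
best_outcome = max(aggregate, key=aggregate.get)

print(aggregate)
print("aggregate choice:", best_outcome)
```

Here the aggregate picks "library" even though it is nobody's first choice, which hints at why the choice of aggregation rule matters even once individual preferences are known.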

Even if we agreed with others on what we want, it would be hard to implement it. (coordination)

Again, a similar thing. In theory you do not need to coordinate with your opponents. You could build a superintelligent machine unilaterally, ask it to defeat everyone who resists, and in the meantime keep integrating other people's preferences into its utility function. Difficult, but solvable in theory.

A greater problem with lack of coordination is that you cannot coordinate "please let's stop building the machines until we figure out how to build machines that will not destroy us". Because someone can unilaterally build a machine that will destroy the world. Not because they want to, but because the time pressure did not allow them to be more careful.

...generally, I think you made a good list of difficult things, but it does not pinpoint the parts that are the most difficult. (It's like a list of "heavy objects" that would include a whale along with Jupiter.)

It also seems to me that the explanation of "mesa optimizers" is wrong, but I am not sure about the correct explanation myself. I think it is more about "your machine could create virtual submachines and delegate some tasks to them, but even if the machine itself is aligned, it could unknowingly create an unaligned virtual submachine".

Hey, I agree that the first 3 bullets are clunky. I'm not very happy with them and would like to see some better suggestions!

A greater problem with lack of coordination is that you cannot coordinate "please let's stop building the machines until we figure out how to build machines that will not destroy us". Because someone can unilaterally build a machine that will destroy the world. Not because they want to, but because the time pressure did not allow them to be more careful.

Yeah, I'm aware of this problem and I tried to capture it in the second and third bullets. But isn't the failure to coordinate on "please let's stop building the machines until we figure out how to build machines that will not destroy us" an example of how difficult opinion aggregation is? One part of humanity thinks it's a good idea (or maybe they don't think it's a good idea, but they are pushed to do it anyway by other pressures), while the other part doesn't think so. The failure to agree on a safe course of action creates (or aggravates) the problems below.

Regarding the deceptive mesa optimizers, the bullet should refer to the bullet before the one directly above it. Edited now. I.e., it's hard to know when it does and when it doesn't do what we want -> especially because there could be deceptive mesa optimizers. I don't attempt to explain this concept, I just say that the problem is there.

Did you mis-edit? Anyway, using that for mental visualisation might end up with a structure
__like
____this
______therefore…