How do we become confident in the safety of a machine learning system?

[-]adamShimi4y170

This is probably one of the most important post on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.

What this gives us is a way of combining the output of many disparate epistemic strategies to get well structured and directly relevant knowledge about alignment and how our proposals would fare. This is great, because now, we can combine many different methods of investigation (theory arguments, philosophical approaches, empirical studies of analogous systems and problems) and try to tie them to a common narrative (pun intended) about alignment.

Of course, we should expect that some things we want to learn about don't fit neatly in there, but training stories are still surprisingly inclusive. For example we could expect that reasoning about potential problems of AGI, in the very conceptual/philosophical/theoretical way we favor on the AF, doesn't fit a framework focused on justifying a given approach. Yet training stories also includes the probing of their rationale, and finding a new problem/issue allows new probing and refinement, like the very theoretical computer science model presented by Paul in his research methodology post.

There is indeed one thing this post doesn't get into: exactly which epistemic strategies can and should we use to argue for each part of a training story, and to break and falsify each. Still, I find that having a framing for combining and linking the output of the existing and new epistemic strategies is already quite an accomplishment. Plus it leaves me some work to do on clarifying and distilling the epistemic strategies of alignment.

Last but not least, I really like the name "story" for two reasons:

First this actually captures what most of these reasoning feel like. They're not so much theories than narratives, and using the word story makes that clear and explicit.
But more importantly, "story" makes technical people feel uncomfortable. We immediately fear weird justification and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that most of our knowledge will take a form like that. So the word reminds us daily to not feel too comfortable with our ideas and intuitions, as we always risk falling for our own inventions.

[-]evhub4y40

This is probably one of the most important post on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.

Glad you think so! I definitely agree and am planning on using this framework in my own research going forward.

"story" makes technical people feel uncomfortable. We immediately fear weird justification and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that most of our knowledge will take a form like that. So the word reminds us daily to not feel too comfortable with our ideas and intuitions, as we always risk falling for our own inventions.

Yep, this is definitely intentional. I think in many ways just thinking about inner alignment as avoiding proxy-aligned mesa-optimizers can give you false confidence in your training story because you reason “of course I won't get that specific failure model”—but the problem is that you need to couple some reason that you won't get the wrong thing with some strong reason that you actually will get the right thing to really be confident in your training process's safety.

[-]Daniel Kokotajlo2y1313

I found myself coming back to this now, years later, and feeling like it is massively underrated. Idk, it seems like the concept of training stories is great and much better than e.g. "we have to solve inner alignment and also outer alignment" or "we just have to make sure it isn't scheming."

Anyone -- and in particular Evhub -- have updated views on this post with the benefit of hindsight? Should we e.g. try to get model cards to include training stories?

[-]ryan_greenblatt2y72

Anyone -- and in particular Evhub -- have updated views on this post with the benefit of hindsight?

I intuitively don't like this approach, but I have trouble articulating exactly why. I've tried to explain a bit in this comment, but I don't think I'm quite saying the right thing.

One issue I have is that it doesn't seem to nicely handle interactions between the properties of the AI and how it's used. You can have an AI which is safe when used in some ways, but not always. This could be due to approaches like control (which mostly route around mechanistic properties of the AI), but also potentially things like using monitoring ensembles to handle lack of robustness and paying AIs rather than aligning them.

Another problem I have is that this doesn't very naturally incorporate various non-mechanistic analysis targeting specific threat models which IMO should be (and will be) very central. E.g., we built a wide variety of model organisms which are closely analogous to our training and deployment environment and which aim to uncover potential reward hacking failure modes and these model organisms didn't demonstrate any issues. Same for things like adversarially testing for clear misalignment: it doesn't result in a mechanistic model, but feels very central.

To be clear, I think all the things I discussed above can be discussed in this framework, but it feels quite unnatural and the decomposition doesn't seem like it's doing any work.

I think the type of mechanistic analysis proposed here seems quite aspirational with the current state of technology such that it feels odd to center it. Or the mechanistic analysis you do will apply to all training runs and no safety interventions will effect it such that it's more like useful background than a key part of analyzing different safety measures. To be clear, we will want to do some mechanistic analysis and have some space of mechanistic hypothesis. But this feels more like the background threat model than the core safety case due to difficulties in testing. We can also somewhat test these mechanistic hypothesis with experiments that don't require huge technological break throughs, but this seems more like an important sub-component of a safety case than the main thing.

Perhaps Evan thinks we're totally screwed (or at least can't obtain high confidence) without strong mechanistic analysis such that centering this is good. I think high confidence seems unclear and disagree with totally screwed. It's possible that my views here partially come down to a difference of opinion with Evan where he thinks that deceptive alignment is very likely given usage of models capable of powerful goal-oriented behavior where as I think this is uncertain. Further, I think it's reasonably likely (perhaps 1/3) that I'll end up being very confident that deceptive alignment is very unlikely at the point when we have powerful AIs (due to experiments and further conceptual reasoning).

More generally, I feel like the way I currently talk and think about safety cases and similar topics doesn't seem nicely fit into training stories. I think the way I currently do it is better, but I'm not entirely certain and I haven't tried the training stories approach much.

I should also note that a general approach like training stories seems much better than a decomposition like "inner alignment" vs "outer alignment" which is supposing a particular approach to solving the problem. (I do think that "inner misalignment" vs "outer misalignment" is reasonable decomposition of threat models in AIs produced with ML. But these are threat models, not problems to be solved and there are many routes to solving them. See here for more discussion.)

I think I prefer the default trajectory of safety cases and RSP more than what would happen with additional emphasis on training stories, but I'm uncertain.

[-]evhub2y50

I still really like my framework here! I think this post ended up popularizing some of the ontology I developed here, but the unfortunate thing about that post as the one that popularized this is that it doesn't really provide an alternative.

It’s worth noting that there are ways to potentially build advanced or transformative AI that don’t assume the emergency of agency (and in fact might rely on the opposite) such as the aforementioned Microscope AI or STEM AI. ↩︎
Obviously this isn’t fair because in neither of these cases was Paul trying to write a training goal; but nevertheless I think that the second example that I give is a really good example of what I think a training goal should look like. ↩︎
For example, instead of using transparency and interpretability tools, you might instead try to make use of AI cognitive science, as I discuss in the final section on “Exploring the landscape of possible training stories.” ↩︎
It’s worth noting that while guaranteeing a short horizon length might be quite helpful for preventing deception, a short horizon length alone isn’t necessarily enough to guarantee the absence of deception, since e.g. a model with a short horizon length might cooperate with future versions of itself in such a way that looks more like a model with a long horizon length. See “Open Problems with Myopia” for more detail here. ↩︎

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

68

How do we become confident in the safety of a machine learning system?

68

What’s a training story?

Training story components

How mechanistic does a training goal need to be?

Relationship to inner alignment

Do training stories capture all possible ways of addressing AI safety?

Evaluating proposals for building safe advanced AI

Case study: Microscope AI

Exploring the landscape of possible training stories

Training story sensitivity analysis