Adam Shimi

Epistemologist specialized in the difficulties of alignment. Currently at Conjecture, and running Refine.

Sequences

Epistemic Cookbook for Alignment
Reviews for the Alignment Forum
AI Alignment Unwrapped
Deconfusing Goal-Directedness

Comments

Thanks for the kind words and useful devil's advocate! (I'm expecting nothing less from you ;p)

  1. I expect it's unusual that [replace methodology-1 with methodology-2] will be a pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (also mirrors therapy - there's usually a [take things apart and better understand them] phase before any [put things back together in a more adaptive pattern] phase)
    1. It might be useful to predict this kind of thing ahead of time, to develop a sense of when to expect specific side-effects (and/or predictably unpredictable side effects)

I agree that pure replacement of methodology is a massive step that is probably premature before we have a really deep understanding both of the researcher's approach and of the underlying algorithm for knowledge production. Which is why in my model, this comes quite late; the first steps are instead about revealing the cached methodology to the researcher, and showing alternatives from the History of Science (and Technology) to make more options and approaches credible for them.

Also looking at the "sins of the fathers" for philosophy of science (how methodologies have fucked up people across history) is part of our last set of framing questions. ;)

  1. I do think it's worth interviewing at least a few carefully selected non-alignment researchers. I basically agree with your alignment-is-harder case. However, it also seems most important to be aware of things the field is just completely missing.
    1. In particular, this may be useful where some combination of cached methodologies is a local maximum for some context. Knowing something about other hills seems useful here.
      1. I don't expect it'd work to import full sets of methodologies from other fields, but I do expect there are useful bits-of-information to be had.
    2. Similarly, if thinking about some methodology x that most alignment researchers currently use, it might be useful to find and interview other researchers that don't use x. Are they achieving [things-x-produces] in other ways? What other aspects of their methodology are missing/different?
      1. This might hint both at how a methodology change may impact alignment researchers, and how any negative impact might be mitigated.

Two reactions here:

  1. I agree with the need to find things that are missing and alternatives, which is where work in the history and philosophy of science comes in to help. One advantage over interviews is that, in hindsight, you can generally judge whether a methodology turned out successful or problematic.
  2. I hadn't thought about interviewing other researchers. I expect it to be less efficient in a lot of ways than the HPS work, but I'm also now on the lookout for the option, so thanks!
  1. Worth considering that there's less of a risk in experimenting (kindly, that is) on relative newcomers than on experienced researchers. It's a good idea to get a clear understanding of the existing process of experienced researchers. However, once we're in [try this and see what happens] mode there's much less downside with new people - even abject failure is likely to be informative, and the downside in counterfactual object-level research lost is much smaller in expectation.

I see what you're pointing out. A couple related thoughts:

  1. One benefit of working with established researchers is that you have a historical record of what they did, which makes it easier to judge whether you're actually helping.
  2. I also expect helping established researchers to be easier on some dimensions, because they have more experience learning new models and leveraging them.
  3. Related to your first point, I don't worry too much about messing people up, because the initial input will be far less invasive than a wholesale replacement of methodologies. But we're still investigating the risks to be sure we're not doing something net negative.

You probably know better than me, but I still have this intuition that seed-AI and FOOM have oriented the framing of the problem and the sort of questions asked. I think people who came to agent foundations from different routes ended up asking slightly different questions.

I could totally be wrong though, thanks for making this weakness of my description explicit!

That's a great point!

There's definitely one big difference between how Scott defined it and how I'm using it, which you highlighted well. I think a better way of explaining my change is that in Scott's original example, the AI being flawed results, in some sense, in the alignment scheme (predict human values and do that) being flawed too.

I hadn't made the explicit claim in my head or in the post, but thanks to your comment, I think I'm claiming that the version I'm proposing generalizes one of the interesting parts of the original definition, and lets it be applied to more settings.

As for your question, there is a difference between flawed and not the strongest version. What I'm saying about interpretability and single-single is not that a flawed implementation of them would not work (which is obviously trivial), but that for the reductions to function, you need to solve a particularly ambitious form of the problem. And that we don't currently have a good reason to expect to solve this ambitious problem with enough probability to warrant trusting the reduction and not working on anything else.

So an example of a plausible solution (of course I don't have a good solution at hand) would be to create sufficient interpretability techniques that, when combined with conceptual and mathematical characterizations of problematic behaviours like deception, we're able to see if a model will end up having these problematic behaviours. Notice that this possible solution requires working on conceptual alignment, which the reduction to interpretability would strongly discourage.

To summarize, I'm not claiming that interpretability (or single-single) won't be enough if it's flawed, just that reducing the alignment problem (or multi-multi) to them is actually a reduction to an incredibly strong and ambitious version of the problem, that no one is currently tackling this strong version, and that we have no reason to expect to solve the strong version with such high probability that we should shun alternatives and other approaches.

Does that clarify your confusion with my model? 

Yeah, I will be posting updates, and probably the participants themselves will post some notes and related ideas. Excited too about how it's going to pan out!

Thanks for the comment!

To be honest, I had more trouble classifying you, and now that you commented, I think you're right that I got the wrong label. My reasoning was that your agenda and directions look far more explicit and precise than Paul's or Evan's, which is definitely a more mosaic-y trait. On the other hand, there is the iteration that you describe, and I can clearly see a difference in terms of updating between you and, let's say, John or Eliezer.

My current model is that you're more palimpsest-y, but compared with most of us, you're surprisingly good at making your current iteration fit into a proper structure that you can make explicit and legible.

(Will update the post in consequence. ;) )

Nice post! Two things I particularly like are the explicit iteration (demonstrating by example how and why not to only use one framing), as well as the online learning framing.

The policy behaves in a competent yet undesirable way which gets low reward according to the original reward function.[2] This is an inner alignment failure, also known as goal misgeneralization. Langosco et al. (2022) provide a more formal definition and some examples of goal misgeneralization.

It seems like a core part of this initial framing relies on the operationalisation of "competent", yet you don't really point to what you mean. Notably, "competent" cannot mean "high-reward" (because of category 4) and "competent" cannot mean "desirable" (because of categories 3 and 4). Instead you point at something like "whatever it's incentivized to do, it's reasonably good at accomplishing it". I share a similar intuition, but just wanted to highlight that subtleties might hide there (maybe addressed in later framings, but at least not mentioned at this point).

In general, we should expect that alignment failures are more likely to be in the first category when the test environment is similar to (or the same as) the training environment, as in these examples; and more likely to be in the second category when the test environment is very different from the training environment.

What comes to my mind (and is mentioned a bit after the quote) is that we could think of different hypotheses on the hardness of alignment as quantifying how similar the test environment must be to the training one to avoid inner misalignment. Potentially, for harder versions of the problem, almost any difference that could tractably be detected is enough for the AI to behave differently.

I’d encourage alignment researchers to get comfortable switching between these different framings, since each helps guide our thinking in different ways. Framing 1 seems like the most useful for connecting to mainstream ML research. However, I think that focusing primarily on Framing 1 is likely to overemphasize failure modes that happen in existing systems, as opposed to more goal-directed future systems. So I tend to use Framing 2 as my main framing when thinking about alignment problems. Lastly, when it’s necessary to consider online training, I expect that the “goal robustness” version of Framing 3 will usually be easier to use than the “high-stakes/low-stakes” version, since the latter requires predicting how AI will affect the world more broadly. However, the high-stakes/low-stakes framing seems more useful when our evaluations of AGIs are intended not just for training them, but also for monitoring and verification (e.g. to shut down AGIs which misbehave).

Great conclusion! I particularly like your highlighting that each framing is more adapted to different purposes.

Well, isn't having multiple modules a precondition to something being modular? That seems like what's happening in your example: it has only one module, so it doesn't even make sense to apply John's criterion.

Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.

You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you'll end up burning several years on ideas which don't actually leave the field better off.

I agree that avoiding the Hard Parts is rarely productive, but you don't address one relevant concern: what if the Hard Part is not merely Hard, but actually Impossible? In that case, your advice can also be cashed out by trying to prove the impossibility instead of avoiding it. And just like with most impossibility results in TCS, even if the precise formulation turns out to be impossible, that often just means you need to reframe the problem a bit.

Mostly, I think the hard parts are things like "understand agency in general better" and "understand what's going on inside the magic black boxes". If your response to such things is "sounds hard, man", then you have successfully identified (some of) the Hard Parts.

I expect you would also say that a crucial hard part many people are avoiding is "how to learn human values?", right? (Not the true names, but a useful pointer)

The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what's going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!

I want to note that the failure mode of blind theory here is to accept any story, and thus make the requirement of a story completely impotent to guide research. There's an art (and hopefully a science) to finding stories that bias towards productive mistakes.

Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it's where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.

I expect you to partially disagree, but there's not always a "right" operationalization, and there's a failure mode where one falls in love with their neat operationalization, making the missed parts of the phenomenon invisible.

Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.

I want to say that you should start with behavioral theorems, and often the properties you want to describe might make more sense behaviorally, but I guess you're going to answer that we have evidence that this doesn't work in Alignment, and so it is avoiding the Hard Parts. Am I correct?

Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it's usually a sign of avoiding the Hard Parts.

One formal example of this is the relativization barrier in complexity theory, which tells you that you can't prove P ≠ NP (and a bunch of other separations) using only techniques that treat algorithms as black boxes instead of looking at their structure.

Once you're past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly discourages that.

Agreed that it's a great pair of advice to keep in mind!

In what way is AF not open to new ideas? I think it is a bit scary to publish a post here, but that has more to do with it being very public, and less to do with anything specific about the AF. But if AF has a culture of being non welcoming of new ideas, maybe we should fix that?

It's not that easy to justify a post from a year ago, but I think what I meant was that the Alignment Forum has a certain style of alignment research, and thus only reading it means you don't see stuff like CHAI research, or other work that aims at alignment without being shared much on the AF.

Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes?

This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is assumed to be straightforward.

Other possible names would then be either leaning into the complex systems view, so the (possibly incorrect) assumption is something like "non-complexity" or "linear/predictable responses"; or leaning into the optimisation paths analogy which might be something like "incremental improvement is ok" although that is pretty bad as a name.

Someone at Conjecture proposed "linear" too, but Newtonian physics isn't linear. Although I agree that the sort of behavior and reaction I'm pointing out fits within the "non-linear" category.
