You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?
The basic analogy is roughly "if we want a baseline for how hard it will be to evaluate an AI's outputs on their own terms, we should look at how hard it is to evaluate humans' outputs on their own terms, especially in areas similar in some way to AI safety". My guess is that you already have lots of intuition about how hard it is to assess results, from your experience assessing grantees, so that's the intuition I was trying to pump. In particular, I'm guessing that you've found first hand that things are much harder to properly evaluate than it might seem at first glance.
The "we" in "we can point AIs toward and have some ability to assess" meant humans, not Open Phil.
If you think generic "humans" (or humans at e.g. Anthropic/OpenAI/Deepmind, or human regulators, or human ????) are going to be better at the general skill of evaluating outputs than yourself or the humans at Open Phil, then I think you underestimate the skills of you and your staff relative to most humans. Most people do not perform any minimal-trust investigations. So I expect your experience here to provide a useful conservative baseline.
I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in a better position to automate it later.
Indeed, I think you're a good role model in this regard and hope more people will follow your example.
It seems to me like the main crux here is that you're picturing a "phase transition" that kicks in in a fairly unpredictable way, such that a pretty small increase in e.g. inference compute or training compute could lead to a big leap in capabilities. Does that sound right?
I don't think this is implausible but haven't seen a particular reason to consider it likely.
The phrase I'd use there is "grokking general-purpose search". Insofar as general-purpose search consists of a relatively-simple circuit/function recursively calling itself a lot with different context-specific knowledge/heuristics (e.g. the mental model here), once a net starts to "find" that general circuit/function during training, it would grok for the same reasons grokking happens with other circuits/functions (whatever those reasons are). The "phase transition" would then be relatively sudden for the same reasons (and probably to a similar extent) as in existing cases of grokking.
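To make that structural picture concrete, here's a minimal toy sketch (purely illustrative and entirely my own framing - nothing a net literally implements, and every name in it is made up): a single simple routine recursively calling itself, with all the context-specific knowledge factored out into the heuristics passed in.

```python
# Illustrative-only sketch of "general-purpose search": one tiny reusable
# wrapper, recursively applied, with domain knowledge living in `heuristics`.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Heuristics:
    is_solved: Callable[[object], bool]
    propose: Callable[[object], List[object]]      # context-specific moves
    best_guess: Callable[[object], object]
    combine: Callable[[object, List[object]], object]

def general_search(problem, h: Heuristics, depth=0, max_depth=10):
    # The "general-purpose" part: this same simple function is reused at
    # every level of recursion, regardless of domain.
    if h.is_solved(problem) or depth >= max_depth:
        return h.best_guess(problem)
    subresults = [general_search(sub, h, depth + 1, max_depth)
                  for sub in h.propose(problem)]
    return h.combine(problem, subresults)

# Toy usage: "search" for the max of a list by repeatedly splitting it in half.
h = Heuristics(
    is_solved=lambda xs: len(xs) <= 1,
    propose=lambda xs: [xs[: len(xs) // 2], xs[len(xs) // 2 :]],
    best_guess=lambda xs: xs[0] if xs else None,
    combine=lambda _, results: max(r for r in results if r is not None),
)
print(general_search([3, 7, 2, 9, 4], h))  # -> 9
```

The point of the sketch is just that the search wrapper itself is a small, simple, reusable piece - which is the structural feature that would plausibly be "found" and grokked as a unit.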
I don't personally consider that argument strong enough that I'd put super-high probability on it, but it's at least enough to privilege the hypothesis.
Among other things, it seems important that there are a bunch of specific useful tasks we can point AIs toward and have some ability to assess on their own grounds (standards enforcement, security, etc.)
Do you think you/OpenPhil have a strong ability to assess standards enforcement, security, etc, e.g. amongst your grantees? I had the impression that the answer was mostly "no", and that in practice you/OpenPhil usually mostly depend on outside indicators of grantees' background/skills and mission-alignment. Am I wrong about how well you think you can evaluate grantees, or do you expect AI to be importantly different (in a positive direction) for some reason?
+1, this is probably going to be my new default post to link people to as an intro.
Brief responses to the critiques:
Results don’t discuss encoding/representation of abstractions
Totally agree with this one, it's the main thing I've worked on over the past month and will probably be the main thing in the near future. I'd describe the previous results (i.e. ignoring encoding/representation) as characterizing the relationship between the high-level and the low-level.
Definitions depend on choice of variables
The local/causal structure of our universe gives a very strongly preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. For instance, it doesn't make sense to use variables which "rotate together" the states of five different local patches of spacetime which are not close to each other; those five patches will generally not be rotated together by default in, e.g., an evolving agent's sensory feed.
That does still leave degrees of freedom in how we represent all the local patches, but those are exactly the degrees of freedom which don't matter for natural abstraction. (Under the minimal latent formulation: we can represent each individual variable, or each set of variables which we're making independent of some other stuff, in a different way without changing anything informationally. Under the redundancy formulation: assume our resampling process allows simultaneous resampling of small sets of variables, to avoid the case where two variables are very tightly coupled to each other but otherwise independent of everything else. With that modification in place, the same argument as in the minimal latent formulation applies.)
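For concreteness, the standard information-theoretic fact doing the work in that "without changing anything informationally" claim (my phrasing here, and the symbol $\Lambda$ for the abstract summary, are not from the original posts) is that mutual information is invariant under invertible reparameterization of each variable separately:

$$I\big(f_1(X_1), \dots, f_n(X_n);\, \Lambda\big) = I\big(X_1, \dots, X_n;\, \Lambda\big) \quad \text{for bijections } f_1, \dots, f_n.$$

The same holds for conditional mutual information, so any condition stated purely in informational terms can't distinguish between different representations of the same local patch.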
Theorems focus on infinite limits, but abstractions happen in finite regimes
Totally agree with this one too, and it has also been a major focus for me over the past couple months.
I'd also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn't yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual "right" formalizations. (The more general principle here is to only add formality when it's the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don't yet know the full right formality, then we should sketch at the level of precision we think we do know.)
Missing theoretical support for several key claims
Basically agree with this. In particular, I think the quoted block is indeed a place where I was a bit overexcited at the time and made too strong a claim. More generally, for a while I was thinking of "deterministic constraints" as basically implying "low-dimensional" in practice, based on intuitions from physics. But in hindsight, that's at least not externally-legibly true, and arguably not true in general at all.
Figuring out whether the Universality Hypothesis is true
... What we’re less convinced of is that the current theoretical approach is a good way to tackle this question. One worrying sign is that almost two years after the project announcement (and over three years after work on natural abstractions began), there still haven’t been major empirical tests, even though that was the original motivation for developing all of the theory. ... Of course sometimes experiments do require upfront theory work. But in this case, we think that e.g. empirical interpretability work is already making progress on the Universality Hypothesis, whereas we’re unsure whether the natural abstractions agenda is much closer to major empirical tests than it was two years ago.
See the section on "Low level of precision...". Also, You Are Not Measuring What You Think You Are Measuring is a very relevant principle here - I have lots of (not necessarily externally-legible) bits of evidence about a rough version of natural abstraction, but the details I'm still figuring out are (not coincidentally) exactly the details where it's hard to tell whether we're measuring the right thing.
Abstractions as a bottleneck for agent foundations: The high-level story for why abstractions seem important for formalizing e.g. values seems very plausible to us. It’s less clear to us whether they are necessary (or at least a good first step)
Yeah, I don't think this should be externally-legibly clear right now. I think people need to spend a lot of time trying and failing to tackle agent foundations problems themselves, repeatedly running into the need for a proper model of abstraction, in order for this to be clear.
Accelerating alignment research: The promise behind this motivation is that having a theory of natural abstractions will make it much easier to find robust formalizations of abstractions such as “agency”, “optimizer”, or “modularity”. ... To us, such an outcome seems unlikely, though it may still be worth pursuing
I probably put higher probability on success here than you do, but I don't think it should be legibly clear.
Interpretability: ... Figuring out the real-world meaning of internal network activations is one of the core themes of safety-motivated interpretability work. And reverse-engineering a network into “pseudocode” is not just some separate problem, it’s deeply intertwined. We typically understand the inputs of a network, so if we can figure out how the network transforms these inputs, that can let us test hypotheses for what the meaning of internal activations is.
An intuitive understanding of inputs plus a circuit is not, in general, sufficient to interpret the internal things computed by the circuit. Easy counterargument: neural nets are circuits, so if those two pieces were enough, we'd already be done; there would be no interpretability problem in the first place.
Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple - e.g. edge detectors in Olah's early work, or the sinusoidal elements in Neel's work on modular addition. But this falls apart quickly as the circuits get bigger - e.g. later layers in vision nets, once we get past early things like edge and texture detectors.
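As a toy illustration of that counterargument (my own example with made-up random weights, not drawn from any existing interpretability work): below we understand the inputs perfectly and have the complete circuit in front of us, yet the hidden activations still don't come labeled with meanings.

```python
# Toy example: full knowledge of inputs + circuit, yet opaque internals.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 16)), rng.normal(size=16)  # hidden-layer weights
W2 = rng.normal(size=(16, 1))                           # output weights

def circuit(x):
    h = np.tanh(x @ W1 + b1)   # 16 hidden activations
    return h, h @ W2

x = np.array([0.3, -1.2])      # an input whose meaning we understand perfectly
h, y = circuit(x)
print(np.round(h, 2))          # the circuit is fully specified above, but these
                               # 16 numbers still need interpretation work
```

Knowing the inputs and the exact computation narrows down hypotheses, but it doesn't by itself hand us the meaning of the intermediate values; that's the remaining interpretability problem.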
Low level of precision and formalization
I mentioned earlier the heuristic of "only add formality when it's the right formality; don't prematurely add ad-hoc formulations just for the sake of making things more formal".
More generally, if you're used to academia, then bear in mind that the incentives of academia push towards making one's work defensible to a much greater degree than is probably optimal for truth-seeking. Formalization is one part of this: in academia, the incentive is usually to add ad-hoc formalization in order to get a full formal proof rather than a sketch, even if the added ad-hoc formalization does not match reality well. On the experimental side, the incentive usually pushes toward bulletproof results rather than toward gaining lots of information. (... and that's the better case. In the worse case, the incentive is to jump through certain hoops which are nominally about bulletproofing, but don't even do that job very well, like e.g. statistical significance.) And yes, defensibility does have value even for truth-seeking, but there are tradeoffs, and I advise against anchoring too much on academia.
With that in mind: both my current work and most of my work to date are aimed more at truth-seeking than at defensibility. I don't think I currently have all the right pieces, and I'm trying to get the right pieces quickly. For that purpose, it's important to make the stuff I think I understand as legible as possible so that others can help. I try to accurately convey my models and epistemic state. But it's not important to e.g. make it easy for others to point out mistakes in places where I didn't think the formality was right anyway. If and when I have all the pieces, then I can worry about defensible proof.
That said, I agree with at least some parts of the critique. Being both precise and readable at the same time is hard, man.
Few experiments
As we briefly discussed earlier, we think it’s worrying that there haven’t been major experiments on the Natural Abstraction Hypothesis, given that John thinks of it as mostly an empirical claim. We would be excited to see more discussion on experiments that can be done right now to test (parts of) the natural abstractions agenda! We elaborate on a preliminary idea in the appendix (though it has a number of issues).
I do love your experiment ideas! The experiments I ran last summer had a similar flavor - relatively-simple checks on MNIST nets - though they were focused on the "information at a distance" lens rather than the redundancy or minimal latent lenses.
Anyway, similar answer here as the previous section: at this point I'm mainly trying to get to the right answers quickly, not trying to provide some impressive defensible proof. I run experiments insofar as they give me bits about what the right answers are.
A particular pattern Nate has talked about is what I might call "reflection." The basic idea is that in order to do certain kinds of research effectively, you need to keep asking questions like "What am I actually trying to do here and why? What is my big-picture goal?", which are questions that might "change your aims" in some important sense. The idea is not necessarily that you're rewriting your own source code, but that you're doing the kind of reflection and self-modification a philosophically inclined, independent-minded human might do: "I've always thought I cared about X, but when I really think about the implications of that, I realize maybe I only care about Y" and such. I think that in Nate's ontology (and I am partly sympathetic), it's hard to disentangle something like "Refocusing my research agenda to line it up with my big-picture goals" from something like "Reconsidering and modifying my big-picture goals so that they feel more satisfying in light of all the things I've noticed about myself." Reflection (figuring out what you "really want") is a kind of CIS, and one that could present danger, if an AI is figuring out what it "really wants" and we haven't got specific reasons to think that's going to be what we want it to want.
I'll unpack a bit more the sort of mental moves which I think Nate is talking about here.
In January, I spent several weeks trying to show that the distribution of low-level world state given a natural abstract summary has to take a specific form. Eventually, I became convinced that the thing I was trying to show was wrong - the distributions did not take that form. So then what? A key mental move at that point is to back up and ask: what was I hoping to get out of that result in the first place, and how else can I get it?
I think that's the main kind of mental move Nate is gesturing at.
It's a mental move which comes up at multiple different levels when doing research. At the level of hours or even minutes, I try a promising path, find that it's a dead end, then need to back up and think about what I hoped to get from that path and how else to get it. At the level of months or years, larger-scale approaches turn out not to work.
I'd guess that it's a mental move which designers/engineers are also familiar with: turns out that one promising-looking class of designs won't work for some reason, so we need to back up and ask what was promising about that class and how to get it some other way.
Notably: that mental move is only relevant in areas where we lack a correct upfront high-level roadmap to solve the main problem. It's relevant specifically because we don't know the right path, so we try a lot of wrong paths along the way.
As to why that kind of mental move would potentially be highly correlated with dangerous alignment problems... Well, what does that same mental move do when applied to near-top-level goals? For instance, maybe we tasked the AI with figuring out corrigibility. What happens when it turns out that e.g. corrigibility as originally formulated is impossible? Well, an AI which systematically makes the move of "Why did I want X in the first place and how else can I get what I want here?" will tend to go look for loopholes. Unfortunately, insofar as the AI's mesa-objective is only a rough proxy for our intended target, the divergences between mesa-objective and intended target are particularly likely places for loopholes to be.
I personally wouldn't put nearly so much weight on this argument as Nate does. (Though I do think the example training process Holden outlines is pretty doomed; as Nate notes, disjunctive failure modes hit hard.) The most legible-to-me reason for the difference is that I think that kind of mental move is a necessary but less central part of research than I expect Nate thinks. This is a model-difference I've noticed between myself and Nate in the past: Nate thinks the central rate-limiting step to intellectual progress is noticing places where our models are wrong, then letting go and doing something else, whereas I think identifying useful correct submodels in the exponentially large space of possibilities is the rate-limiting step (at least among relatively-competent researchers) and replacing the wrong parts of the old model is relatively fast after that.
I think the missing piece here is that people who want to outsource the solving of alignment to AIs are usually trying to avoid engaging with the hard problems of alignment themselves. So the key difference is that, in B, the people outsourcing usually haven't attempted to understand the problem very deeply.
Seems like the easiest way to satisfy that definition would be to:
In summary, saying "accident" makes it sound like an unpredictable effect, instead of a painfully obvious risk that was not taken seriously enough.
Personally, I usually associate "accident" with "painfully obvious risk that was not actually mitigated" (note the difference in wording from "not taken seriously enough"). IIUC, that's usually how engineering/industrial "accidents" work, and that is the sort of association I'd expect someone used to thinking about industrial "accidents" to have.
At that point, the time at which we should have stopped has probably already passed, especially insofar as:
As written, this evaluation plan seems to be missing elbow-room. The AI which I want to not be widely deployed is the one which is almost but not quite capable of autonomous function in a test suite. The bar for "don't deploy" should be slightly before a full end-to-end demonstration of that capability.