Alex Gray, née Alex Ray; much of my work is under the latter name. I'm interested in language model alignment, and especially techniques to get models to reason out loud.
Thanks so much for making this!
I'm hopeful this sort of dataset will grow over time as new sources come about.
In particular, I'd nominate adding MLSN (https://www.alignmentforum.org/posts/R39tGLeETfCZJ4FoE/mlsn-4-many-new-interpretability-papers-virtual-logit) to the list of newsletters in the future.
This seems like an overly alarmist take on what is a pretty old trend of research. Six years ago there were a number of universities working on similar models for the VizDoom competition (IIRC the winning entries came from Intel and Facebook). It seems good to track this kind of research, but IMO the conclusions here are not supported at all by the evidence presented.
Do you have suggestions for domains where you do expect one-turn debate to work well, now that you've got these results?
I think your explanation of legibility here is basically what I have in mind, excepting that if it's human designed it's potentially not all encompassing. (For example, a world model that knows very little, but knows how to search for information in a library)
I think interpretability is usually a bit more narrow, and refers to developing an understanding of an illegible system. My take is that it is not "interpretability" to understand a legible system, but maybe I'm using the term differently than others here. This is why I don't think "interpretability" applies to systems that are designed to be always-legible. (In the second graph, "interpretability" is any research that moves us upwards)
I agree that the ability to come up with ideas that are totally alien and untranslatable to humans gives AGI a capabilities boost. I do think that requiring a system to only use legible cognition and reasoning is a big "alignment tax". However, I don't think that this tax is equivalent to a strong proof that legible AGI is impossible.
I think my central point of disagreement with this comment is that I do think that it's possible to have compact world models (or at least compact enough to matter). I think if there were a strong proof that it is not possible to have a generally intelligent agent with a compact world model (or a compact function which is able to estimate and approximate a world model), that would be an update for me.
(For the record, I think of myself as a generally intelligent agent with a compact world model)
Two Graphs for why Agent Foundations is Important (according to me)
Epistemic Signpost: These are high-level abstract reasons, and I don’t go into precise detail or gears-level models. The lack of rigor is why I’m short form-ing this.
First Graph: Agent Foundations as Aligned P2B Fixpoint
P2B (a recursive acronym for Plan to P2B Better) is a framing of agency as a recursively self-reinforcing process. It resembles an abstracted version of recursive self improvement, which also incorporates recursive empowering and recursive resource gathering. Since it’s an improvement operator we can imagine stepping, I’m going to draw an analogy to gradient descent.
Imagine a high-dimensional agency landscape. In this landscape, agents follow the P2B gradient in order to improve. This can be convergent such that two slightly different agents near each other might end up at the same point in agency space after some number of P2B updates.
Most recursive processes like these have fixed point attractors — in our gradient landscape these are local minima. For P2B these are stable points of convergence.
Instead of thinking just about the fixed point attractor, let’s think about the parts of agency space that flow into a given fixed point attractor. This is like analyzing watersheds on hilly terrain: which parts of the agency space flow into which attractors.
Now we can have our graph: it’s a cartoon of the “agency landscape” with different hills/valleys flowing into different local minima, colored by which local minimum they flow into.
Here we have a lot of different attractors in agency space, but almost all of them are unaligned. What we need to do is reach the tiny aligned attractor in the corner.
However, it’s basically impossible to initialize an AI at one of these attractors. The best we can do is make an agent and try to understand where in agency space it will start. Building an AGI is imprecisely placing a ball on this landscape, which will roll along the P2B gradient towards its P2B attractor.
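The watershed picture can be made concrete with a toy example. The sketch below is purely illustrative: the one-dimensional `p2b_gradient` function, the step size, and the starting points are all hypothetical choices, not anything taken from the P2B framing itself. It follows the gradient from several starting points and reports which local minimum each one ends up in.

```python
# Toy 1-D "agency landscape" (hypothetical, for illustration only).
# The gradient below has zeros at -2, -1, 0, 1, 2. The landscape it
# belongs to has local minima at -2, 0, 2 and ridges (maxima) at -1, 1.

def p2b_gradient(x: float) -> float:
    """Gradient of the toy landscape at position x."""
    return (x + 2) * (x + 1) * x * (x - 1) * (x - 2)


def attractor(x: float, step_size: float = 0.01, n_steps: int = 2000) -> int:
    """Follow the gradient downhill from x and return the minimum reached.

    The result is rounded to the nearest integer because the minima of
    this particular toy landscape sit at integer positions.
    """
    for _ in range(n_steps):
        x -= step_size * p2b_gradient(x)
    return round(x)


def basin_map(starts):
    """Map each starting point to the attractor its trajectory flows into."""
    return {start: attractor(start) for start in starts}


if __name__ == "__main__":
    for start, end in basin_map([-2.4, -1.5, -0.9, 0.3, 0.9, 1.2, 2.4]).items():
        print(f"start={start:+.1f} -> attractor at {end:+d}")
```

Starting points on either side of a ridge (for example 0.9 and 1.2) end in different minima, which is the sense in which imprecise placement of the ball matters.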
How does this relate to Agent Foundations? I see Agent Foundations as a research agenda to write up the criterion for characterizing the basin in agent space which corresponds to the aligned attractor. With this criterion, we can try to design and build an agent, such that when it P2Bs, it does so in a way that is towards an Aligned end.
Second: Agent Foundations as designing an always-legible model
ELK (Eliciting Latent Knowledge) formalized a family of alignment problems, eventually narrowing down to the Ontology Mapping Problem. This problem is about translating between some illegible machine ontology (basically its internal cognition) and our human ontology (concepts and relations that a person can understand).
Instead of thinking of it as a binary, I think we can think of the ontology mapping problem as a legibility spectrum. On one end of the spectrum we have our entirely illegible Bayes net prosaic machine learning system. On the other end, we have totally legible machines, possibly specified in a formal language with proofs and verification.
As a second axis I’d like to imagine development progress (this can be “how far along” we are, or maybe the capabilities or empowerment of the system). Now we can show our graph, of different paths through this legibility vs development space.
Some strategies move away from legibility and never intend to get back to it. I think these plans have us building an aligned system that we don’t understand, and possibly can’t ever understand (because it can evade understanding faster than we can develop understanding).
Many prosaic alignment strategies are about going down in legibility, and then figuring out some mechanism to go back up again in legibility space. Interpretability, ontology mapping, and other approaches fit in this frame. To me, this seems better than the previous set, but I am still skeptical of it.
Finally, my favorite set of strategies are ones that start legible and endeavor to never deviate from that legibility. This is where I think Agent Foundations is in this graph. I think there’s too little work on how we can build an Aligned AGI which is legible from start-to-finish, and almost all of the work that does exist seems to have a bunch of overlap with Agent Foundations.
Aside: earlier I included a threshold in legibility space that’s the “alignment threshold” but that doesn’t seem to fit right to me, so I took it out.
Maybe useful, an analogy this post brought to mind for me: replacing “AI” with “Animals”. Imagine a hypothetical alien civilization observing Early Earth and commenting on whether it poses a risk.
Doesn’t optimization naturally produce non-agentic animals? It mostly does, but those aren’t the ones we’re concerned with. The risk is all concentrated in the agentic animals.
Basically every animal ever is not agentic. I’ve studied animals for my entire career and I haven’t found an agentic animal yet. That doesn’t preclude them showing up in the future. We have reasons to believe that not only are agents possible, but they are likely.
Even if agentic animals showed up, they would be vastly outnumbered by all the other animals. We believe that agency will give the agentic animals such a drastic advantage, that they will seem to take over the world in a very short amount of time.
(Possible that this is in one of the things you cite, and either I missed it or I am failing to remember it)
Hacking the Transformer Prior
Neural Network Priors
I spend a bunch of time thinking about the alignment of the neural network prior for various architectures of neural networks that we expect to see in the future.
Whatever alignment failures are highly likely under the neural network prior are probably worth a lot of research attention.
Separately, it would be good to figure out knobs/levers for changing the prior distribution to be more aligned (or produce more aligned models). This includes producing more interpretable models.
Analogy to Software Development
In general, I am able to code better if I have access to a high quality library of simple utility functions. My goal here is to sketch out how we could do this for neural network learning.
Naturally Occurring Utility Functions
One way to think about the induction circuits found in the Transformer Circuits work is that they are "learned utility functions". I think this is the sort of thing we might want to provide the networks as part of a "hacked prior".
A Language for Writing Transformer Utility Functions
Thinking Like Transformers provides a programming language, RASP, which is able to express simple functions in terms of how they would be encoded in transformers.
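To give a flavor of what such a utility function looks like, here is a rough Python imitation of RASP's two core primitives, `select` and `aggregate`, used to write a "previous token" function (one ingredient of an induction head). This is a simplified sketch and not actual RASP syntax: the argument order, the `default` parameter, and the `previous_token` helper are assumptions made for illustration.

```python
from typing import Any, Callable, List, Sequence


def select(keys: Sequence[Any], queries: Sequence[Any],
           predicate: Callable[[Any, Any], bool]) -> List[List[bool]]:
    """Attention-pattern analogue: row q marks the key positions k for
    which predicate(keys[k], queries[q]) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector: List[List[bool]], values: Sequence[Any],
              default: Any = None) -> List[Any]:
    """Attention-output analogue: combine the selected values per query.

    One selected value is passed through unchanged, several numeric
    values are averaged, and `default` is used when nothing is selected.
    """
    out = []
    for row in selector:
        chosen = [v for picked, v in zip(row, values) if picked]
        if not chosen:
            out.append(default)
        elif len(chosen) == 1:
            out.append(chosen[0])
        else:
            out.append(sum(chosen) / len(chosen))
    return out


def previous_token(tokens: Sequence[Any]) -> List[Any]:
    """Hypothetical utility function: each position reads the token
    immediately before it (the first position gets the default)."""
    indices = list(range(len(tokens)))
    prev = select(indices, indices, lambda k, q: k == q - 1)
    return aggregate(prev, tokens)


if __name__ == "__main__":
    print(previous_token(list("abca")))
```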
Concrete Research Idea: Hacking the Transformer Prior
Use RASP (or something RASP-like) to write a bunch of utility functions (such as the induction head functions).
Train a language model where a small fraction of the neural network is initialized to your utility functions (and the rest is initialized normally).
Study how the model learns to use the programmed functions. Maybe also study how those functions change (or don't, if they're frozen).
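A minimal sketch of the initialization and freezing steps above, using a single linear layer in plain Python rather than a real transformer. Everything here is a hypothetical stand-in: `init_layer`, `sgd_step`, the Gaussian initialization scale, and the example hand-written row are assumptions for illustration, and a real experiment would compile RASP-like programs into attention and MLP weights instead.

```python
import random
from typing import Dict, List, Tuple


def init_layer(n_out: int, n_in: int, programmed: Dict[int, List[float]],
               seed: int = 0) -> Tuple[List[List[float]], List[bool]]:
    """Build a weight matrix where some rows are hand-programmed.

    `programmed` maps an output-row index to a hand-written weight row.
    All other rows are drawn from a Gaussian. Returns the weights and a
    per-row mask saying which rows are frozen.
    """
    rng = random.Random(seed)
    weights, frozen = [], []
    for row in range(n_out):
        if row in programmed:
            weights.append(list(programmed[row]))
            frozen.append(True)
        else:
            scale = 1.0 / n_in ** 0.5
            weights.append([rng.gauss(0.0, scale) for _ in range(n_in)])
            frozen.append(False)
    return weights, frozen


def sgd_step(weights: List[List[float]], grads: List[List[float]],
             frozen: List[bool], lr: float = 0.1) -> None:
    """In-place gradient step that skips the frozen (programmed) rows."""
    for row, (w_row, g_row) in enumerate(zip(weights, grads)):
        if frozen[row]:
            continue
        for col, g in enumerate(g_row):
            w_row[col] -= lr * g


if __name__ == "__main__":
    # Row 0 is a hand-written "difference of the first two inputs" unit.
    weights, frozen = init_layer(3, 4, {0: [1.0, -1.0, 0.0, 0.0]})
    sgd_step(weights, [[1.0] * 4 for _ in range(3)], frozen)
    print(weights[0], frozen)
```

Dropping the `frozen` check gives the other variant mentioned above, where the programmed functions are free to drift during training and the drift itself becomes something to study.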
I think this could be a way to iteratively build more and more interpretable transformers, in a loop where we:
If we have a neural network that is eventually entirely made up of human-programmed functions, we probably have an Ontologically Transparent Machine. (AN: I intend to write more thoughts on ontologically transparent machines in the near future)
I think there’s a lot going on with your conflating the speed prior over circuits with a speed prior over programs.
I think a lot of the ideas in this direction are either confused by the difference between circuit priors and program priors, or at least treating them as equivalent. Unfortunately a lot of this is vague until you start specifying the domain of model. I think specifying this more clearly will help in communicating about these ideas. To start with this myself, when I talk about circuit induction, I’m talking about things that look like large randomly initialized Bayes nets (or deep neural networks).
Program Induction Priors are bad: I would claim that any program induction prior (simplicity prior, speed prior, others) is almost always a bad fit for developing useful intuitions about the behavior of large random Bayes net machines.
Confusion between circuit induction speed prior and simplicity prior: I think your point about double descent is wrong — in particular, the speed is largely unchanged in double descent experiments, since the *width* is the parameter being varied, and all deep neural networks of the same depth have approximately the same speed (unless you mean something weird by speed).
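The width-versus-depth point can be checked with simple arithmetic. The sketch below assumes a plain fully connected network and reads "speed" as the number of sequential layer applications (circuit depth); both of those, and the `mlp_stats` helper, are assumptions for illustration. Total arithmetic work does grow with width, so this only supports the claim under the sequential-depth reading.

```python
from typing import Tuple


def mlp_stats(depth: int, width: int, d_in: int = 1, d_out: int = 1) -> Tuple[int, int]:
    """Return (parameter count, sequential layer applications) for an MLP
    with `depth` hidden layers of size `width`, counting weights and biases."""
    sizes = [d_in] + [width] * depth + [d_out]
    params = sum(a * b + b for a, b in zip(sizes, sizes[1:]))
    sequential_steps = len(sizes) - 1
    return params, sequential_steps


if __name__ == "__main__":
    for width in (4, 64, 1024):
        params, steps = mlp_stats(depth=2, width=width)
        print(f"width={width:5d}  params={params:8d}  sequential steps={steps}")
```

Varying `width` changes the parameter count by orders of magnitude while the number of sequential steps stays fixed, which is the sense in which width sweeps leave this notion of speed unchanged.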
Circuit Simplicity: You give circuit-size and circuit-depth as examples of a “speed prior”, which seems pretty nonstandard, especially when describing it as “not the simplicity prior”.
More than Speed and Simplicity: I think there are other metrics that provide interesting priors over circuits, like likelihood under some initialization distribution. In particular, I think “likelihood under the initialization distribution” is the prior that matters most, until we develop techniques that let us “hack the prior”.
Connection to Infinite-Size Neural Networks: I think research about neural networks approaching/at the infinite limit looks a lot like physics about black holes — and similarly can tell us interesting things about dynamics we should expect. In particular, for systems optimized by gradient descent, we end up with infinitesimal/nonexistent feature learning in the limit — which is interesting because all of the sub-modules/sub-circuits we start with are all we’ll ever have! This means that even if there are “simple” or “fast” circuits, if they’re not likely under the initialization distribution, then we expect they’ll have a vanishingly small effect on the output. (One way of thinking about this is in terms of the NTK, that even if we have extremely powerfully predictive modules, their predictive power will be overwhelmed by the much more common and simple features)
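For reference, the standard linearized ("lazy training") picture behind this claim can be written as follows. The notation is introduced here for illustration and is not from the original comment: $f(x;\theta)$ is the network output, $\theta_0$ the parameters at initialization, and $\Theta$ the neural tangent kernel.

```latex
f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0),
\qquad
\Theta(x, x') = \nabla_\theta f(x;\theta_0)^{\top}\, \nabla_\theta f(x';\theta_0)
```

In this regime the "features" $\nabla_\theta f(x;\theta_0)$ are fixed by the initialization, and training only reweights them, which is one way to read the statement that the sub-circuits we start with are all we will ever have.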
Hacking the Prior: Right now we don’t have a good understanding of the behavior of partially-hand coded neural networks, but I think they could serve as a new/distinct class of models (with regards to what functions are likely under the initialization distribution). Concretely, this could look like us “hand-programming” circuits or parts of neural networks, then randomly initializing the rest, and seeing whether during training the model learns to use those programmed functions.
Interpretability Challenges
Inspired by a friend, I've been thinking about how to launch/run interpretability competitions, and what the costs/benefits would be.
I like this idea a lot because it cuts directly at one of the hard problems of spinning up in interpretability research as a new person. The field is difficult and the objectives are vaguely defined; it's easy to accidentally trick yourself into seeing signal in noise, and there's never certainty that the thing you're looking for is actually there.
On the other hand, most of the interpretability-like interventions in models (e.g. knowledge edits/updates to transformers) make models worse and not better -- they usually introduce some specific and contained deficiency (e.g. predict that the Eiffel Tower is in Rome, Italy).
So the idea for Interpretability Challenges would be to use existing methods (or possibly invent new ones) to inject concrete "things to find" inside of models, release those models as challenges, and then give prizes for finding things.
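As a rough sketch of the organizer side of such a challenge, the toy below plants one edited association in a linear associative memory using a rank-one update, keeps the answer secret, and scores submissions. All names (`plant_edit`, `make_challenge`, `score`) and the setup are hypothetical. With a random weight matrix like this one the edit leaves no detectable trace, so a real challenge would need to plant the edit in a trained model.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(W: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def plant_edit(W: Matrix, key: List[float], new_value: List[float]) -> Matrix:
    """Rank-one update so that the edited matrix maps `key` to `new_value`
    while leaving inputs orthogonal to `key` unchanged."""
    current = matvec(W, key)
    norm_sq = sum(k * k for k in key)
    return [
        [w + (target - cur) * k / norm_sq for w, k in zip(row, key)]
        for row, target, cur in zip(W, new_value, current)
    ]


def make_challenge(dim: int = 6, seed: int = 0) -> Tuple[Matrix, int, List[float]]:
    """Return (released model, secret edited key index, planted value)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    secret = rng.randrange(dim)
    key = [1.0 if i == secret else 0.0 for i in range(dim)]
    new_value = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return plant_edit(W, key, new_value), secret, new_value


def score(submission: int, secret: int) -> bool:
    """A submission wins if it names the key that was edited."""
    return submission == secret


if __name__ == "__main__":
    released, secret, _ = make_challenge()
    print("correct guess wins:", score(secret, secret))
```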
Some ways this might work:
Please reach out to me if you're interested in helping with efforts like this.
I wish more of the language alignment research folks were looking into how current proposals for aligning transformers end up working on S4 models.
(I am one of said folks so maybe hypocritical to not work on it)
In particular it seems like there are ways in which they would be more interpretable than transformers:
It does seem like they're not likely to be competitive with transformers for short-context modeling anytime soon, but if they end up being differentially alignment-friendly, then we could instead try to make them more competitive.
(In general I think it's much easier to make an approach more competitive than it is to make it more aligned)