Why Not Just... Build Weak AI Tools For AI Alignment Research?

johnswentworth

“Weak” cognitive tools are clearly a thing, and are useful. Google search is a fine example. There are plenty of flavors of “weak AI” which are potentially helpful for alignment research in a similar way to Google search.

In principle, I think there’s room for reasonably-large boosts to alignment research from such tools^[1]. Alas, the very large majority of people who I hear intend to build such tools do not have the right skills/background to do so (at least not for the high-value versions of the tools). Worse, I expect that most people who aim to build such tools are trying to avoid the sort of work they would need to do to build the relevant skills/background.

Analogy: A Startup Founder’s Domain Expertise (Or Lack Thereof)

Imagine a startup building tools meant to help biologists during their day-to-day work in the wetlab. I expect domain expertise to matter a lot here: I would guess that if none of the founders have ample personal experience doing research work in a wetlab, the chance of this startup building an actually-highly-useful wetlab product drops by about an order of magnitude. Our hypothetical startup might still “succeed” some other way, e.g. by pivoting to something else, or by being good at pitching their shitty product to managers who make purchasing decisions without actually using the product, or by building something very marginally useful and pricing it very cheaply. But their chance of building a wetlab product which actually provides a lot of value is pretty slim.

One might reply: but couldn’t hypothetical founders without domain experience do things to improve their chances? For instance, they could do a bunch of user studies on biologists working in wetlabs, and they could deploy the whole arsenal of UX study techniques intended to distinguish things-users-say-matter from things-which-actually-matter-to-users.

… and my response is that I was already assuming our hypothetical founders do that sort of thing. If the founders don’t have much domain experience themselves, and don’t do basic things like lots of user studies, then I’d guess their chance of building an actually-high-value wetlab product drops by two or three orders of magnitude, not just one order of magnitude. At that point it’s entirely plausible that we’d have to go through hundreds to a thousand times as many startups to find one that succeeded at building a high-value product.

How is this analogous to plans to build AI tools for alignment research?

So we want to build products (specifically AI products) to boost alignment research. The products need to help solve the hard parts of aligning AI, not just easy things where we can clearly see what’s going on and iterate on it, not just problems which are readily legible or conceptually straightforward. Think problems like e.g. sharp left turn, deception, getting what we measure, or at a deeper level the problem of fully updated deference, the pointers problem, value drift under self-modification, or ontology identification. And the tools need to help align strong AI; the sort of hacky tricks which fall apart under a few bits of optimization pressure are basically irrelevant at that point. (Otherwise the relevant conversation to have is not about how the tools will be useful, but about how whatever thing the tools are building will be useful.)

The problem for most people who aim to work on AI tools for alignment research is that they have approximately-zero experience working on those sorts of problems. Indeed, as far as I can tell, people usually turn to tool-building as a way to avoid working on the hard problems.

I expect failure modes here to mostly look like solving the wrong problems, i.e. not actually addressing bottlenecks. Here are some concrete examples, ordered by how well the tool-builders understand what the bottlenecks are (though even the last still definitely constitutes a failure):

Tool-builder: “We made a wrapper for an LLM so you can use it to babble random ideas!”
Me: “If ever there was a thing we’re not bottlenecked on, it’s random ideas. Also see Lessons from High-Dimensional Optimization.”
Tool-builder: “We made an IDE with a bunch of handy AI integration!”
Me: “Cool, that’s at least useful. I’m not very bottlenecked on coding - coding up an experiment takes hours to days, while figuring out what experiment would actually be useful takes weeks to months - but I’ll definitely check out this IDE.”
Tool-builder: “We made an AI theorem-prover!”
Me: “Sometimes handy, but most of my work is figuring out the right operationalizations (e.g. mathematical definitions or quantitative metrics) to use. A theorem-prover can tell me that my operationalization won’t let me prove the things I want to prove, but won’t tell me how close I am, so limited usefulness.”
Tool-builder: “We heard you’re bottlenecked on operationalizing intuitions, so we made an operationalizationator! Give it your intuitive story, and it will spit out a precise operationalization.”
Me: *looks at some examples* “These operationalizations are totally ad-hoc. Whoever put together the fine-tuning dataset didn’t have any idea what a robust operationalization looks like, did they?”

Note that, by the time we get to the last example on that list, domain experience is very much required to fix the problem, since the missing piece is relatively illegible.

Thus, my current main advice for people hoping to build AI tools for boosting alignment research: go work on the object-level research you’re trying to boost for a while. Once you have a decent amount of domain expertise, once you have made any progress at all (and therefore have any first-hand idea of what kinds of things even produce progress), then you can maybe shift to the meta-level^[2].

Metascience

Another way to come at the problem: rather than focus on tool-building for alignment research specifically, one might focus on tool-building for preparadigmatic science in general. And then the tool-builder’s problem is to understand how preparadigmatic science works well in general - i.e. metascience.

Of course, as with the advice in the previous section, the very first thing which a would-be metascientist of preparadigmatic fields should do is go get some object-level experience in a preparadigmatic field. Or, preferably, a few different preparadigmatic fields.

Then, the metascientist needs to identify consistent research bottlenecks across those fields, and figure out tools to address those bottlenecks.

How hard is this problem? Well, for comparison… I expect that “produce tools to 5-10x the rate of progress on core alignment research via a metascience-style approach” involves most of the same subproblems as “robustly and scalably produce new very-top-tier researchers”. And when I say “very-top-tier” here, I’m thinking e.g. pre-WWII Nobel winners. (Pre-WWII Nobel winners are probably more than 5-10x most of today’s alignment researchers, but tools are unlikely to close that whole gap themselves, or even most of the gap.) So, think about how hard it would be to e.g. produce 1000 pre-WWII-Nobel-winner level researchers. The kinds of subproblems involved are about things like e.g. noticing the load-bearing illegible skills which make those pre-WWII Nobel winners as good at research as they are, making those skills legible, and replicating them. That’s the same flavor of challenge required to produce tools which can 5-10x researcher productivity.

So it’s a hard problem. I am, of course, entirely in favor of people tackling hard problems.

For an example of someone going down this path in what looks to me like a relatively competent way, I'd point to some of Conjecture's work. (I don't know of a current write-up, but I had a conversation with Gabe about it which will hopefully be published within the next few weeks; I'll link it from here once it's up.) I do think the folks at Conjecture underestimate the difficulty of the path they're on, but they at least understand what the path is and are asking the right kind of questions.

What Tools I Would Build

To wrap up, I'll talk a bit about my current best guesses about what "high-value cognitive tools" would look like. Bear in mind that I think these guesses are probably wrong in lots of ways. At a meta level, cognitive tool-building is very much the sort of work where you should pick one or a handful of people to build the prototype for, focus on making those specific people much more productive, and get a fast feedback loop going that way. That's how wrong initial guesses turn into better later guesses.

Anyway, current best guesses.

I mentioned that the subproblems of designing high-value cognitive tools overlap heavily with the subproblems of creating new top-tier researchers. Over the past ~2 years I've been working on training people, generally trying to identify the mostly-illegible skills which make researchers actually useful and figure out how to install those skills. One key pattern I've noticed is that a lot of the key skills are "things people track in their head" - e.g. picturing an example while working through some math, or tracking the expected mental state of a hypothetical reader while writing, or tracking which constraints are most important while working through a complicated problem. Such skills have low legibility because they're mostly externally invisible - they're in an expert's head, after all. But once we know to ask about what someone is tracking in their head, it's often pretty easy and high-value for the expert to explain it.

... and once I'm paying attention to what-things-I'm-tracking-in-my-head, I also notice how valuable it is to externalize that tracking. If the tracked-information is represented somewhere outside my head, then (a) it frees up a lot of working memory and lets me track more things, and (b) it makes it much easier to communicate what I'm thinking to others.

So if I were building research tools right now, the first thing I'd try is ways to externalize the things strong researchers track in their heads. For instance:

Imagine a tool in which I write out mathematical equations on the left side, and an AI produces prototypical examples, visuals, or stories on the right, similar to what a human mathematician might do if we were to ask what the mathematician were picturing when looking at the math. (Presumably the interface would need a few iterations to figure out a good way to adjust the AI's visualization to better match the user's.)
Imagine a similar tool in which I write on the left, and on the right an AI produces pictures of "what it's imagining when reading the text". Or predicted emotional reactions to the text, or engagement level, or objections, etc.
Debugger functionality in some IDEs shows variable-values next to the variables in the code. Imagine that, except with more intelligent efforts to provide useful summary information about the variable-values. E.g. instead of showing all the values in a big tensor, it might show the dimensions. Or it might show Fermi estimates of runtime of different chunks of the code.
Similarly, in an environment for writing mathematics, we could imagine automated annotation with asymptotic behavior, units, or example values. Or a sidebar with an auto-generated stack trace showing how the current piece connects to everything else I'm working on.
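As a rough illustration of the "show the dimensions instead of all the values" idea from the debugger example, here is a minimal Python sketch. The names `nested_shape` and `summarize` are hypothetical and not taken from any existing tool, and plain nested lists stand in for tensors; a real version would presumably hook into a debugger and understand actual array types.

```python
from __future__ import annotations

from typing import Any


def nested_shape(value: Any) -> tuple[int, ...] | None:
    """Shape of a rectangular nested list; None if the nesting is ragged.

    Non-list values count as scalars and have the empty shape ().
    """
    if not isinstance(value, list):
        return ()
    if not value:
        return (0,)
    child_shapes = [nested_shape(item) for item in value]
    first = child_shapes[0]
    if first is None or any(shape != first for shape in child_shapes):
        return None
    return (len(value),) + first


def summarize(name: str, value: Any, max_len: int = 40) -> str:
    """One-line summary of a variable, preferring dimensions over raw values."""
    if isinstance(value, list):
        shape = nested_shape(value)
        if shape is None:
            return f"{name}: ragged list, len={len(value)}"
        return f"{name}: list, shape={'x'.join(map(str, shape))}"
    text = repr(value)
    if len(text) > max_len:
        # Truncate long reprs so the annotation stays on one line.
        text = text[:max_len] + "..."
    return f"{name}: {text}"


# Hypothetical variables, as a debugger might find them in a stack frame.
frame_locals = {"weights": [[0.0] * 3 for _ in range(2)], "step": 7}
for var_name, var_value in frame_locals.items():
    print(summarize(var_name, var_value))
```

The point of the sketch is only that the summary is chosen per value type; picking which summary a researcher would actually want to see is the part that needs domain experience.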

^[1]
A side problem which I do not think is the main problem for “AI tools for AI alignment” approaches: there is a limit to how much of a research productivity multiplier we can get from Google-search-style tools. Google search is helpful, but it’s not a 100x on research productivity (as evidenced by the lack of a 100x jump in research productivity shortly after Google came along). Fundamentally, a key part of what makes such tools “tools” is that most of the key load-bearing cognition still “routes through” a human user; thus the limit on how much of a productivity boost they could yield. But I do find a 2x boost plausible, or maybe 5-10x on the optimistic end. The more-optimistic possibilities in that space would be a pretty big deal.
(Bear in mind here that the relevant metric is productivity boost on rate-limiting steps of research into hard alignment problems. It’s probably easier to get a 5-10x boost to e.g. general writing or coding speed, but I don’t think those are rate-limiting for progress on the core hard problems.)
^[2]
A sometimes-viable alternative is to find a cofounder with the relevant domain experience. Unfortunately the alignment field is SUPER bottlenecked right now on people with experience working on the hard central problems; the number of people with more than a couple years experience is in the low dozens at most. Also, the cofounder does need to properly cofound; a mere advisor with domain experience is not a very good substitute. So a low time commitment from the domain expert probably won’t cut it.

Thane Ruthenis (3y)

Disclaimer: Haven't actually tried this myself yet, naked theorizing.

“We made a wrapper for an LLM so you can use it to babble random ideas!”

I'd like to offer a steelman of that idea. Humans have negative creativity — it takes conscious effort to come up with novel spins on what you're currently thinking about. An LLM babbling about something vaguely related to your thought process can serve as a source of high-quality noise, noise that is both sufficiently random to spark novel thought processes and relevant enough to prompt novel thoughts on the actual topic you're thinking about (instead of sending you off in a completely random direction). Tools like Loom seem optimized for that.

It's nothing a rubber duck or a human conversation partner can't offer, qualitatively, but it's more stimulating than the former, and is better than the latter in that it doesn't take up another human's time and is always available to babble about what you want.

Not that it'd be a massive boost to productivity, but might lower friction costs on engaging in brainstorming, make it less effortful.

... Or it might degrade your ability to think about the subject matter mechanistically and optimize your ideas in the direction of what sounds like it makes sense semantically. Depends on how seriously you'd be taking the babble, perhaps.

Thane Ruthenis (3y)

Me: *looks at some examples* “These operationalizations are totally ad-hoc. Whoever put together the fine-tuning dataset didn’t have any idea what a robust operationalization looks like, did they?”

... So maybe we should fund an effort to fine-tune some AI model on a carefully curated dataset of good operationalizations? Not convinced building it would require alignment research expertise specifically, just "good at understanding the philosophy of math" might suffice.

Finding the right operationalization is only partly intuition; partly it's just knowing what sorts of math tools are available, that is, what exists in concept-space and has already been discovered. That part basically requires having a fairly legible high-level mental map of the entire space of mathematics, and building it is very effortful, takes many years, and has very little return on learning any specific piece of math.

At least, it's definitely something I'm bottlenecked on, and IIRC even the Infra-Bayesianism people ended up deriving from scratch a bunch of math that later turned out to be already known as part of imprecise probability theory. So it may be valuable to get some sort of "intelligent applied-math wiki" that babbles possible operationalizations at you/points you towards math-fields that may have the tools for modeling what you're trying to model.

That said, I broadly agree that the whole "accelerate alignment research via AI tools" approach doesn't seem very promising, in either the Cyborgism or the Conditioning Generative Models directions. Not that I see any fundamental reason why pre-AGI AI tools can't be somehow massively helpful for research; on the contrary, it feels like there ought to be some way to loop them in. But it sure seems trickier than it looks at first or second glance.
