Richard_Ngo — AI Alignment Forum

Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.

By thinking about reward in this way, I was able to predict^[1] and encourage the success of this research direction.

Congratulations on doing this :) More specifically, I think there are two parts of making predictions: identifying a hypothesis at all, and then figuring out how likely the hypothesis is to be true or false. The former part is almost always the hard part, and that's the bit where the "reward reinforces previous computations" frame was most helpful.

(I think Oliver's pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.)

Ty for the reply. A few points in response:

Of course, you might not know which problem your insights allow you to solve until you have the insights. I'm a big fan of constructing stylized problems that you can solve, after you know which insight you want to validate.
That said, I think it's even better if you can specify problems in advance to help guide research in the field. The big risk, then, is that these problems might not be robust to paradigm shifts (because paradigm shifts could change the set of important problems). If that is your concern, then I think you should probably give object-level arguments that solving auditing games is a bad concrete problem to direct attention to. (Or argue that specifying concrete problems is in general a bad thing.)

The bigger the scientific advance, the harder it is to specify problems in advance which it should solve. You can and should keep track of the unresolved problems in the field, as Neel does, but trying to predict specifically which unresolved problems in biology Darwinian evolution would straightforwardly solve (or which unresolved problems in physics special relativity would straightforwardly solve) is about as hard as generating those theories in the first place.

I expect that when you personally are actually doing your scientific research you are building sophisticated mental models of how and why different techniques work. But I think that in your community-level advocacy you are emphasizing precisely the wrong thing—I want junior researchers to viscerally internalize that their job is to understand (mis)alignment better than anyone else does, not to optimize on proxies that someone else has designed (which, by the nature of the problem, are going to be bad proxies).

It feels like the core disagreement is that I intuitively believe that bad metrics are worse than no metrics, because they actively confuse people/lead them astray. More specifically, I feel like your list of four problems is closer to a list of things that we should expect from an actually-productive scientific field, and getting rid of them would neuter the ability for alignment to make progress:

"Right now, by default research projects get one bit of supervision: After the paper is released, how well is it received?" Not only is this not one bit, I would also struggle to describe any of the best scientists throughout history as being guided primarily by it. Great researchers can tell by themselves, using their own judgment, how good the research is (and if you're not a great researcher that's probably the key skill you need to work on).

But also, note how anti-empiricism your position is. The whole point of research projects is that they get a huge amount of supervision from reality. The job of scientists is to observe that supervision from reality and construct theories that predict reality well, no matter what anyone else thinks about them. It's not an exaggeration to say that discarding the idea that intellectual work should be "supervised" by one's peers is the main reason that science works in the first place (see Strevens for more).
"Lacking objective, consensus-backed progress metrics, the field is effectively guided by what a small group of thought leaders think is important/productive to work on." Science works precisely because it's not consensus-backed—see my point on empiricism above. Attempts to make science more consensus-backed undermine the ability to disagree with existing models/frameworks. But also: the "objective metrics" of science are the ability to make powerful, novel predictions in general. If you know specifically what metrics you're trying to predict, the thing you're doing is engineering. And some people should be doing engineering (e.g. engineering better cybersecurity)! But if you try to do it without a firm scientific foundation you won't get far.
I think it's good that "junior researchers who do join are unsure what to work on." It is extremely appropriate for them to be unsure what to work on, because the field is very confusing. If we optimize for junior researchers being more confident on what to work on, we will actively be making them less truth-tracking, which makes their research worse in the long term.
Similarly, "it’s hard to tell which research bets (if any) are paying out and should be invested in more aggressively" is just the correct epistemic state to be in. Yes, much of the arguing is unproductive. But what's much less productive is saying "it would be good if we could measure progress, therefore we will design the best progress metric we can and just optimize really hard for that". Rather, since evaluating the quality of research is the core skill of being a good scientist, I am happy with junior researchers all disagreeing with each other and just pursuing whichever research bets they want to invest their time in (or the research bets they can get the best mentorship when working on).
Lastly, it's also good that "it’s hard to grow the field". Imagine talking to Einstein and saying "your thought experiments about riding lightbeams are too confusing and unquantifiable—they make it hard to grow the field. You should pick a metric of how good our physics theories are and optimize for that instead." Whenever a field is making rapid progress it's difficult to bridge the gap between the ontology outside the field and the ontology inside the field. The easiest way to close that gap is simply for the field to stop making rapid progress, which is what happens when something becomes a "numbers-go-up" discipline.

I think that e.g. RL algorithms researchers have some pretty deep insights about the nature of exploration, learning, etc.

They have some. But so did Galileo. If you'd turned physics into a numbers-go-up field after Galileo, you would have lost most of the subsequent progress, because you would've had no idea which numbers going up would contribute to progress.

I'd recommend reading more about the history of science, e.g. The Sleepwalkers by Koestler, to get a better sense of where I'm coming from.

I strongly disagree. "Numbers-Go-Up Science" is an oxymoron: great science (especially what Kuhn calls revolutionary science) comes from developing novel models or ontologies which can't be quantitatively compared to previous ontologies.

Indeed, in an important sense, the reason the alignment problem is a big deal in the first place is that ML isn't a science which tries to develop deep explanations of artificial cognition, but instead a numbers-go-up discipline.

And so the idea of trying to make (a subfield of) alignment more like architecture design, performance optimization or RL algorithms feels precisely backwards—it steers people directly away from the thing that alignment research should be contributing.

I suspect that many of the things you've said here are also true for humans.

That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out what that means that as they learn more about the world.

Another way of putting this point: being pulled from the void is not a feature of LLM personas. It's a feature of personas. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling) and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.

What's the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the "pulling from the void" thing should be seen less as an odd thing that we're doing with AIs, and more as a claim about the nature of minds in general.

This feels related to the predictive processing framework, in which the classifications of one model are then predicted by another.

More tangentially, I've previously thought about merging cognitive biases and values, since we can view both of them as deviations from the optimal resource-maximizing policy. For example, suppose that I am willing to bet even when I am being Dutch booked. You could think of that as a type of irrationality, or you could think of it as an expression of me valuing being Dutch booked, and therefore being willing to pay to experience it.

This is related to the Lacanian/"existential kink" idea that most dysfunctions are actually deliberate, caused by subagents that are trying to pursue some goal at odds with the rest of your goals.

I think my thought process when I typed "risk-averse money-maximizer" was that an agent could be risk-averse (in which case it wouldn't be an EUM) and then separately be a money-maximizer.

But I didn't explicitly think "the risk-aversion would be with regard to utility not money, and risk-aversion with regard to money could still be risk-neutral with regard to utility", so I appreciate the clarification.

Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2.

In other words, your example rebuts the claim that an EUM can't prefer a probabilistic mixture of two options to the expectation of those two options. But that's not the claim I made.

Hmm, this feels analogous to saying "companies are an unnecessary abstraction in economic theory, since individuals could each make separate contracts about how they'll interact with each other. Therefore we can reduce economics to studying isolated individuals".

But companies are in fact a very useful unit of analysis. For example, instead of talking about the separate ways in which each person in the company has committed to treating each other person in the company, you can talk about the HR policy which governs all interactions within the company. You might then see emergent effects (like political battles over what the HR policies are) which are very hard to reason about when taking a single-agent view.

Similarly, although in principle you could have any kind of graph of which agents listen to which other agents, in practice I expect that realistic agents will tend to consist of clusters of agents which all "listen to" each other in some ways. This is both because clustering is efficient (hence animals having bodies made up of clusters of cells; companies being made of clusters of individuals; etc) and because when you even define what counts as a single agent, you're doing a kind of clustering. That is, I think that the first step of talking about "individual rationality" is implicitly defining which coalitions qualify as individuals.

a superintelligent AI probably has a pretty good guess of the other AI's real utility function based on its own historical knowledge, simulations, etc.

This seems very unclear to me—in general it's not easy for agents to predict the goals of other agents with their own level of intelligence, because the amount of intelligence aimed at deception increases in proportion to the amount of intelligence aimed at discovering that deception.

(You could look at the AI's behavior from when it was less intelligent, but then—as with humans—it's hard to distinguish sincere change from improvement at masking undesirable goals.)

But regardless, that's a separate point. If you can do that, you don't need your mechanism above. If you can't, then my objection still holds.

One argument for being optimistic: the universe is just very big, and there's a lot to go around. So there's a huge amount of room for positive-sum bargaining.

Another: at any given point in time, few of the agents that currently exist would want their goals to become significantly simplified (all else equal). So there's a strong incentive to coordinate to reduce competition on this axis.

Lastly: if at each point in time, the set of agents who are alive are in conflict with potentially-simpler future agents in a very destructive way, then they should all just Do Something Else. In particular, if there's some decision-theoretic argument roughly like "more powerful agents should continue to spend some of their resources on the values of their less-powerful ancestors, to reduce the incentives for inter-generational conflict", even agents with very simple goals might be motivated by it. I call this "the generational contract".

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments