Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

[AN #167]: Concrete ML safety problems and their relevance to x-risk

I think we're just debating semantics of the word "assumption".

Consider the argument:

A superintelligent AI will be VNM-rational, and therefore it will pursue convergent instrumental subgoals

I think we both agree this is not a valid argument, or at least that it is missing some details about what the AI is VNM-rational over before it becomes valid. That's all I'm trying to say.


Unimportant aside on terminology: I think in colloquial English it is reasonable to say that this is "missing an assumption". I assume that you want to think of this as math. My best guess at how to turn the argument above into math would be something that looks like:
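(Schematically, with the ? standing in for the missing piece:)

$$\text{Superintelligent}(A) \;\wedge\; \text{VNM-rational}(A) \;\wedge\; ? \;\implies\; \text{PursuesConvergentInstrumentalSubgoals}(A)$$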

This still seems like it is "missing an assumption", since the thing filling the ? seems like an "assumption".


Maybe you're like "Well, if you start with the setup of an agent that satisfies the VNM axioms over state-based outcomes, then you really do just need VNM to conclude 'convergent instrumental subgoals', so there are no extra assumptions needed". I just don't start with such a setup; I'm always looking for arguments of the form "we have a non-trivial chance of building an agent that causes an existential catastrophe". (Maybe readers don't have the same inclination? That would surprise me, but it is possible.)

[AN #167]: Concrete ML safety problems and their relevance to x-risk

depending on what the agent is coherent over.

That's an assumption :P (And it's also not one that's obviously true, at least according to me.)

[AN #166]: Is it crazy to claim we're in the most important century?

Yeah, I agree the statement is false as I literally wrote it, though what I meant was that you could easily believe you are in the kind of simulation where there is no extraordinary impact to have.

Selection Theorems: A Program For Understanding Agents

Edited to

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values

and 

[...] the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state.

(For the second one, that's one of the reasons why I had the weasel word "could", but on reflection it's worth calling out explicitly, given that I mention it in the previous sentence.)

Selection Theorems: A Program For Understanding Agents

Thanks for this and for the response to my other comment; I understand where you're coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns).

As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don’t lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.
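(As a toy illustration of the Dutch book / money pump idea, here is a minimal sketch, assuming a made-up agent whose preferences over three outcomes are cyclic; a bookie can then drain resources from it indefinitely:)

```python
# Toy money pump: a hypothetical agent with cyclic preferences A > B > C > A
# will pay a small fee for every "upgrade" trade, so a bookie can cycle it
# forever, losing resources for no gain.

# Maps each item to the item it is strictly preferred over: A > B, B > C, C > A.
CYCLIC_PREFERENCE = {"A": "B", "B": "C", "C": "A"}

def accepts_trade(current_item: str, offered_item: str) -> bool:
    """The agent accepts a trade iff it strictly prefers the offered item."""
    return CYCLIC_PREFERENCE[offered_item] == current_item

def run_money_pump(start_item: str = "A", fee: float = 1.0, rounds: int = 9) -> float:
    """A bookie repeatedly offers whichever item the agent prefers, for a fee."""
    item, total_paid = start_item, 0.0
    for _ in range(rounds):
        # Offer the item that this agent prefers over what it currently holds.
        offer = next(x for x, worse in CYCLIC_PREFERENCE.items() if worse == item)
        if accepts_trade(item, offer):
            item, total_paid = offer, total_paid + fee
    return total_paid  # grows linearly with rounds: a guaranteed loss

print(run_money_pump())  # 9.0 -- the agent pays 9 units and ends up holding "A" again
```

Selecting against this kind of exploitability is exactly the selection pressure the argument above relies on.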

Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).
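(And a quick sketch of the Kelly point, under the standard simple setup of repeated bets at b-to-1 odds with win probability p; the fraction of wealth that maximizes expected log money is the Kelly fraction p − (1 − p)/b:)

```python
import math

# For a bet paying b-to-1 with win probability p, betting a fraction f of wealth
# gives expected log growth  p*log(1 + b*f) + (1 - p)*log(1 - f),
# which is maximized at the Kelly fraction  f* = p - (1 - p)/b.

def expected_log_growth(f: float, p: float, b: float) -> float:
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)

def kelly_fraction(p: float, b: float) -> float:
    return p - (1 - p) / b

p, b = 0.6, 1.0  # made-up numbers: 60% chance to win an even-money bet
f_star = kelly_fraction(p, b)  # 0.2
print(f_star, expected_log_growth(f_star, p, b))
# Betting less (0.1) or more (0.4) than f* gives strictly lower expected log growth:
print(expected_log_growth(0.1, p, b), expected_log_growth(0.4, p, b))
```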

The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.

New opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough; you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed.

Selection Theorems: A Program For Understanding Agents

Planned summary for the Alignment Newsletter:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because they tell us likely properties of the agents we build.

As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any non-dominated agent can be represented as maximizing expected utility. (What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.) If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.

Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).

The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.

Planned opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough; you need to combine them with some other assumption (for example, that there is a money-like resource over which the agent has no terminal preferences). Similarly, I don’t expect this research agenda to find a selection theorem that says that an existential catastrophe occurs _assuming only that the agent is intelligent_, but I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, because we think the assumptions involved in the theorems are quite likely to hold. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I would not actively discourage anyone from doing this sort of research, and I think it would be more useful than other types of research I have seen proposed.

Selection Theorems: A Program For Understanding Agents

At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment.

The former statement makes sense, but can you elaborate on the latter statement? I suppose I could imagine selection theorems revealing that we really do get alignment by default, but I don't see how they quickly lead to solutions to AI alignment if there is a problem to solve.

Brain-inspired AGI and the "lifetime anchor"

ASSUMPTION 1: There’s a “secret sauce” of human intelligence, and it looks like a learning algorithm (and associated inference algorithm).

ASSUMPTION 2: It’s a fundamentally different learning algorithm from deep neural networks. I don’t just mean a different neural network architecture, regularizer, etc. I mean really different, like “involving probabilistic program inference algorithms” or whatever.

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

ASSUMPTION 4: We'll eventually figure out this “secret sauce” and get Transformative AI (TAI).

These seem like clearly the load-bearing parts of the argument; I agree the stuff you listed follows from these assumptions, but why should these assumptions be true?

I can imagine justifying assumption 2, and maybe also assumption 1, using biology knowledge that I don't have. I don't see how you justify assumptions 3 and 4. Note that assumption 4 also needs to include the claim that we figure out this "secret sauce" before other paths get us to AGI, despite lots of effort already being put into them.

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Hmm, you might want to reach out to CHAI folks, though I don't have a specific person in mind at the moment. (I myself am working on different things now.)

Distinguishing AI takeover scenarios

Planned summary for the Alignment Newsletter:

This post summarizes several AI takeover scenarios that have been proposed, and categorizes them according to three main variables. **Speed** refers to the question of whether there is a sudden jump in AI capabilities. **Uni/multipolarity** asks whether a single AI system takes over, or many. **Alignment** asks what goals the AI systems pursue, and if they are misaligned, further asks whether they are outer or inner misaligned. The authors also analyze other properties of the scenarios, such as how agentic, general, and/or homogeneous the AI systems are, and whether the AI systems coordinate with each other or not. A [followup post](https://www.alignmentforum.org/posts/zkF9PNSyDKusoyLkP/investigating-ai-takeover-scenarios) investigates social, economic, and technological characteristics of these scenarios. It also generates new scenarios by varying some of these factors.
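(If it helps to see the categorization as a data structure, here is a hypothetical encoding; the field and enum names are mine, not the authors'.)

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical encoding of the categorization; names are illustrative only.

class Speed(Enum):
    SUDDEN_JUMP = auto()   # discontinuous jump in AI capabilities
    CONTINUOUS = auto()

class Polarity(Enum):
    UNIPOLAR = auto()      # a single AI system takes over
    MULTIPOLAR = auto()    # many AI systems

class Alignment(Enum):
    ALIGNED = auto()
    OUTER_MISALIGNED = auto()
    INNER_MISALIGNED = auto()

@dataclass
class TakeoverScenario:
    name: str
    speed: Speed
    polarity: Polarity
    alignment: Alignment
    # Additional properties the posts analyze:
    agentic: bool = True
    general: bool = True
    homogeneous: bool = False
    ais_coordinate: bool = False

# Illustrative placeholder, not a claim about any specific proposed scenario:
example = TakeoverScenario(
    name="example scenario",
    speed=Speed.CONTINUOUS,
    polarity=Polarity.MULTIPOLAR,
    alignment=Alignment.INNER_MISALIGNED,
)
```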

Since these posts are themselves summaries and comparisons of previously proposed scenarios that we’ve covered in this newsletter, I won’t summarize them here, but I do recommend them for an overview of AI takeover scenarios.
