Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter


[AN #121]: Forecasting transformative AI timelines using biological anchors

I predict that this will not lead to transformative AI; I don't see how an algorithmic trading system leads to an impact on the world comparable to the industrial revolution.

You can tell a story in which you get an Eliezer-style near-omniscient superintelligent algorithmic trading system that reshapes the world because it is a superintelligence, and in which the researchers believed it was not a superintelligence and so assumed that the downside risk was bounded. But both clauses (an Eliezer-style superintelligence and researchers being horribly miscalibrated) seem unlikely to me.

[AN #121]: Forecasting transformative AI timelines using biological anchors

Sure, I mean, logistic regression has had economic value and it doesn't seem meaningful to me to say whether it is "aligned" or "inner aligned". I'm talking about transformative AI systems, where downside risk is almost certainly not limited.

[AN #120]: Tracing the intellectual roots of AI and AI alignment

But it does differ from behavioral cloning in that they stratify the samples

Fair point. In my ontology, "behavior cloning" is always with respect to some expert distribution, so I see the stratified samples as "several instances of behavior cloning with different expert distributions", but that isn't a particularly normal or accepted ontology.

Personally, I would've trained a single conditional model with a specified player-Elo for each move

Yeah, it does seem like this would have worked better -- if nothing else, the predictions could be more precise (rather than specifying the bucket in which the current player falls, you can specify their exact Elo).

[AN #120]: Tracing the intellectual roots of AI and AI alignment

they compare DDLUS to a 2018 paper (DIAYN)

Note the paper itself is from July 2019. (Not everything in the newsletter is the latest news.)

I also wonder if different techniques do better on Atari vs. MuJoCo environments for "unprincipled" reasons that make apples-to-apples comparisons difficult for techniques developed by different groups.

That seems quite likely to me, but one would hope that a good method also works in situations it wasn't designed for, so this still seems like a reasonable evaluation to me.

AGI safety from first principles: Introduction
  • "that is, without relying on other people’s arguments" doesn't feel quite right to me, since obviously a bunch of these arguments have been made before. It's more like: without taking any previous claims for granted.

Changed, though the way I use words, those phrases mean the same thing.

Your list of 3 differs from my list of 3.

Yeah this was not meant to be a direct translation of your list. (Your list of 3 is encompassed by my first and third point.) You mentioned six things:

more compute, better algorithms, and better training data


replication, cultural learning, and recursive improvement

which I wanted to condense. (The model size point was meant to capture the compute case.) I did have a lot of trouble understanding what the point of that section was, though, so it's plausible that I've condensed it poorly for whatever point you were making there.

Perhaps the best solution is to just delete that particular paragraph? As far as I can tell, it's not relevant to the rest of the arguments, and this summary is already fairly long and somewhat disjointed.

“Unsupervised” translation as an (intent) alignment problem

Planned summary for the Alignment Newsletter:

We have previously seen that a major challenge for alignment is that our models may learn <@inaccessible information@>(@Inaccessible information@) that we cannot extract from them, because we do not know how to provide a learning signal to train them to output such information. This post proposes unsupervised translation as a particular concrete problem to ground this out.

Suppose we have lots of English text, and lots of Klingon text, but no translations from English to Klingon (or vice versa), and no bilingual speakers. If we train GPT on the text, it will probably develop a good understanding of both English and Klingon, such that it “should” have the ability to translate between the two (at least approximately). How can we get it to actually (try to) do so? Existing methods (both in unsupervised translation and in AI alignment) do not seem to meet this bar.

One vague hope is that we could train a helper agent such that a human can perform next-word prediction on Klingon with the assistance of the helper agent, using a method like the one in Learning the prior (AN #109).

Clarifying “What failure looks like” (part 1)

Planned summary for the Alignment Newsletter:

The first scenario outlined in <@What failure looks like@> stems from a failure to specify what we actually want, so that we instead build AI systems that pursue proxies of what we want. As AI systems become responsible for more of the economy, human values become less influential relative to the proxy objectives the AI systems pursue, and as a result we lose control over the future. This post aims to clarify whether such a scenario leads to _lock in_, where we are stuck with the state of affairs and cannot correct it to get “back on course”. It identifies five factors which make this more likely:

1. _Collective action problems:_ Many human institutions will face competitive (short-term) pressures to deploy AI systems with bad proxies, even if it isn’t in humanity’s long-term interest.

2. _Regulatory capture:_ Influential people (such as CEOs of AI companies) may benefit from AI systems that optimize proxies, and so oppose measures to fix the issue (e.g. by banning such AI systems).

3. _Ambiguity:_ There may be genuine ambiguity about whether it is better to have these AI systems that optimize for proxies, even from a long-term perspective, especially because all clear and easy-to-define metrics will likely be going up (since those can be turned into proxy objectives).

4. _Dependency:_ AI systems may become so embedded in society that society can no longer function without them.

5. _Opposition:_ The AI systems themselves may oppose any fixes we propose.

We can also look at historical precedents. Climate change has been exacerbated by factors 1-3, though if it does lead to lock in that will be “because of physics” unlike the case with AI. The agricultural revolution, which arguably made human life significantly worse, still persisted thanks to its productivity gains (factor 1) and the loss of hunter-gathering skills (factor 4). When the British colonized New Zealand, the Maori people lost significant control over their future, because each individual chief needed guns (factor 1), trading with the British genuinely made them better off initially (factor 3), and eventually the British turned to manipulation, confiscation and conflict (factor 5).

With AI in particular, we might expect that an increase in misinformation and echo chambers exacerbates ambiguity (factor 3), and that due to its general-purpose nature dependency (factor 4) may be more of a risk.

The post also suggests some future directions for estimating the _severity_ of lock in for this failure mode.

Planned opinion:

I think this topic is important and the post did it justice. I feel like factors 4 and 5 (dependency and opposition) capture the reasons I expect lock in, with factors 1-3 as less important but still relevant mechanisms. I also really liked the analogy with the British colonization of New Zealand -- it felt like it was in fact quite analogous to how I’d expect this sort of failure to happen.

Random note: initially I thought this post was part 1 of N, and only later did I realize the "part 1" was a modifier to "what failure looks like". That's partly why it wasn't summarized till now -- I was waiting for future parts to show up.

AGI safety from first principles: Introduction

Planned summary of this sequence for the Alignment Newsletter:

This sequence presents the author’s personal view on the current best arguments for AI risk, explained from first principles (that is, without taking any previous claims for granted). The argument is a specific instantiation of the _second species argument_ that sufficiently intelligent AI systems could become the most intelligent species, in which case humans could lose the ability to create a valuable and worthwhile future.

We should clarify what we mean by superintelligence, and how it might arise. The author considers intelligence as quantifying simply whether a system “could” perform a wide range of tasks, separately from whether it is motivated to actually perform those tasks. In this case, we could imagine two rough types of intelligence. The first type, epitomized by most current AI systems, trains an AI system to perform many different tasks, so that it is then able to perform all of those tasks; however, it cannot perform tasks it has not been trained on. The second type, epitomized by human intelligence and <@GPT-3@>(@Language Models are Few-Shot Learners@), trains AI systems in a task-agnostic way, such that they develop general cognitive skills that allow them to solve new tasks quickly, perhaps with a small amount of training data. This second type seems particularly necessary for tasks where data is scarce, such as the task of being a CEO of a company. Note that these two types should be thought of as defining a spectrum, not a binary distinction, since the type of a particular system depends on how you define your space of “tasks”.

How might we get AI systems that are more intelligent than humans? Assuming we can get a model to human-level intelligence, AI systems have three key advantages that would allow them to go further. First, they can be easily replicated, suggesting that we could get a _collective_ superintelligence via a collection of replicated AI systems working together and learning from each other. Second, there are no limits imposed by biology, and so we can e.g. make the models arbitrarily large, unlike with human brains. Finally, the process of creation of AI systems will be far better understood than that of human evolution, and AI systems will be easier to directly modify, allowing for AI systems to recursively improve their own training process (complementing human researchers) much more effectively than humans can improve themselves or their children.

The second species argument relies on the argument that superintelligent AI systems will gain power over humans, which is usually justified by arguing that the AI system will be goal-directed. Making this argument more formal is challenging: the EU maximizer framework <@doesn’t work for this purpose@>(@Coherent behaviour in the real world is an incoherent concept@) and applying the intentional stance only helps when you have some prior information about what goals the AI system might have, which begs the question.

The author decides to instead consider a more conceptual, less formal notion of agency, in which a system is more goal-directed the more its cognition has the following properties: (1) self-awareness, (2) planning, (3) judging actions or plans by their consequences, (4) being sensitive to consequences over large distances and long time horizons, (5) internal coherence, and (6) flexibility and adaptability. (Note that this can apply to a single unified model or a collective AI system.) It’s pretty hard to say whether current training regimes will lead to the development of these capabilities, but one argument for it is that many of these capabilities may end up being necessary prerequisites to training AI agents to do intellectual work.

Another potential framework is to identify a goal as some concept learned by the AI system, that then generalizes in such a way that the AI system pursues it over longer time horizons. In this case, we need to predict what concepts an AI system will learn and how likely it is that they generalize in this way. Unfortunately, we don’t yet know how to do this.

What does alignment look like? The author uses <@intent alignment@>(@Clarifying "AI Alignment"@), that is, the AI system should be “trying to do what the human wants it to do”, in order to rule out the cases where the AI system causes bad outcomes through incompetence where it didn’t know what it was supposed to do. Rather than focusing on the outer and inner alignment decomposition, the author prefers to take a holistic view in which the choice of reward function is just one (albeit quite important) tool in the overall project of choosing a training process that shapes the AI system towards safety (either by making it not agentic, or by shaping its motivations so that the agent is intent aligned).

Given that we’ll be trying to build aligned systems, why might we still get an existential catastrophe? First, a failure of alignment is still reasonably likely, since (1) good behavior is hard to identify, (2) human values are complex, (3) influence-seeking may be a useful subgoal during training, and thus incentivized, (4) it is hard to generate training data to disambiguate between different possible goals, and (5) while interpretability could help, it seems quite challenging. Then, given a failure of alignment, the AI systems could seize control via the mechanisms suggested in <@What failure looks like@> and Superintelligence. How likely this is depends on factors like (1) takeoff speed, (2) how easily we can understand what AI systems are doing, (3) how constrained AI systems are at deployment, and (4) how well humanity can coordinate.

Planned opinion:

I like this sequence: I think it’s a good “updated case” for AI risk that focuses on the situation in which intelligent AI systems arise through training of ML models. The points it makes are somewhat different from the ones I would make if I were writing such a case, but I think they are still sufficient to make the case that humanity has work to do if we are to ensure that AI systems we build are aligned.

Note: There is currently a lot of stuff I want to cover in the newsletter, so this will probably go out in the 10/21 newsletter.

Hiring engineers and researchers to help align GPT-3

Relevant 80K podcast:
