Safety Implications of LeCun's path to machine intelligence

Ivan Vendrov

Yann LeCun recently posted A Path Towards Autonomous Machine Intelligence, a high-level description of the architecture he considers most promising to advance AI capabilities.

This post summarizes the architecture and describes some implications for AI safety work if we accept the hypothesis that the first transformative AI will have this architecture.

Why is this a hypothesis worth considering?

LeCun has a track record of being ahead of mainstream academic research, from working on CNNs in the 90s to advocating for self-supervised learning back in 2014-2016 when supervised learning was ascendant.
LeCun runs Meta AI (formerly FAIR) which has enormous resources and influence to advance his research agenda, making it more likely that his proposed architecture will be built at scale. In general I think this is an underrated factor; AI research exhibits a great deal of path dependence, and most plausible paths to AI are not taken primarily because nobody is willing to take a big risk on them.
The architecture is dramatically different from the architectures commonly assumed (implicitly) in much AI alignment work, such as model-free deep RL and "GPT-3 but scaled up 10000x". This makes it a good robustness check for plans that are overly architecture-specific.

Architecture Overview

The Overall Agent

At a high level, the proposed architecture is a set of specialized cognitive modules. With the exception of the Actor and the Intrinsic Cost (see below) they are all deep neural networks trained with gradient descent.

The high level architecture of LeCun's proposed agent. Arrows indicate dependence; gradients flow backward through the thin arrows.

What is this agent doing, exactly? It's meant to be a general architecture for any autonomous AI agent, but LeCun repeatedly emphasizes video inputs and uses self-driving cars as a recurring example, so the central use case is embodied agents taking actions in the physical world. Other talks I've seen by LeCun suggest he thinks understanding video is essential for intelligence, both by analogy to humans and by a heuristic argument about the sheer amount of data it contains.

The World Model

More than half the body of the paper is about designing and training the world model, the predictive model of the environment that the AI uses to plan its actions. LeCun explicitly says that "designing architectures and training paradigms for the world model constitute the main obstacles towards real progress in AI over the next decades."

Why are world models so important? Because the main limitation of current AI systems, according to LeCun, is their sample inefficiency: they need millions of expensive, dangerous real-world interactions to learn tasks that humans can learn with only a few examples. The main way to advance capabilities is to reduce the number of interactions a system needs before it learns how to act, and the most promising way to do that is to learn predictive world models from observational data. (The GPT-3 paper, Language Models are Few-Shot Learners, is a great example of this: a good enough predictive model of language enables much more sample-efficient task acquisition than supervised learning.)

What will these world models look like? According to LeCun, they will be

Predictive but not generative: They will predict high-level features of the future environment but not be able to re-generate the whole environment. This is especially obvious for high-dimensional data like video, where predicting the detailed evolution of every pixel is vastly overkill if you're doing planning. But it could also apply to language agents like chatbots, for whom it may be more important to predict the overall sentiment of a user's reply than the exact sequence of tokens.
Uncertainty-aware: able to capture multimodal distributions over future evolutions of the world state (e.g. whether the car will turn left or right at the upcoming intersection), which LeCun expects to be modeled with latent variables. The ability to model complex uncertainty is the key property LeCun thinks is missing from modern large generative models, and its absence leads him to conclude that "scaling is not enough".
Hierarchical: represent the world at multiple levels of abstraction, with more high-level abstract features evolving more slowly. This makes it computationally feasible to use the same model for the combination of long-term planning and rapid local decision making that characterizes intelligent behavior.
Unitary: AIs will trend towards having one joint world model across all modalities (text, images, video), timescales, and tasks, enabling hardware re-use and knowledge sharing (LeCun speculates that human "common sense" and ability to reason by analogy emerges from humans having a unitary world model). This suggests the trend towards "one giant model" we've seen in NLP will continue and broaden to include the rest of AI.
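
To illustrate the "predictive but not generative" and "uncertainty-aware" properties together, here is a minimal toy sketch. Everything in it is an assumption made for illustration and not taken from LeCun's paper: the state is a hypothetical pair of abstract features (position, heading) instead of pixels, and the names `predict_next` and `sample_futures` are invented. The predictor is deterministic given a latent variable `z`, and sampling `z` yields the distinct futures (turn left or turn right).

```python
import random

def predict_next(state, action, z):
    """Toy deterministic predictor over abstract features, not pixels.

    The latent z selects which of the possible futures occurs.
    """
    position, heading = state
    turn = -1 if z == "left" else 1
    return (position + action, heading + turn)

def sample_futures(state, action, n=100, seed=0):
    """Sample the latent n times and collect the distinct predicted futures."""
    rng = random.Random(seed)
    latents = (rng.choice(["left", "right"]) for _ in range(n))
    return {predict_next(state, action, z) for z in latents}
```

A planner using such a model would evaluate a candidate action against the whole set of sampled futures, without committing to a single predicted outcome.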

The Actor

The actor generates action sequences which minimize the cost (see below) according to the world-model's predictions. It generates these action sequences via some search method; depending on the task, this could be

classic heuristic search methods like Monte-Carlo tree search or beam search.
gradient-based optimization of the action sequence's embedding in some continuous space.

Optionally, one can use imitation learning to distill the resulting action sequence into a policy network. This policy network can serve as a fast generator of actions, analogous to Kahneman's System 1 thinking in humans, or inform the search procedure, as in the {AlphaGo, AlphaZero, MuZero} family of models.

Unlike the world model, the actor is not unitary - it's likely that different tasks will use different search methods and different policy networks.
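
The actor's search loop can be sketched in a few lines under heavy simplifying assumptions. The toy one-dimensional world and the names `world_model`, `intrinsic_cost`, and `plan_by_search` are hypothetical and do not come from the paper; exhaustive enumeration stands in for the tree search or gradient-based optimization mentioned above.

```python
from itertools import product

def world_model(state, action):
    """Toy stand-in for a learned predictor: next state = state + action."""
    return state + action

def intrinsic_cost(state, target=3):
    """Toy hard-wired cost: distance from a fixed target state."""
    return abs(state - target)

def plan_by_search(state, horizon=3, actions=(-1, 0, 1)):
    """Return the action sequence with the lowest predicted total cost."""
    best_plan, best_cost = None, float("inf")
    for plan in product(actions, repeat=horizon):
        s, total = state, 0
        for a in plan:
            s = world_model(s, a)       # imagine the next state
            total += intrinsic_cost(s)  # accumulate its predicted cost
        if total < best_cost:
            best_plan, best_cost = plan, total
    return best_plan, best_cost
```

Note that the plan is chosen entirely inside the world model's predictions; no real-world interaction happens until the chosen actions are executed.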

The Cost

So what exactly is this agent optimizing? There is a hard-wired, non-trainable mapping from world states to a scalar "intrinsic cost". The actor generates plans that minimize the sum of costs over time, which makes costs mathematically equivalent to rewards in reinforcement learning.
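
The equivalence can be written out in one line. The notation here is my own and not the paper's: $C(s_t)$ is the intrinsic cost of the state at time $t$, $r(s_t)$ is a reward in the RL sense, and $a_{1:T}$ is the action sequence being planned.

```latex
r(s_t) := -C(s_t)
\quad\Longrightarrow\quad
\arg\min_{a_{1:T}} \sum_{t=1}^{T} C(s_t)
= \arg\max_{a_{1:T}} \sum_{t=1}^{T} r(s_t)
```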

I think the reason LeCun insists on using his unusual terminology is that he wants to emphasize that in this scheme, normative information does not come from an external source (like a reward provided by a human supervisor) but is an intrinsic drive hard-coded into the agent (like pain, hunger, or curiosity in humans).

The Configurator

The configurator is a component that modulates the behavior of all other components, based on inputs from all other components; it's not specified in any detail and mostly feels like a pointer to "all the component interactions LeCun doesn't want to think about".

It's especially critical from an alignment perspective because it modulates the cost, and thus is the only way that humans can intervene to change the motivations of the agent. LeCun speculates that we might want this modulation to be relatively simple, perhaps only specifying the relative weights of a linear combination of several basic hardcoded drives, because this makes the agent easier to control and predict. He also mentions we will want to include "cost terms that implement safety guardrails", though what these terms are and how the configurator learns to modulate them is left unspecified.
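
A minimal sketch of what "relative weights of a linear combination of hardcoded drives" could look like. The drive names, the toy state dictionary, and the function `configured_cost` are all hypothetical illustrations; the paper does not specify them.

```python
# Hypothetical hard-coded drives over a toy state dictionary.
DRIVES = {
    "energy": lambda s: max(0.0, 1.0 - s["battery"]),
    "guardrail": lambda s: 1.0 if s["speed"] > s["speed_limit"] else 0.0,
}

def configured_cost(state, weights, drives=DRIVES):
    """Weighted sum of fixed drive costs; only the weights can be adjusted."""
    return sum(weights[name] * drive(state) for name, drive in drives.items())
```

In this picture the drives themselves are fixed, and the configurator's only lever on the agent's motivation is the `weights` dictionary, which is what would make the agent comparatively easy to control and predict.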

Implications for AI Safety

Let's assume that the first transformative AI systems are built roughly along the lines LeCun describes. What would this imply for AI safety work?

Interpretability becomes much easier, because the agent is doing explicit planning with a structured world-model that is purely predictive. Provided we can understand the hidden states in the world model (which seems doable with a Circuits-style approach), we can directly see what the agent is planning to do and implement safety strategies like "check that the agent's plan doesn't contain any catastrophic world states before executing an action". Of course, a sufficiently powerful agent could learn to model our safety strategies and avoid them, but the relatively transparent structure of LeCun's architecture gives the defender a big advantage.
Most safety-relevant properties will be emergent from interaction rather than predictable in advance, similar to the considerations for Multi-agent safety. Most of the "intelligence" in the system (the world model) is aimed at increasing predictive accuracy, and the agent is motivated by relatively simple hard-coded drives; whether its intelligent behaviors are safe or dangerous will not be predictable in advance. This makes it less tractable to intervene on the model architecture and training process (including most theoretical alignment work), and more important to have excellent post-training safety checks including simulation testing, adversarial robustness and red-teaming.
Coordination / governance is relatively more important. Whether an AI deployment leads to catastrophic outcomes will mostly be a function not of the agent's properties, but of the safety affordances implemented by the people deploying it (How much power are they giving the agent? How long are they letting it plan? How well are they checking the plans?). These safety affordances are likely to become increasingly expensive as the model's capabilities grow, likely following the computer systems rule of thumb that every nine of reliability costs you 10x, and possibly scaling even worse than that. Ensuring this high alignment tax is paid by all actors deploying powerful AI systems in the world requires a very high level of coordination.
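
The plan check described in the first point above ("check that the agent's plan doesn't contain any catastrophic world states before executing an action") can be sketched as follows. This is an illustration under strong assumptions: `rollout`, `safe_to_execute`, and the `is_catastrophic` predicate are hypothetical names, and the sketch presumes we can already read world states out of the model, which is exactly the interpretability problem that would need solving.

```python
def rollout(world_model, state, plan):
    """Return the predicted states visited when executing plan from state."""
    states = []
    for action in plan:
        state = world_model(state, action)
        states.append(state)
    return states

def safe_to_execute(world_model, state, plan, is_catastrophic):
    """True only if no predicted state along the plan is flagged."""
    predicted = rollout(world_model, state, plan)
    return not any(is_catastrophic(s) for s in predicted)
```

The check is only as good as the world model's predictions and the predicate's coverage, which is one reason the cost of these affordances is likely to grow with capability.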

Conclusion and Unresolved Questions

Broadly, it seems that in a world where LeCun's architecture becomes dominant, useful AI safety work looks more analogous to the kind of work that goes on now to make self-driving cars safe. It's not difficult to understand the individual components of a self-driving car or to debug them in isolation, but emergent interactions between the components and a diverse range of environments require massive and ongoing investments in testing and redundancy.

Two important questions that remain are

How likely is it that this becomes the dominant / most economically important AI architecture? Some trends point towards it (success of self-supervised learning and unitary predictive models; model-based architectures dominant in economically important applications like self-driving cars and recommender systems), others point away (relative stagnation in embodied / video-based agents vs language models; success of model-free RL in complex video game environments like StarCraft and Dota 2).
Just how clean will the lines be between model, actor, cost, and configurator? Depending on how the architecture is trained (and especially if it is trained end-to-end), it seems possible for the world-model or the configurator to start learning implicit policies, in a way that undermines interpretability and the safety affordances it creates.
