Currently, the most powerful technique for getting a language model to act as an agent is RLHF and similar approaches. For example, ChatGPT was trained to be an agent that tries to give humans answers they want. Another approach is to take an LLM and get it to predict what the agent you want would do (this appears to be how most LLM chatbots before ChatGPT worked).

An issue with both of these approaches is that it's difficult to understand the resulting agent's goal. The prototypical example of an agent is AIXI, and its goal is simple to understand: maximize reward in the deployment environment.

In this post, I'll present a way to turn LLMs into agents such that we can approximately model them as utility maximizers. The purpose is to make it easier to think about their alignment.

The most ambitious outcome is that this becomes a slightly easier model in which to study alignment, while still being competitive with RLHF. More modestly, studying it might provide insights that help build intuition for RLHF models, even though the two aren't exactly the same. In particular, we can say more concretely: "these are issues an agent based on an LLM could have; to be safe, we should assume RLHF agents will have them until shown otherwise".

The agent: Monte Carlo tree search, using the LLM as a world model

We start with a purely predictive, "raw", LLM. No fine-tuning or reinforcement learning has been done.

We will construct an agent that communicates with a human over text. At the end of the conversation the human scores the agent, and the agent's goal is to maximize this score.

First, choose an entropy coding, such as arithmetic coding, that uses the LLM as the source distribution. Each message is compressed separately (but uses the previous messages of the conversation as context for the LLM).
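To make this concrete, here is a minimal sketch of arithmetic coding with the LLM as the source distribution. It is only an illustration: `next_token_probs` is a toy placeholder where a real system would query the LLM, exact fractions stand in for the usual finite-precision coder, and a real coder would emit the binary expansion of a number inside the final interval; those bits are the "symbols" the tree search moves over.

```python
# Sketch of arithmetic coding with the LLM as source distribution (assumptions:
# toy 3-token vocabulary and a fixed `next_token_probs`; a real implementation
# would query the LLM and emit bits instead of a Fraction).
from fractions import Fraction

VOCAB = ["hello", "world", "<eos>"]  # "<eos>" marks the end of a message

def next_token_probs(context):
    """Placeholder for the LLM's next-token distribution given the conversation so far."""
    return {"hello": Fraction(1, 2), "world": Fraction(1, 4), "<eos>": Fraction(1, 4)}

def encode(tokens, context):
    """Narrow [0, 1) to the sub-interval corresponding to `tokens` under the model."""
    low, high = Fraction(0), Fraction(1)
    ctx = list(context)
    for tok in tokens:
        probs, width, cum = next_token_probs(ctx), high - low, Fraction(0)
        for v in VOCAB:
            if v == tok:
                high = low + (cum + probs[v]) * width
                low = low + cum * width
                break
            cum += probs[v]
        ctx.append(tok)
    return (low + high) / 2  # any number inside the final interval identifies the message

def decode(value, context):
    """Invert `encode` by walking the model's intervals until the end-of-message token."""
    low, high = Fraction(0), Fraction(1)
    ctx, out = list(context), []
    while not out or out[-1] != "<eos>":
        probs, width, cum = next_token_probs(ctx), high - low, Fraction(0)
        for v in VOCAB:
            if low + cum * width <= value < low + (cum + probs[v]) * width:
                high = low + (cum + probs[v]) * width
                low = low + cum * width
                out.append(v)
                ctx.append(v)
                break
            cum += probs[v]
    return out

code = encode(["hello", "world", "<eos>"], context=[])
assert decode(code, context=[]) == ["hello", "world", "<eos>"]
```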

We now perform a Monte Carlo tree search over conversations. The "moves" are symbols in a compressed message. The user is assumed to move uniformly randomly instead of according to a strategy. Note that a uniform random distribution over the compressed strings corresponds to the LLM's distribution over the plaintext strings.

The game ends when the human gives the agent a score. (During the tree search, this is also estimated using the LLM, just as the user themselves is indirectly simulated using it via the coding.)
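Here is a rough UCT sketch of that search, with everything model-specific stubbed out: `is_terminal`, `estimate_score`, and `agent_turn` are toy stand-ins for what a real system would compute by decoding the bit string with the LLM (whether the conversation has ended, what score the simulated user would give, and whose message each bit belongs to). The structural point is that agent bits are chosen by UCB1 while user bits are sampled uniformly at random, which under the entropy coding corresponds to sampling the user's replies from the LLM.

```python
# Rough UCT sketch over compressed conversations (assumptions: the state is just
# a list of bits; the helper functions below are toy stand-ins for the LLM).
import math
import random

MAX_BITS = 16  # toy stand-in for "the conversation has ended"

def is_terminal(bits):
    return len(bits) >= MAX_BITS

def estimate_score(bits):
    return sum(bits) / len(bits)  # toy reward; really: ask the LLM how the user would score this

def agent_turn(bits):
    return (len(bits) // 4) % 2 == 0  # toy schedule: 4-bit agent and user messages alternate

class Node:
    def __init__(self):
        self.children = {}  # bit -> Node
        self.visits = 0
        self.total_reward = 0.0

def ucb1(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def rollout(bits):
    """Finish the conversation with uniformly random bits and score it."""
    bits = list(bits)
    while not is_terminal(bits):
        bits.append(random.randint(0, 1))
    return estimate_score(bits)

def simulate(node, bits):
    """One iteration of select / expand / rollout / backpropagate."""
    if is_terminal(bits):
        reward = estimate_score(bits)
    else:
        if agent_turn(bits):
            # Decision node: the agent picks its next compressed bit by UCB1.
            for b in (0, 1):
                node.children.setdefault(b, Node())
            bit = max((0, 1), key=lambda b: ucb1(node, node.children[b]))
        else:
            # Chance node: the user's compressed bits are uniform random,
            # i.e. the user's plaintext reply is sampled from the LLM.
            bit = random.randint(0, 1)
            node.children.setdefault(bit, Node())
        child = node.children[bit]
        reward = rollout(bits + [bit]) if child.visits == 0 else simulate(child, bits + [bit])
        child.visits += 1
        child.total_reward += reward
    node.visits += 1
    node.total_reward += reward
    return reward

root = Node()
for _ in range(2000):
    simulate(root, [])
best_first_bit = max(root.children, key=lambda b: root.children[b].visits)
```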

The LLM can be fine-tuned on user responses so that it models them more accurately. You could also fine-tune it on the agent's messages, though you then run the risk of training a powerful agent into the LLM itself. It also isn't strictly necessary, since we are doing a tree search for the agent rather than just sampling from the LLM.
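A minimal sketch of that fine-tuning step, assuming a Hugging Face causal LM ("gpt2" below is just a stand-in for the actual base model) and a hypothetical `transcripts` list of (conversation so far, observed user response) pairs:

```python
# Sketch of fine-tuning the world model on observed user responses (assumptions:
# "gpt2" stands in for the base LLM; `transcripts` is collected from real conversations).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

transcripts = [
    ("Agent: Hi, how can I help?\nUser:", " I'd like to book a flight."),
]

model.train()
for context, user_response in transcripts:
    # Plain causal-LM loss over the whole string; masking the loss to just the
    # user's tokens would target "model the user better" more directly.
    ids = tokenizer(context + user_response, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```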

(There is probably an alternative where you instead adjust the exploration term so that the search explores tokens in proportion to their probability under the LLM. I couldn't quite figure it out, and using an entropy coding generalizes to other search algorithms anyway.)

Analysis

The agent is kind of like an approximation to AIXI. The LLM replaces Solomonoff induction and Monte Carlo tree search replaces arg max.

By compressing the agent's messages, you make it easy for the Monte Carlo tree search to find sensible plans: uniformly random moves in the compressed space already decode to plausible text, so the search only has to choose among coherent messages.

RLHF usually has a myopia property. The agent above doesn't, but we could make it myopic by having the user score each message (and having the tree search optimize only for the next reward).

We could give the agent access to a REPL. This would test how well the underlying LLM can indirectly predict the real world. For example, if the agent writes a program to check the temperature, the LLM has to predict the temperature in order to accurately predict the program's output.
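A sketch of what REPL access might look like at deployment time, assuming a made-up convention where agent messages prefixed with `RUN:` are executed in a fresh Python subprocess and everything else goes to the human. During the tree search, the LLM has to predict the string this would return.

```python
# Sketch of routing agent messages to a REPL or to the human (assumption: the
# "RUN:" prefix convention is made up for illustration; "python" may need to be
# "python3" depending on the system).
import subprocess

def run_in_repl(code: str, timeout_s: float = 5.0) -> str:
    """Execute a snippet in a fresh Python subprocess and return its combined output."""
    result = subprocess.run(
        ["python", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout + result.stderr

def step_environment(agent_message: str) -> str:
    """Give the agent's message to the REPL or the human, and return the observation."""
    if agent_message.startswith("RUN:"):
        return run_in_repl(agent_message[len("RUN:"):])
    return input(agent_message + "\n> ")  # show the message to the human, read their reply
```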

As far as I can tell, shard theory does not apply to this agent.

An interesting alignment idea is to try to "trick" the agent into thinking that powerful overseers exist, and that they are the ones who will reward it. For example:

Then powerful aliens show up. They discovered artificial superintelligence years ago. These aliens love the humans and want you, the agent, to be corrigible according to the criteria set by dath ilan. These aliens will determine your reward.

The problem is that the agent will probably predict that this text is not caused by aliens, but by the program it is running on. This would lead to unpredictable results (what answer will the predictor predict when it realizes it is just predicting itself?).

More generally, I'm not sure how the properties of the LLM affect the goal of the agent. (If other agents are hiding inside the LLM, will they try to escape?)

Avoiding agents where the LLM is outermost

In general, I think there are some relatively promising directions where we don't make the LLM the outermost agent, so we can more easily reuse old alignment work. This is as opposed to things like plugins, where the LLM is outermost and uses other software as tools.

I think one of the most promising approaches might be making the outermost agent an expert system of some kind. For example, maybe it implements various rational principles, using LLMs for forecasting or whatnot. This would essentially be a more sophisticated version of an open agency model or a CoEm.

There are many other AI approaches that could serve as the outer layer, though. Although it appears that reinforcement learning plus LLMs will eventually reach AGI, I think that reusing these older insights might be both competitive and easier to align. If not, they could at least provide insights into what RLHF might be doing internally.

Of course, we are still an extremely long way off from alignment either way, but hopefully moving away from "giant inscrutable matrices" will help a bit.


Comments (7)

"In this post, I'll present a way to turn LLMs into agents such that we can approximately model them as a utility maximizer."

If this works it would be very dangerous and the kind of thing we would want to avoid. We're very lucky current systems are as poorly agentic as they are.

If this works it would be very dangerous

This is almost certainly not true of the proposal in the post because it's just navigating "text space," not the real world. But yes, in general if you have a research idea describable as "make a powerful agent and see what happens," probably don't do that research.

I don't buy that argument at all. "Text space" seems to have been adequate to get to GPT-3, which is incredibly impressive and useful in a variety of ways. Furthermore, what proof do you have that the resulting insights wouldn't transfer to multi-modal systems like GPT-4 (which can see) or PaLM-E, which is embodied and can see and operate in "text space"? Moreover, I'm not the first to point out that text space seems to incentivize models to develop highly sophisticated thinking abilities, which seem like the more important thing to focus on.

You seem to be making a very general cloud of claims about the impressiveness of transformers. I was making a very specific claim about the system described in the post, and in what sense it's not myopic.

I mean, any approach for building friendly AI is going to be dangerous.

Keep in mind that this would work best if used to amplify a small LLM (since it requires many samples), so I think it's a case of positive differential acceleration.

This is very close to what RLHF is already doing. Also maybe see "RLHF with KL penalties is Bayesian Inference".

The basic point is that a LLM finetuned with RLHF acts like an agent trained to spend an "improbability budget" (relative to the base-LLM distribution) at each step to steer the text into higher-reward trajectories.

I would also like to see some sort of symbolic optimization process operating as a wrapper for an LLM, acting as an interpretable bridge between the black-box model and the real world, but I doubt Monte-Carlo Tree Search/Expectimax is the right sort of algorithm. Maybe something closer to a GOFAI planner calling and parsing LLM outputs, in a way similar to Factored Cognition, might be better and much more computationally efficient.