I'm going to be a bit more explicit about some ideas that appeared in The Parable of Predict-O-Matic. (If you don't want spoilers, read it first. Probably you should read it first anyway.)

[Note: while the ideas here are somewhat better than the ideas in the predict-o-matic story, they're equally rambling, without the crutch of the story to prop them up. As such, I expect readers to be less engaged. Unless you're especially interested in which character's remarks are true (or at least, which ones I stand by), this might be a post to skim; I don't think it has enough coherence that you need to read it start-to-finish.]

First, as I mentioned in Partial Agency, my main concern here isn't actually about building safe oracles or inner-aligned systems. My main concern is to understand what's going on. If we can build guaranteed-myopic systems, that's good for some purposes. If we can build guaranteed-non-myopic systems, that's good for other purposes. The story largely frames it as a back-and-forth about whether things will be OK / whether there will be terrible consequences; but my focus was on the more specific questions about the behavior of the system.

Second, I'm not trying to confidently stand behind any of the characters' views on what will happen. The ending was partly intended to be "and no one got it right, because this stuff is very complicated". I'm very uncertain about all of this. Part of the reason why it was so much easier to write the post as a story was that I could have characters confidently explain views without worrying about adding all the relevant caveats.

Inductive Bias

Evan Hubinger pointed out to me that all the characters are talking about asymptotic performance, and ignoring inductive bias. Inner optimizers might emerge due to the inductive bias of the system. I agree; in my mind, the ending was a bit of a hat tip to this, although I hinted at gradient hacking rather than inductive bias in the actual text.

On the other hand, "inductive bias" is a complicated object when you're talking about a system which isn't 100% Bayesian.

  • You often represent inductive bias through regularization techniques which introduce incentives pulling toward 'simpler' models. This means we're back in the territory of incentives and convergence.
  • So, to talk about what a learning algorithm really does, we have to also think of the initialization and search procedure as part of the inductive bias. This makes inductive bias altogether a fairly complicated object.
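A toy sketch may help make this concrete. It is not from the story; it assumes an underdetermined problem (fit w1 + w2 = 2) and plain gradient descent, and the names `descend` and `lam` are hypothetical. Without a penalty, the answer depends on where the search starts; with an L2 penalty, the pull toward 'simpler' weights picks one answer regardless of the starting point.

```python
def descend(w1, w2, lam=0.0, lr=0.1, steps=2000):
    """Gradient descent on (w1 + w2 - 2)**2 + lam * (w1**2 + w2**2)."""
    for _ in range(steps):
        residual = w1 + w2 - 2.0
        g1 = 2.0 * residual + 2.0 * lam * w1
        g2 = 2.0 * residual + 2.0 * lam * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2

# No penalty: every point on the line w1 + w2 = 2 has zero loss,
# so the initialization decides which solution is found.
plain_from_origin = descend(0.0, 0.0)
plain_from_offset = descend(3.0, 0.0)

# With a penalty: both starting points are pulled to the same solution.
reg_from_origin = descend(0.0, 0.0, lam=0.1)
reg_from_offset = descend(3.0, 0.0, lam=0.1)
```

Under these assumptions the unpenalized runs end near (1, 1) and (2.5, -0.5) respectively, while both penalized runs end near the same point. The first effect is inductive bias from initialization and search; the second is inductive bias expressed as an incentive.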

Explicit Fixed-Point Selection

The very first conversation involved the intern arguing that there would be multiple valid fixed-points of prediction, and Predict-O-Matic would have to choose between them somehow.

Explicitly modeling fixed points and choosing between them is a feature of the logical induction algorithm. This feature allows us to select the best one according to some criterion, as is leveraged in When Wishful Thinking Works. As discussed later in the conversation with the mathematician, this is atypical of supervised learning algorithms. What logical induction does is very expensive: it solves a computationally difficult fixed-point finding problem (by searching exhaustively).
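To illustrate what "explicitly modeling fixed points and choosing between them" could look like in the simplest possible case, here is a hedged sketch. It is not the logical induction algorithm; it assumes a one-dimensional prediction p in [0, 1], a known hypothetical `response` function giving the probability of the event once p is announced, and a brute-force grid scan. The names `find_fixed_points`, `select_fixed_point`, and `smooth_lock_in` are invented for the example.

```python
def find_fixed_points(response, grid_size=100, tol=1e-9):
    """Exhaustively scan a grid of predictions p for points where response(p) == p."""
    candidates = [i / grid_size for i in range(grid_size + 1)]
    return [p for p in candidates if abs(response(p) - p) <= tol]

def select_fixed_point(response, score):
    """Pick the self-consistent prediction that scores best under a chosen criterion."""
    return max(find_fixed_points(response), key=score)

def smooth_lock_in(p):
    # Hypothetical world: announcing a probability above 0.5 makes the event
    # more likely than announced; announcing one below 0.5 makes it less likely.
    return 3 * p**2 - 2 * p**3

fixed_points = find_fixed_points(smooth_lock_in)  # self-consistent predictions
chosen = select_fixed_point(smooth_lock_in, score=lambda p: p)  # prefer the event happening
```

Even in this toy, the cost is visible: every candidate prediction has to be checked, and the scan grows exponentially with the number of quantities being predicted jointly.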

Other algorithms are not really "choosing a fixed point somehow". They're typically failing to guarantee a fixed point. The mathematician hinted at this by describing how algorithms would not necessarily converge to a self-fulfilling prophecy; they could just as easily go in circles or wander around randomly forever.

Think of it like fashion. Sometimes, putting a trend into common knowledge will lock it in; this was true of neckties in business for a long time. In other instances, the popularity of a fashion trend will actually work against it, a fashion statement being ineffective if it's overdone.

So, keep in mind that different learning procedures will relate to this aspect of the problem in different ways.
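The fashion analogy can be sketched as a toy simulation. This is an illustration under strong assumptions, not a model of any real learner: the predictor holds a single probability p, the world's response to p is a deterministic hypothetical function, and the update simply moves p toward the observed outcome. All names are invented for the example.

```python
def run_online_predictor(world_response, p0, lr=0.3, steps=200):
    """Repeatedly predict p, observe the outcome the world produces, move p toward it."""
    p = p0
    history = [p]
    for _ in range(steps):
        outcome = world_response(p)
        p += lr * (outcome - p)
        history.append(p)
    return history

def lock_in(p):
    # Neckties: predicting the trend makes it happen.
    return 1.0 if p >= 0.5 else 0.0

def backlash(p):
    # Overdone fashion: predicting the trend kills it.
    return 0.0 if p >= 0.5 else 1.0

def count_crossings(history, level=0.5):
    """Count how often the prediction flips between 'likely' and 'unlikely'."""
    return sum((a >= level) != (b >= level) for a, b in zip(history, history[1:]))

settled_high = run_online_predictor(lock_in, p0=0.6)
settled_low = run_online_predictor(lock_in, p0=0.4)
restless = run_online_predictor(backlash, p0=0.2)
```

In the lock-in world there are two self-fulfilling prophecies, and which one the learner lands on is decided by where it started rather than by any choice. In the backlash world there is no fixed point to find, and the prediction keeps flipping across 0.5 indefinitely.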

Reward vs Prediction Error

The economist first compared the learning algorithm to decision markets, then later decided prediction markets were a better analogy.

The mathematician contrasted the learning algorithm with reinforcement learning, pointing out that Predict-O-Matic always adjusted outputs to be more like historical observations, whereas reinforcement learning would more strategically optimize reward.

Both of these point at a distinction between learning general decision-making and something much narrower and much more epistemic in character. As I see it, the critical idea is that (1) the system gets information about what it should have output; (2) the learning update moves toward a modified system which would have output that. This is quite different from reinforcement learning.
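A minimal sketch of the two kinds of update, assuming a softmax distribution over three possible outputs. The function names are hypothetical, and the second function is a bare REINFORCE-style step without a baseline, used here only as a stand-in for "reinforcement learning".

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def imitative_update(logits, observed, lr=0.5):
    """Told what it should have output; take a log-loss gradient step toward it."""
    probs = softmax(logits)
    return [z + lr * ((1.0 if i == observed else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]

def reinforce_update(logits, action, reward, lr=0.5):
    """Told only how good its own sampled action was; never told the 'right' output."""
    probs = softmax(logits)
    return [z + lr * reward * ((1.0 if i == action else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]

logits = [0.0, 0.0, 0.0]
after_imitation = softmax(imitative_update(logits, observed=2))
after_good_reward = softmax(reinforce_update(logits, action=0, reward=1.0))
after_bad_reward = softmax(reinforce_update(logits, action=0, reward=-1.0))
```

The two steps have the same algebraic shape, but the information driving them differs: the first is indexed by what was observed, the second by what the system itself did, scaled by a scalar score.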

In a recent post, Wei Dai mentions a similar distinction (italics added by me):

Supervised training - This is safer than reinforcement learning because we don't have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle's output during a training episode is an automated system that computes the Oracle's reward/loss, and that system is secure because it's just computing a simple distance metric (comparing the Oracle's output to the training label), then reward hacking and self-confirming predictions can't happen.

There are several things going on here, but I think Wei is trying to point at something similar to the distinction I'm thinking of. It's quite tempting to call it "supervised learning", because you get a signal telling you what you should have done. However, it's a bit fuzzy, because this also encompasses things normally called "unsupervised learning": the supervised/unsupervised distinction is often explained as modeling P(y|x) vs P(x). Wikipedia:

It could be contrasted with supervised learning by saying that whereas supervised learning intends to infer a conditional probability distribution p_X(x | y) conditioned on the label y of input data; unsupervised learning intends to infer an a priori probability distribution p_X(x).

But many (not all) unsupervised algorithms still have the critical features we're interested in! Predicting without any context information to help still involves (1) getting feedback on what we "should have" expected, and (2) updating to a configuration which would have more expected that. We simply can't expect the predictions to be as focused, given the absence of contextual information to help. But that just means it's a prediction task on which we tend to expect lower accuracy.
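As a rough illustration, the same "observe what you should have expected, then update" loop can be run with and without context. This is a sketch under simple assumptions: a Laplace-smoothed counting predictor, a made-up weather dataset, and hypothetical names throughout.

```python
import math
from collections import Counter, defaultdict

class CountingPredictor:
    """Laplace-smoothed frequency predictor over a fixed set of outcomes."""

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)
        self.counts = defaultdict(Counter)

    def prob(self, outcome, context=None):
        c = self.counts[context]
        return (c[outcome] + 1) / (sum(c.values()) + len(self.outcomes))

    def update(self, outcome, context=None):
        # Feedback on what we "should have" expected: raise its future probability.
        self.counts[context][outcome] += 1

def total_log_loss(data, use_context):
    model = CountingPredictor(["rain", "dry"])
    loss = 0.0
    for context, outcome in data:
        ctx = context if use_context else None
        loss -= math.log(model.prob(outcome, ctx))
        model.update(outcome, ctx)
    return loss

data = [("clouds", "rain"), ("clear", "dry")] * 50  # made-up observations
loss_with_context = total_log_loss(data, use_context=True)
loss_without_context = total_log_loss(data, use_context=False)
```

Both runs use the identical update rule. Dropping the context only makes the predictions less focused, which shows up as a higher total log loss.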

I'm somewhat happy referring to this category as imitative learning. This includes supervised learning, unsupervised learning so long as it's generative (but not otherwise), and imitation learning (a paradigm which achieves similar ends as inverse reinforcement learning). However, the terminological overlap with 'imitation learning' is rather terrible, so I'm open to other suggestions.

It seems to me that this is a critical distinction for the myopia discussion. I hope to say more about it in future posts.

Maximizing Entropy?

The discussion of prediction markets toward the end was rather loose, in that the economist didn't deal with a lot of the other points which had been made throughout, and just threw a new model out there.

  • The mechanism of manipulation is left quite vague. In an assassination market, there are all kinds of side-channels which agents can use to accomplish their goals. But the rest of the essay had only considered the influence which Predict-O-Matic has by virtue of the predictions it makes. When writing this part, I was actually imagining side-channels, such as exploiting bugs to communicate by other means.
  • It's not clear whether the market is supposed to be myopic or non-myopic in this discussion. The argument for overall myopia was the economist's initial decision-market model, which is being cast aside here. Are manipulators expected to only throw off individual predictions to reap reward, or are they expected to implement cross-prediction strategies?
  • The disanalogy between prediction markets and a local search like gradient descent is swept under the rug. A prediction market has many agents interacting with it, so that one agent can pick up money from another by out-smarting it. Do conclusions from that setting carry over to single-hypothesis learning? It isn't clear.

Isnasene interpreted the first point by imagining that the mechanism of manipulation is still through selection of fixed points:

In the same way that self-fulfilling predictions are good for prediction strategies because they enhance accuracy of the strategy in question, self-fulfilling predictions that seem generally surprising to outside observers are even better because they lower the accuracy of competing strategies. The established prediction strategy thus systematically causes the kinds of events in the world that no other method could predict to further establish itself.

This is compatible with the assumption of myopia; we might imagine that the system still can't manipulate events through actual bad predictions, because those strategies will be undercut. Therefore, the manipulation is restricted to selecting fixed-points which are surprising.

However, there are three problems with this:

  • The "undercutting" argument relies on an assumption that there are enough different strategies considered, so that the one which undercuts the non-myopic strategy is eventually found. The argument for strategically increasing entropy relies on the reverse assumption! There's little profit in increasing entropy if others can predict that you'll do this and cash in.
    • (We might still see both effects at different times in training, however, and therefore expect major consequences of both effects.)
  • As previously discussed, we don't actually expect fixed points in general. So we have to ask whether the entropy-increasing incentive is significant more generally (unless we're specifically thinking about the logical induction algorithm or other algorithms which get you a fixed point).
  • We still have not dealt with the disanalogy between prediction markets and local-search-based learning.

So, it seems the actual situation is more complicated, and I'm not yet sure how to think about this.

'Local Search'; selection vs control

I used the term 'local search' to describe the application of gradient-descent-like updates to reduce prediction error. I have some conceptual/terminological issues with this.

Calling this 'local search' invokes the mental image of a well-defined gradient landscape which we are taking steps on, to further optimize some function. But this is the wrong mental image. The mental image is one of selection, when we're in a control setting (in my terminology). We are not making an iid assumption. We are not getting samples from a stationary but stochastic loss function, as in stochastic gradient descent.

If 'local search' were an appropriate descriptor for gradient-descent here, would it also be an appropriate descriptor for Bayesian updates? There's a tendency to think of Bayesian learning as trying to find one good hypothesis by tracking how well all of them do (which sounds like a global search), but we needn't think of it this way. The "right answer" can be a mixture over hypotheses. We can think of a Bayesian update as incrementally improving our mixture. But thinking of Bayesian updates as local search seems wrong. (So does thinking of them as global search.)

This is online learning. A gradient-descent step represents a prediction that the future will be like the past in some relevant sense, in spite of potential non-stationarity. It is not a guaranteed improvement, even in expectation -- as it would be in offline stochastic gradient descent with sufficiently small step size.

Moreover, step size becomes a more significant problem. In offline gradient descent, selecting too small a step size only means that you have to make many more steps to get where you're going. It's "just a matter of computing power". In online learning, it's a more serious problem; we want to make the appropriate-sized update to new data.

I realize there are more ways of dealing with this than tuning step size; we don't necessarily update to data by making a single gradient step. But there are problems of principle here.
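The step-size point can be seen in a small non-stationary example. This is a sketch under assumed conditions: a scalar prediction, a target that jumps from around 0 to around 10 halfway through with an alternating plus-or-minus 1 wobble, and one gradient step on half the squared error per observation. The names and numbers are invented for the illustration.

```python
def online_squared_loss(targets, lr, p0=0.0):
    """Predict, suffer squared error, then take one gradient step toward the target."""
    p, total = p0, 0.0
    for y in targets:
        total += (y - p) ** 2
        p += lr * (y - p)  # bets that the next target resembles this one
    return total

# Level 0 for the first 100 steps, level 10 afterwards, plus an alternating wobble.
targets = [(0.0 if t < 100 else 10.0) + (1.0 if t % 2 == 0 else -1.0)
           for t in range(200)]

losses = {lr: online_squared_loss(targets, lr) for lr in (0.01, 0.2, 1.0)}
```

Here the smallest step size does worst because it is still catching up with the jump when the data runs out, and the largest does poorly because it chases the wobble. No amount of extra computation fixes either, since each observation arrives only once.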

What's gradient descent without a fitness landscape?

Simply put, gradient descent is a search concept, not a learning concept. I want to be able to think of it more directly as a learning concept. I want to be able to think of it as an "update", and use terminology which points out the similarity to Bayesian updates.

The Duality Remark

Vanessa asked about this passage:

The engineer was worse: they were arguing that Predict-O-Matic might maximize prediction error! Some kind of duality principle. Minimizing in one direction means maximizing in the other direction. Whatever that means.

I responded:

[I]t was a speculative conjecture which I thought of while writing.
The idea is that incentivizing agents to lower the error of your predictions (as in a prediction market) looks exactly like incentivizing them to "create" information (find ways of making the world more chaotic), and this is no coincidence. So perhaps there's a more general principle behind it, where trying to incentivize minimization of f(x,y) only through channel x (eg, only by improving predictions) results in an incentive to maximize f through y, under some additional assumptions. Maybe there is a connection to optimization duality in there.
In terms of the fictional canon, I think of it as the engineer trying to convince the boss by simplifying things and making wild but impressive sounding conjectures. :)

If you have an outer optimizer which is trying to maximize f(x,y) through x while being indifferent about y, it seems sensible to suppose that inner optimizers will want to change y to throw things off, particularly if they can get credit for then correcting x to be optimal for the new y. If so, then inner optimizers will generally be seeking to find y-values which make the current x a comparatively bad choice. So this argument does not establish an incentive to choose y which makes all choices of x poor.

In a log-loss setting, this would translate to an incentive to make observations surprising (for the current expectations), rather than a direct incentive to make outcomes maximum-entropy. However, iteration of this would push toward maximum entropy. Or, logical-induction-style fixed-point selection could push directly to maximum entropy.
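A toy simulation of the iteration claim, with heavy caveats: it assumes a manipulator that can freely pick the outcome, always picks whichever outcome the current forecast considers least likely, and faces a predictor that simply moves its forecast toward each observation (an exponential moving average, used as a stand-in for an imitative update). None of this is established for real systems; the names are hypothetical.

```python
import math

def entropy(q):
    return -sum(p * math.log(p) for p in q if p > 0)

def most_surprising(q):
    """The outcome with the highest log loss under the current forecast q."""
    return min(range(len(q)), key=lambda i: q[i])

def imitative_step(q, outcome, lr=0.05):
    """Move the forecast toward what was just observed; the result still sums to 1."""
    return [(1 - lr) * p + (lr if i == outcome else 0.0) for i, p in enumerate(q)]

q = [0.7, 0.2, 0.1]
first_choice = most_surprising(q)  # a single outcome, not a uniform draw
for _ in range(500):
    q = imitative_step(q, most_surprising(q))
final_entropy = entropy(q)
```

Each individual manipulation targets one surprising outcome rather than maximum entropy, but under these assumptions the forecast is driven close to uniform, with entropy near log 3.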

This would be a nice example of partial agency. The system is strategically influencing x and y so as to maximize f through channel x, while minimizing f through channel y. What does this mean? This does not correspond to a coherent objective function at all! The system is 'learning a game-theoretic equilibrium' -- which is to say, it's learning to fight with itself, rather than optimize.

There are two different ways we can think about this. One way is to say there's an inner alignment problem here: the system learns to do something which doesn't fit any objective, so it's sort of trivially misaligned with whatever the outer objective was supposed to be. But what if we wanted this? We can think of games as a kind of generalized objective, legitimizing this behavior.

To make things even more confusing, if the only channel by which Predict-O-Matic can influence the world is via the predictions which get output, then... doesn't x = y? x represents the 'legitimate' channel whereby predictions get combined with (fixed) observations to yield a score. y represents the 'manipulative' channel, where predictions can influence the world and thus modify observations. But the two causal pathways have one bottleneck which the system has to act through, namely, the predictions made.

In any case, I don't particularly trust any of the reasoning above.

  • I didn't clarify my assumptions. What does it mean for the outer optimizer to maximize f through x while being indifferent about y? It's quite plausible that some versions of that will incentivise inner optimizers which optimize taking advantage of both channels, rather than the contradictory behavior conjectured above.
  • I anthropomorphized the inner optimizers. In particular, I did not specify or reason about details of the learning procedure.
    • This sort of assumes they'll tend to act like full agents rather than partial agents, while yielding a conclusion which suggests otherwise.
  • This caused me to speak in terms of a fixed optimization problem, rather than a learning process. Optimizing f isn't really one thing -- f is a loss function which is applied repeatedly in order to learn. The real problem facing inner optimizers is an iterated game involving a complex world. I can only think of them trying to game a single f if I establish that they're myopic; otherwise I should think of them trying to deal with a sequence of f instances.

So, I'm still unsure how to think about all this.

1 comment

In a recent post, Wei Dai mentions a similar distinction (italics added by me):

Supervised training—This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing a simple distance metric (comparing the Oracle’s output to the training label), then reward hacking and self-confirming predictions can’t happen.

I think I've updated a bit from when I wrote this (due to this discussion). (ETA: I've now added a link from that paragraph to this comment.) Now I would say that the safety-relevant differences between SL and RL are:

  1. The loss computation for SL is typically simpler than the reward computation for RL, and therefore more secure / harder to hack, but maybe we shouldn't depend on that for safety.
  2. SL doesn't explore, so it can't "stumble onto" a way to hack the reward/loss computation like RL can. But it can still learn to hack the loss computation or the training label if the model becomes a mesa optimizer that cares about minimizing "loss" (e.g., the output of the physical loss computation) as either a terminal or instrumental goal. In other words, if reward/loss hacking happens with SL, the optimization power for it seemingly has to come from a mesa optimizer, whereas for RL it could come from either the base or mesa optimizer.