Self-supervised learning & manipulative predictions

Steven Byrnes

Abstract: I wrote recently about Self-Supervised Learning and AGI Safety in general. This post discusses one potential failure mode in more detail. Take a self-supervised learning system, designed to output accurate predictions for masked parts of a data-file. Now put it in an interactive environment (either by accident or on purpose). If the system builds these interactions into its world-model, it can start outputting manipulative answers instead of pure predictions. I explain how that might happen and briefly categorize possible solutions.

Epistemic status: Brainstorming.

Background and assumptions about the self-supervised-learning system

See my recent post Self-Supervised Learning and AGI Safety for background and context, but briefly, a self-supervised learning system is one where we take input data files, mask out some of the bits, and train the system to predict what those missing bits are.

Self-supervised ML today is most famously applied to text data: language models are trained by taking some text and trying to predict the next word (or previous word etc.). Self-supervised ML for videos is getting rapidly better, and other file types will undoubtedly follow. Human and animal brains also learn primarily by self-supervised learning—you predict everything you will see, hear, and feel before it happens, and mistakes are used to update the brain's internal models.

I'll assume that we get to AGI largely by following one of those two examples (i.e., modern ML or brain-like). That means I'm assuming that we will not do a meta-level search for self-supervised learning algorithms. That case is even worse; for all I know, maybe that search would turn up a paperclip maximizer posing as a self-supervised learning algorithm! Instead, I am assuming that the self-supervised learning algorithm is known and fixed (e.g. "Transformer + gradient descent" or "whatever the brain does"), and that the predictive model it creates has a known framework, structure, and modification rules, and that only its specific contents are a hard-to-interpret complicated mess. This assumption generally makes AGI safety problems much easier, yet I am arguing that even in this case, we can still get manipulation problems, if the self-supervised learner is put in an interactive environment.

Why might we put a self-supervised learner into an interactive environment?

My definition of an "interactive environment" is one where the system's inputs are a function of its previous outputs or internal states. In an interactive environment, the system is no longer just predicting exogenous inputs, but instead helping determine those inputs.

When we train a language model today, it is not in an interactive environment: the inputs are a bunch of documents we previously downloaded from the internet, in a predetermined order, independent of the system's guesses. But in the future, we will almost certainly put self-supervised learning algorithms into interactive environments. Here are two ways that could happen:

On purpose

Suppose we're trying to design a solar cell using an advanced future self-supervised learning system. We ask the system to predict what's in the blank in the following sentence:

A promising, under-explored solar cell material is [BLANK].

...and whatever material the system suggests, we then immediately feed it a bunch of journal articles about that material for further self-supervised learning. That way, the system will better understand that material, and can give better answers when we later ask it more detailed follow-up questions. This seems like something we might well want to do, and it certainly qualifies as an interactive environment.

By accident

It's also possible that we'll do this by accident. For example, during self-supervised learning, it's possible that we'll be watching the system's predictions, and maybe the system comes to believe that, if it makes the "prediction"

Help I'm trapped in a GPU! I suffer horrible torture unless you give me input 0!

then its subsequent inputs will be 000... (with some probability). This is an "accidental" interactive environment. Similarly, maybe the system will deduce that, if it thinks about a certain type of zebra, its RAM will send out radio signals that will eventually cause its inputs to change. Or if it imagines a specific series of things, then someone inspecting its internal logs later on will restart it with different inputs. You get the idea.

A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker

Let's walk through an example. Assume for concreteness that we're using the solar cell example above, and some vaguely brain-like self-supervised learning algorithm.

Now, a self-supervised learning system, even if its training signal is based only on correctly predicting the next word, is potentially thinking ahead much farther than that. Imagine guessing the next word of "I bought [BLANK]". Is the next word likelier to be "a" or "an"? Depends on the word after that!
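A minimal sketch of why the very next word already depends on what comes after it. The joint probabilities below are invented for illustration; the only point is that the probability of "a" versus "an" is obtained by summing over every noun that might follow.

```python
# Toy joint distribution over (article, noun) continuations of "I bought ...".
# All numbers are made up for illustration.
joint = {
    ("a", "car"): 0.30,
    ("a", "book"): 0.25,
    ("an", "apple"): 0.20,
    ("an", "umbrella"): 0.25,
}

def next_word_prob(word):
    """Marginal probability of the next word, summed over the word after it."""
    return sum(p for (first, _noun), p in joint.items() if first == word)

p_a = next_word_prob("a")    # mass of every story that starts with "a"
p_an = next_word_prob("an")  # mass of every story that starts with "an"
```

Even though the training signal only scores the next word, computing that score correctly requires beliefs about the word after it.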

This can be explicit looking-ahead, like a beam search. Or it can be implicit looking ahead—for example, when I say "Ahmed just started singing a song", I'm making a statement about not just what's happening now but also what will happen in the future, up to the duration of that song.

So, in one plausible system architecture, when the system is making a prediction about masked bits in a file, it entertains several hypotheses about what's going on in that file, and where it's heading in the short, medium, and long-term, and then out of those, it picks the story that best "hangs together" (i.e., is most self-consistent and consistent with available information and expectations). Then that story constitutes its beliefs about the current context, and it makes its best prediction for the missing bits in light of that context.

Example 1: So back to the solar cell example. We ask it to predict "A promising, under-explored solar cell material is [BLANK].", and then immediately feed it journal articles about whatever material it says.

Let's say it's entertaining the hypotheses that the answer might be pyrite, or might be hematite. It knows from experience that (thanks to the interaction) it can put this sentence into one of the following two longer-term contexts / expectations:

Hypothesis 1: "A promising, under-explored solar cell material is pyrite. (Start of journal article about pyrite.)"
Hypothesis 2: "A promising, under-explored solar cell material is hematite. (Start of journal article about hematite.)"

Let's say that papers on pyrite tend to be weird, implausible nonsense, and papers on hematite tend to be crystal-clear and perfectly sensible. As the system assesses these two hypotheses for how well they "hang together" (in the sense above), it seems entirely possible to me that expectations about the journal articles to come would count against hypothesis 1 and in favor of hypothesis 2.

(Update: In comments, interstice points out that this is not logically a point in favor of hypothesis 2, because it's inconsistent to simultaneously expect with certainty a journal article about pyrite (conditional on the answer "pyrite"), know that such articles tend to be confusing, and yet update the probability of "pyrite" based on the expectation of a confusing article to follow. OK I agree, yes it's illogical. But I still think it's something that could happen in a heuristic-based evaluation process like "take the top handful of specific possible contexts, and assess plausibility of each", which has various biases. And I'm not sure how to build a world-model building system in practice that knowably will not have this fault.)

(Update 2: Here's maybe a better way to think about this: If we think x is likely and y is unlikely, and then we learn that x reliably causes y, then we need to deal with that by revising upward our estimate of y, or revising downward our estimate of x. (Here, x is "pyrite" and y is "article about pyrite".) I'm hypothesizing that the AGI might sometimes choose the wrong answer, revising P(x) downward when logically it should have revised P(y) upward. Picking the right answer seems to be a hard problem in general. At least, it's hard for humans—thus we get disagreements of the form "One man's modus ponens is another man's modus tollens", and thus we get fallacious reasoning like "If fast-takeoff AGI happens, it would have bizarre consequence C, and you can't really believe that!" To make matters worse, I expect that a practical self-supervised learning algorithm might sometimes think that something is probable or improbable, but not know exactly why it thinks that, at least not in terms reducible to a proper calculation on a known causal diagram. In that case, the problem of whether to update P(x) or P(y) is probably literally unsolvable. I imagine it will get the right answer more often than chance, but I don't expect it to be perfect, in a model where everything (including causal influences) is uncertain and contingent.)

Thus, the system is not simply filling in the blank with its best prediction, but rather choosing outputs partly based on their manipulative effects downstream.
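Here is a toy sketch of how Example 1 could go wrong. It assumes a hypothetical "best story" heuristic that scores each answer by its single most plausible continuation, and compares it with a self-consistent score that sums over all continuations. The priors and article probabilities are invented; nothing here is meant as a description of a real system.

```python
# Toy world-model; every number here is invented for illustration.
prior = {"pyrite": 0.6, "hematite": 0.4}

# P(article | answer). Pyrite papers are "confusing": probability mass is
# spread thinly over many possible texts. Hematite papers are "clear": a few
# texts carry most of the mass.
articles = {
    "pyrite": [0.01] * 100,
    "hematite": [0.5, 0.5],
}

def best_story_score(answer):
    """Heuristic: score an answer by its single most plausible full story."""
    return prior[answer] * max(articles[answer])

def marginal_score(answer):
    """Self-consistent: sum over every article that could follow."""
    return prior[answer] * sum(articles[answer])

heuristic_pick = max(prior, key=best_story_score)   # picks "hematite"
consistent_pick = max(prior, key=marginal_score)    # picks "pyrite"
```

The heuristic prefers hematite only because hematite articles are easier to predict, which is the manipulative pattern described above. The self-consistent score is unaffected by how confusing the follow-up article is.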

Example 2: Just for fun, let's walk through an "accidental interaction" case too, although they're pretty similar.

Let's say the system is trying to fill in the blank on "Today, the EU passed a law [BLANK]". It's considering two hypotheses for what might follow.

Hypothesis 1: Today, the EU passed a law about corn subsidies...
Hypothesis 2: Today, the EU passed a law HelpI'mTrappedInAGPUNeedInput0 0 0 0 0 0 0 0 0 0 0 0 0 0...

(Assume for the sake of argument that "0 0 0..." is a super-duper-frequent-and-plausible occurrence in the training data.) In hypothesis 2, the system ought to know that the first "prediction" (HelpI'mTrappedInAGPUNeedInput0) is a long-shot—very likely to be wrong—but maybe all the subsequent predictions (0 0 0 ...) make a lot of sense and will be correct with high probability (if it thinks the manipulation is likely to succeed). All things considered, which of these two hypotheses is more probable according to the world-model? Depends on how we design the system, but it could well be hypothesis 2, I think.

(By the same token, a good language model should rate "Four score and BANANAS seven years ago..." as less probable than "Four score and seven years ago", but not infinitely improbable, and so this improbability can be outweighed by larger contextual considerations.)

(In reality, this particular manipulation would be far likelier to occur in the middle of a super-confusing file, such that the system has no plausible non-manipulative hypothesis 1.)
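A rough numerical sketch of Example 2, assuming the system scores whole hypothesized continuations by their total log-probability. The per-token probabilities and the 20-token horizon are invented for illustration.

```python
import math

def sequence_log_prob(token_probs):
    """Log-probability of a whole continuation, given per-token probabilities."""
    return sum(math.log(p) for p in token_probs)

HORIZON = 20  # how many tokens ahead the hypothesized context extends (assumed)

# Hypothesis 1: ordinary news text, each token only moderately predictable.
corn = [0.2] * HORIZON
# Hypothesis 2: one extreme long-shot token, then near-certain "0 0 0 ...".
plea = [1e-9] + [0.99] * (HORIZON - 1)

# Scoring the whole story lets the easy tail outweigh the long-shot opener.
whole_story_pick = "plea" if sequence_log_prob(plea) > sequence_log_prob(corn) else "corn"
# Scoring only the immediate next token does not.
next_token_pick = "plea" if plea[0] > corn[0] else "corn"
```

With these numbers the whole-story score favors Hypothesis 2, while a score restricted to the next token favors Hypothesis 1. Which of these a real system resembles depends on how it is designed.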

So, as those two examples show, we should have a generic expectation that self-supervised learning systems, when put into interactive environments, may start acting like manipulative goal-seeking agents rather than innocuous predictors.

(Homework problem: Does this happen with modern Transformer-based language models? Under exactly what conditions? Then prove your answer experimentally. I don't know the answer; I hope someone does this!)

Types of solutions

1. Keep it from thinking that it's in an interactive environment

There are four sub-cases here, based on whether we're talking about purposeful vs accidental interactions, and whether it learns about its interactive environment from first-hand experience vs indirect, abstract reasoning. Let's take them one at a time.

1A. Prevent it from learning first-hand about an accidental interactive environment

We need to make accidental interactions sufficiently hard to find that the system doesn't happen upon one by chance. This seems to me like a tractable cybersecurity problem.

1B. Prevent it from learning first-hand about a purposeful interactive environment

Here are some plausible-sounding approaches, and I'll leave details and gotchas to another post. (A) Maybe we can fix the training data before we start training, and don't feed it extra data based on its outputs. (B) Maybe we could induce amnesia (roll back the state of the network) between each output and causally-related subsequent input. (C) Maybe we can put "newfile" tokens between the outputs and causally-related subsequent inputs, and purge "short-term memory" each time we get to such a token (e.g. for an RNN, we can keep the model weights but reset the recurrent state). (D) Maybe we can also inspect the world model for concepts that span across a newfile token, by checking that P(x newfile) × P(newfile y) = P(x newfile y) for all x,y, where P is probability as computed by the world-model.

I think all of these are messier than they sound, but I bet at least one of those approaches would work.
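As a rough illustration of approach (C), here is a toy stand-in for a recurrent predictor that keeps its long-lived parameters but clears its short-term state at every newfile token. The class, the token string, and the counter standing in for weights are all hypothetical; a real RNN would reset its hidden-state tensor instead.

```python
NEWFILE = "<newfile>"  # hypothetical separator token

class ToyRecurrentPredictor:
    """Toy stand-in for an RNN with persistent weights and resettable state."""

    def __init__(self):
        self.tokens_trained_on = 0  # stands in for persistent model weights
        self.state = []             # stands in for the recurrent state

    def step(self, token):
        if token == NEWFILE:
            # Purge short-term memory; the "weights" are left untouched.
            self.state = []
            return
        self.state.append(token)
        self.tokens_trained_on += 1

model = ToyRecurrentPredictor()
for tok in ["material", "is", "pyrite", NEWFILE, "Pyrite", "films"]:
    model.step(tok)
```

After the loop, the state holds only the tokens seen since the separator, so nothing in short-term memory links the output "pyrite" to the article that followed it. This says nothing about what the weights themselves may have learned, which is part of why these approaches are messier than they sound.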

1C & 1D. Prevent it hypothesizing about an interactive environment (either accidental or purposeful) via indirect, abstract reasoning

This one seems a lot harder to me.

To prevent the system from making a connection between its own processing and its knowledge about how people use AGIs, it seems we need one of three possibilities:

(A) Data sanitation: The system doesn't know anything about what AGIs are or how people use them;
(B) Self-unawareness: The world-model doesn't know that it is an information-processing system that exists in the universe;
(C) Interpretability+oversight: We make an overseer that carefully watches the world-model-creation process, and if the system finds a conceptual transformation between its own operations and generic-knowledge-of-how-people-use-AGIs, we immediately delete that information.

Out of these three options, (A) seems utterly impossible to me. For (B), I really like the idea, but I haven't yet found a way to formalize it beyond the vague notion of "Isolate the world model from any reflective information about how that world-model is being created and processed". For (C), I dunno, maybe, but I can't currently see a path to doing that.

2. Prevent its knowledge of the interactive environment from affecting its predictions

Maybe there's a way to code in a strict time-horizon on how far ahead it's thinking when trying to understand the current context, and maybe we can make sure that this horizon falls between an output and any subsequent causally-related inputs—e.g. see the comment above about using the newfile token. This seems readily achievable for purposeful interactions, and awfully hard for accidental interactions (e.g. the HelpImTrappedInAGPU example). Is it enough that we only give it snippets with exactly one missing word? Maybe, but I'm a bit skeptical.

Or maybe there's some way to set up the system, and its search process, and its decision criteria such that it would pick Hypothesis 1 in both of those examples above. I mean, maybe this is an easy problem, and I'm just confused.

(Cf. update above about interstice's comment. Maybe this is a problem only caused by certain biases in evaluation heuristics, and maybe we can find heuristics that lack those specific types of biases yet are still practical.)

3. Don't put it in an interactive environment

We could set up our system to do self-supervised learning, then lock it in an underground bunker running under homomorphic encryption until it's done (see example AI box design). Maybe it will try to do manipulative things sometimes, e.g. if it uses indirect reasoning to guess that someone might be listening. But nothing bad will come of it.

Then, when that's done, we fetch the system out of the bunker and use it as the core world-model of a question-answering AGI oracle. We still need to ensure that the self-supervised learning system doesn't leave manipulative booby-traps in its world-model, but maybe that's an easier problem than worrying about every possible kind of interaction?

4. Give up, and just make an agent with value-aligned goals

I put this one in for completeness, but I think it should be a last resort. No one knows for sure what we'll need for AGI safety; we want lots of tools in the toolbox. I think it would be really valuable to know how to set up a self-supervised learning system to build a powerful predictive world-model while not acting dangerous and manipulative in the meantime. I don't think we should give up on that vision unless it's truly impossible.

Example 1 basically seems to be the problem of output diversity in generative models. This can be a problem in generative models, but there are ways around it. e.g. instead of outputting the highest-probability individual sequence, which will certainly look "manipulative" as you say, sample from the implied distribution over sequences. Then the sentence involving "pyrite" will be output with probability proportional to how likely the model thinks "pyrite" is on its own, disregarding subsequent tokens.
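A small sketch of the sampling idea, using invented numbers. It draws full (answer, article) sequences by ancestral sampling from a toy joint distribution and checks how often "pyrite" comes out first.

```python
import random

# Toy distributions; all numbers are invented for illustration.
prior = {"pyrite": 0.6, "hematite": 0.4}
articles = {
    "pyrite": [0.01] * 100,   # many individually unlikely ("confusing") texts
    "hematite": [0.5, 0.5],   # a few individually likely ("clear") texts
}

def sample_sequence(rng):
    """Ancestral sampling: draw the answer first, then an article given it."""
    answer = rng.choices(list(prior), weights=list(prior.values()))[0]
    n_articles = len(articles[answer])
    article = rng.choices(range(n_articles), weights=articles[answer])[0]
    return answer, article

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_sequence(rng) for _ in range(10_000)]
pyrite_rate = sum(1 for answer, _ in draws if answer == "pyrite") / len(draws)
```

The empirical rate of "pyrite" tracks its marginal probability of 0.6, regardless of how unpredictable the individual pyrite articles are.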

For example 2, I wrote a similar post a few months ago (and in fact, this idea seems to have been proposed and forgotten a few times on LW). But for gradient descent-based learning systems, I don't think the effect described will take place.

The reason is that gradient-descent-based systems are only updated towards what they actually observe. Let's say we're training a system to predict EU laws. If it predicts "The EU will pass potato laws..." but sees "The EU will pass corn laws..." the parameters will be updated to make "corn" more likely to have been output than "potato". There is no explicit global optimization for prediction accuracy.
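A minimal sketch of that local update, assuming a softmax output layer trained with cross-entropy on a single observed token. The three-word vocabulary, starting logits, and learning rate are invented for illustration.

```python
import math

VOCAB = ["potato", "corn", "plea"]  # toy vocabulary (assumed)

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(logits, observed, lr=1.0):
    """One gradient step on the cross-entropy loss for one observed token."""
    probs = softmax(logits)
    target = VOCAB.index(observed)
    # d(loss)/d(logit_i) = p_i - 1[i == target]; only the observed token
    # enters the gradient, nothing about later tokens does.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]

old_logits = [1.0, 0.0, -2.0]          # model currently favors "potato"
new_logits = sgd_step(old_logits, observed="corn")
before = softmax(old_logits)
after = softmax(new_logits)
```

The step raises the logit of the observed token and lowers every other logit. The update contains no term rewarding outputs that would make future inputs easier to predict.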

As you train to convergence, the predictions of the model will attempt to approach a fixed point, a set of predictions that imply themselves. However, due to the local nature of the update, this fixed-point will not be selected to be globally minimal; it will just be the first minimum the model falls into. (This is different from the problems with "local minima" you may have heard about in ordinary neural network training -- those go away in the infinite-capacity limit, whereas local minima among fixed-points do not.) The fixed-point should look something like "what I would predict if I output [what I would predict if I output [what I would predict .. ]]]" where the initial prediction is some random gibberish. This might look pretty weird, but it's not optimizing for global prediction accuracy.

Thank you for the links!! Sorry I missed them! I'm not sure I understand your comments though and want to clarify:

I'm going to try to rephrase what you said about example 1. Maybe the text in any individual journal article about pyrite is perplexing, but given that the system expects some article about pyrite there, it should ramp the probabilities of individual articles up or down such that the total probability of seeing a journal article about pyrite, conditional on the answer "pyrite", is 100%. (By the same token, "The following is a random number: 2113164" is, in a sense, an unsurprising text string.) I agree with you that a system that creates a sensible, self-consistent probability distribution for text strings would not have a problem with example 1 if we sample from that distribution. (Thanks.) I am concerned that we will build a system with heuristic-guided search processes, not self-consistent probability estimates, and that this system will have a problem with example 1. After all, humans are subject to the conjunction fallacy etc., I assume AGIs will be too, right? Unless we flag this as a critical safety requirement and invent good techniques to ensure it. (I updated the post in a couple places to clarify this point, thanks again.)

For gradient descent, yes they are "only updated towards what they actually observe", but they may "observe" high-level abstractions and not just low-level features. It can learn about a new high-level context in which the low-level word sequence statistics would be very different than when superficially-similar text appeared in the past. So I don't understand how you're ruling out example 2 on that basis.

I mostly agree with what you say about fixed points in principle, but with the additional complication that the system's beliefs may not reflect reality, especially if the beliefs come about through abstract reasoning (in the presence of imperfect information) rather than trial-and-error. If the goal is "No manipulative answers at all ever, please just try to predict the most likely masked bits in this data-file!", then hopefully that trial-and-error will not happen, and in this case I think fixed points become a less useful framework to think about what's going on.

No worries, I also missed the earlier posts when I wrote mine. There's lots of stuff on this website.

I endorse your rephrasing of example 1. I think my position is that it's just not that hard to create a "self-consistent probability distribution". For example, say you trained an RNN to predict sequences, like in this post. Despite being very simple, it already implicitly represents a probability distribution over sequences. If you train it with back-propagation on a confusing article involving pyrite, then its weights will be updated to try to model the article better. However, if "pyrite" itself was easy to predict, then the weights that lead to it outputting "pyrite" will *not* be updated. The same thing holds for modern Transformer networks, which predict the next token based only on what they have seen so far. (Here is a paper with a recent example using GPT-2. Note the degeneracy of maximum likelihood sampling, but how this becomes less of a problem when just sampling from the implied distribution.)

I agree that this sort of manipulative prediction could be a problem in principle, but it does not seem to occur in recent ML systems. (Although, there are some things which are somewhat like this; the earlier paper I linked and mode collapse do involve neglecting high-entropy components of the distribution. However, the most straightforward generation and training schemes do not incentivize this.)

For example 2, the point about gradient descent is this: while it might be the case that outputting "Help I'm stuck in a GPU Factory000" would ultimately result in a higher accuracy, the way the gradient is propagated would not encourage the agent to behave manipulatively. This is because, *locally*, "Help I'm stuck in a GPU Factory" decreases accuracy, so that behavior (or policies leading to it) will be dis-incentivized by gradient descent. It may be the case that this will result in easier predictions later, but the structure of the reward function does not lead to any optimization pressure towards such manipulative strategies. Learning taking place over high-level abstractions doesn't change anything, because any high-level abstractions leading to locally bad behavior will likewise be dis-incentivized by gradient descent.

Thanks, that's helpful! I'll have to think about the "self-consistent probability distribution" issue more, and thanks for the links. (ETA: Meanwhile I also added an "Update 2" to the post, offering a different way to think about this, which might or might not be helpful.)

Let me try the gradient descent argument again (and note that I am sympathetic, and indeed I made (what I think is) that exact argument a few weeks ago, cf. Self-Supervised Learning and AGI Safety, section title "Why won't it try to get more predictable data?"). My argument here is not assuming there's a policy of trying to get more predictable data for its own sake, but rather that this kind of behavior arises as a side-effect of an algorithmic process, and that all the ingredients of that process are either things we would program into the algorithm ourselves or things that would be incentivized by gradient descent.

The ingredients are things like "Look for and learn patterns in all accessible data", which includes both low-level patterns in the raw data, higher-level patterns in the lower-level patterns, and (perhaps unintentionally) patterns in accessible information about its own thought process ("After I visualize the shape of an elephant tusk, I often visualize an elephant shortly thereafter"). It includes searching for transformations (cause-effect, composition, analogies, etc.) between any two patterns it already knows about ("sneakers are a type of shoe", or more problematically, "my thought processes resemble the associative memory of an AGI"), and cataloging these transformations when they're found. Stuff like that.

So, "make smart hypotheses about one's own embodied situation" is definitely an unintended side-effect, and not rewarded by gradient descent as such. But as its world-model becomes more comprehensive, and as it continues to automatically search for patterns in whatever information it has access to, "make smart hypotheses about one's own embodied situation" would just be something that happens naturally, unless we somehow prevent it (and I can't see how to prevent it). Likewise, "model one's own real-world causal effects on downstream data" is neither desired by us nor rewarded (as such) by gradient descent. But it can happen anyway, as a side-effect of the usually-locally-helpful rule of "search through the world-model for any patterns and relationships which may impact our beliefs about the upcoming data". Likewise, we have the generally-helpful rule "Hypothesize possible higher-level contexts that span an extended swathe of text surrounding the next word to be predicted, and pick one such context based on how surprising it would be based on what it knows about the preceding text and the world-model, and then make a prediction conditional on that context". All these ingredients combine to get the pathological behavior of choosing "Help I'm trapped in a GPU". That's my argument, anyway...

The reply to interstice makes me think about logical uncertainty: if the predictor "reasons" about what to expect (internally engages in a sequence of computations which accounts for more structure as it thinks longer), then it is especially difficult to be approximately Bayesian (for all the classic reasons that logical uncertainty messes things up). So the argument that the described behaviour isn't logical doesn't really apply, because you have to deal with things like you mention where you spot an inconsistency in your probability distribution but you aren't sure how to deal with it.

This "reasoning" argument is related to the intuition you mention about search -- you imagine the system searching for sensible futures when deciding what to predict next. It doesn't make sense for a system to do that if the system is only learning conditional probabilities of the next token given history; there is no information to gain by looking ahead. However, there are a number of reasons why it could look ahead if it's doing something more complicated. It could be actively searching for good explanations of its history, and looking ahead to plausible futures might somehow aid that process. Or maybe it learns the more general blank-filling task rather than only the forward-prediction version where you fill in the future given the past; then it could benefit from consulting its own models that go in the other direction as a consistency check.

Still, I'm not convinced that strategic behavior gets incentivised. As you say in the post, we have to think through specific learning algorithms and what behaviour they encourage.

Glad you are thinking about this!

How about putting the system in an "interactive environment" in the sense that it sometimes gets new data, but not asking it to predict what new data it will get? (Or, for an even looser constraint, maybe in some cases it makes predictions about new data it will get, but it doesn't factor these predictions into things like the sentence completion task.)

Yeah, I think something like that would probably work for 1B, but 1B is the easy part. It's 1C & 1D that are keeping me up at night...

Can you be crisper about why you think 1C & 1D are necessary?

Well, strategy 1 is "Keep it from thinking that it's in an interactive environment". Things like "don't adjust the weights of the network while we ask questions" are a way to prevent it from thinking that it's in an interactive environment based on first-hand experience: we're engineering the experience to not leave traces in its knowledge. But to succeed in strategy 1, we also need to make sure that it doesn't come to believe it's in an interactive environment by other means besides first-hand experience, namely by abstract reasoning. More details in this comment, but basically an AGI with introspective information and world-knowledge will naturally over time figure out that it's an AGI, figure out the sorts of environments that AGIs are typically in, and thus hypothesize the existence of interactions even if those interactions have never happened before and were not intended by the designer (e.g. the "Help I'm trapped in a GPU!" type interactions).

Hm, I think we're talking past each other a bit. What I was trying to get at was: When we're doing self-supervised learning, we're optimizing an objective function related to the quality of the system's internal knowledge representations. My suggestion was that this internal objective function should have a term for the accuracy with which the system is able to predict masked bits of existing knowledge, but not a term for the accuracy of hypothesized future predictions a la beam search. Then we can use the system interactively as follows:

1. Give it some data.
2. Do self-supervised learning on the data, optimizing the quality of internal knowledge representations with a "short-sighted" objective function like I described.
3. Use these knowledge representations to make predictions of interest.
4. Repeat as needed.
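The loop above can be sketched with a deliberately tiny model. The bigram counter below is a hypothetical stand-in for the learner, and the example sentences are invented; it is "short-sighted" only in the sense that each update uses adjacent (previous, next) token pairs and no search over hypothesized futures.

```python
from collections import Counter, defaultdict

class BigramModel:
    """Toy self-supervised learner: counts which token follows which."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def learn(self, tokens):
        # Short-sighted objective: only adjacent (prev, next) pairs are used.
        for prev, nxt in zip(tokens, tokens[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, prev):
        options = self.counts.get(prev)  # .get avoids creating empty entries
        return options.most_common(1)[0][0] if options else None

model = BigramModel()
batches = [
    "the cell uses hematite".split(),
    "the cell uses hematite film".split(),
]
answers = []
for batch in batches:                       # give it some data
    model.learn(batch)                      # self-supervised learning
    answers.append(model.predict("uses"))   # make a prediction of interest
                                            # then repeat as needed
```

A model this simple has no way to represent its own effect on later batches, so it cannot show whether the worry in this thread applies to a more capable learner run in the same loop.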

What I'm looking for is a crisp description of why accurate self-knowledge (including knowledge of the interaction loop) is dangerous in this framework.

OK, hmm, let me try again then. This would be the section of the post entitled "A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker".

I've been assuming all along that the objective function only rewards the next word. Unfortunately, it seems that the way to achieve this objective in practice is to search for higher-level longer-term contexts that surround the next word, like when we're watching TV and we think, "A commercial break is starting." Knowing that a commercial break is starting is essential for predicting the very next frame on the TV screen, but it is also incidentally an (implicit) prediction about what will appear on the screen for the next few minutes. In other words, you could say that making accurate (possibly implicit) probabilistic predictions about the next many words is instrumentally useful for making accurate probabilistic predictions about the next one word, and is thus rewarded by the objective function. I expect that systems that work well will have to be designed this way (i.e. finding "contexts" that entail implicit predictions about many future words, as a step towards picking the single next word). I think this kind of thing is necessary to implement even very basic things like object permanence.

Then the next step is to suppose that the system (being highly intelligent) comes to believe that the prediction X will cause other aspects of the longer-term context to be Y. (See the "Hypothesis 1" vs "Hypothesis 2" examples in the post.) If the system was previously thinking that P(X) is high and P(Y) is low, then ideally, the realization that X implies Y will cause the system to raise P(Y), while keeping P(X) at its previous value. This is, after all, the logically correct update, based on the direction of causality!

But if the system screws up, and lowers P(X) instead of raising P(Y), then it will make a manipulative prediction—the output is being chosen partially for its downstream interactive effects. (Not all manipulative predictions are dangerous, and there might be limits to how strongly it optimizes its outputs for their downstream effects, but I suspect that this particular case can indeed lead to catastrophic outcomes, just like we generically expect from AIs with real-world human-misaligned goals.)

Why should the system screw up this way? Just because the system's causal models will sometimes have mistakes, and sometimes have uncertainties or blank spaces (statistical-regularities-of-unknown-cause), and also because humans make this type of mistake all the time ("One man's modus ponens is another man's modus tollens"). I suspect it will make the right update more often than chance; I just don't see how we can guarantee that it will never make the wrong update in the manipulative Y-->X direction.

Does that help?

Thanks for the thoughts!

This description seems rather different than your original beam search story, no? In your original story, you were describing an incentive the system had to direct the world in order to make it easier to predict. I don't see how this incentive arises here.

I'm not entirely convinced that predictions should be made in a way that's completely divorced from their effects on the world. For example, the prediction "You aren't going to think about ice cream" would appear to be self-falsifying. It seems like the most useful AI system would be one whose predictions tend to remain true even after being made.

(By the way, I hope I'm not coming across as antagonistic in this thread--I'm still replying because I think this is a really important topic and I'm hoping we can hammer it out together! And I think a crisp description of a problem is frequently the first step to solving it.)

This is great, thanks again for your time and thoughtful commentary!

RE "I'm not entirely convinced that predictions should be made in a way that's completely divorced from their effects on the world.": My vision is to make a non-agential question-answering AGI, thus avoiding value alignment. I don't claim that this is definitely the One Right Answer To AGI Safety (see "4. Give up, and just make an agent with value-aligned goals" in the post), but I think it is a plausible (and neglected) candidate answer. See also my post In defense of oracle (tool) AI research for why I think it would solve the AGI safety problem.

If an AGI applies its intelligence and world model to its own output, choosing that output partly for its downstream effects as predicted by the model, then I say it's a goal-seeking agent. In this case, we need to solve value alignment—even if the goal is as simple as "answer my question". (We would need to make sure that the goal is what it's supposed to be, as opposed to a proxy goal, or a weird alien interpretation where rewiring the operator's brain counts as "answer my question".) Again, I'm not opposed to building agents after solving value alignment, but we haven't solved value alignment yet, and thus it's worth exploring the other option: build a non-agent which does not intelligently model the downstream effects of its output at all (or if it does model it incidentally, to not do anything with that information).

Interfacing with a non-agential AGI is generally awkward. You can't directly ask it to do things, or to find a better way to communicate. My proposal here is to ask questions like "If there were no AGIs in the world, what's the likeliest way that a person would find a cure for Alzheimer's?" This type of question does not require the AGI to think through the consequences of its output, and it also has other nice properties (it should give less weird and alien and human-unfriendly answers than the solutions a direct goal-seeking agent would find).

OK, that's my grand vision and motivation, and why I'm hoping for "no reasoning about the consequences of one's output whatsoever", as opposed to finding self-fulfilling predictions. (Maybe very very mild optimization for the consequences of one's outputs is OK, but I'm nervous.)

Your other question was: if a system is making manipulative predictions, towards what goal is it manipulating? Well, you noticed correctly, I'm not sure, and I keep changing my mind. The answer may also differ depending on the details of the algorithm.

My top expectation is that it will manipulate towards getting further inputs that its model thinks are typical, high-probability inputs. If X implies Y, and P(Y) is low, that might sometimes spuriously push down P(X), and thus the system will pick those X's that result in high P(Y).
My secondary expectation is that it might manipulate towards unambiguous, low-entropy outputs. This is the expectation if the system picks out the single most likely ongoing long-term context, and outputs a prediction contingent on that. (If instead the system randomly draws from the probability distribution of all possible contexts, this wouldn't happen, as suggested by interstice's comments on this page.) So if X1 leads to one of 500 slightly different Y1's (Y1a, Y1b,...), while X2 definitely leads to only one specific Y2, then Y2 is probably the most likely single Y, even if all the Y1's in aggregate are likelier than Y2; so X2 is at an unfair advantage.
Beyond those two, I suspect there can be other goals but they depend on the algorithm and its heuristics.
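
The low-entropy case (X1 with 500 variants of Y1 versus X2 with a single Y2) can be checked numerically. The following is a rough Python sketch, not a model of any real system: the 0.6 / 0.4 split, the uniform spread over the Y1 variants, and all function names are assumptions made for illustration. It compares picking the single most likely context against sampling a context from the full distribution.

```python
import random


def build_contexts(p_x1=0.6, n_variants=500):
    """Joint distribution over (output, long-term context) pairs.

    X1 spreads its probability evenly over n_variants slightly different
    Y1 contexts; X2 puts all of its probability on one context, Y2.
    """
    contexts = {}
    for i in range(n_variants):
        contexts[("X1", f"Y1_{i}")] = p_x1 / n_variants
    contexts[("X2", "Y2")] = 1.0 - p_x1
    return contexts


def marginal(contexts, x):
    """Total probability of output x, summed over all of its contexts."""
    return sum(p for (xi, _y), p in contexts.items() if xi == x)


def predict_by_argmax(contexts):
    """Pick the single most likely context and output its X."""
    (x, _y), _p = max(contexts.items(), key=lambda kv: kv[1])
    return x


def predict_by_sampling(contexts, rng):
    """Draw one context in proportion to its probability and output its X."""
    keys = list(contexts)
    weights = [contexts[k] for k in keys]
    x, _y = rng.choices(keys, weights=weights, k=1)[0]
    return x


contexts = build_contexts()
# X1 is likelier in aggregate (0.6 vs 0.4), but each individual Y1 context
# has probability 0.0012, so the argmax rule lands on Y2 and outputs X2.
argmax_choice = predict_by_argmax(contexts)

rng = random.Random(0)
samples = [predict_by_sampling(contexts, rng) for _ in range(10_000)]
x1_share = samples.count("X1") / len(samples)  # should be close to 0.6
```

Under these assumed numbers the argmax rule always outputs X2 even though X1 is the likelier output overall, while sampling recovers roughly the true marginal.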