The Dualist Predict-O-Matic ($100 prize)

[-]evhub6y50

If dualism holds for Abram's prediction AI, the "Predict-O-Matic", its world model may happen to include this thing called the Predict-O-Matic which seems to make accurate predictions -- but it's not special in any way and isn't being modeled any differently than anything else in the world. Again, I think this is a pretty reasonable guess for the Predict-O-Matic's default behavior. I suspect other behavior would require special code which attempts to pinpoint the Predict-O-Matic in its own world model and give it special treatment (an "ego").

I don't think this is right. In particular, I think we should expect ML to be biased towards simple functions such that if there's a simple and obvious compression, then you should expect ML to take it. In particular, having an "ego" which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process.

[-]John_Maxwell6y10

In particular, I think we should expect ML to be biased towards simple functions such that if there's a simple and obvious compression, then you should expect ML to take it.

Yes, for the most part.

In particular, having an "ego" which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process.

I think maybe you're pre-supposing what you're trying to show. Most of the time, when I train a machine learning model on some data, that data isn't data about the ML training algorithm or model itself. This info is usually not part of the dataset whose description length the system is attempting to minimize.

A machine learning model doesn't get understanding of or data about its code "for free", in the same way we don't get knowledge of how brains work "for free" despite the fact that we are brains. Humans get self-knowledge in basically the same way we get any other kind of knowledge--by making observations. We aren't expert neuroscientists from birth. Part of what I'm trying to indicate with the "dualist" term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain.

[-]evhub6y20

Most of the time, when I train a machine learning model on some data, that data isn't data about the ML training algorithm or model itself.

If the data isn't at all about the ML training algorithm, then why would it even build a model of itself in the first place, regardless of whether it was dualist or not?

A machine learning model doesn't get understanding of or data about its code "for free", in the same way we don't get knowledge of how brains work "for free" despite the fact that we are brains.

We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don't have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an "ego").

Part of what I'm trying to indicate with the "dualist" term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain.

Also, if you think that, then I'm confused why you think this is a good safety property; human neuroscientists are precisely the sort of highly agentic misaligned mesa-optimizers that you presumably want to avoid when you just want to build a good prediction machine.

I think I didn't fully convey my picture here, so let me try to explain how I think this could happen. Suppose you're training a predictor and the data includes enough information about itself that it has to form some model of itself. Once that's happened--or while it's in the process of happening--there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself. A much simpler model would be one that just uses the same machinery for both, and since ML is biased towards simple models, you should expect it to be shared--which is precisely the thing you were calling an "ego."

[-]John_Maxwell6y*00

When you wrote

having an "ego" which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process.

that suggested to me that there were 2 instances of this info about Predict-O-Matic's decision-making process in the dataset whose description length we're trying to minimize. "De-duplication" only makes sense if there's more than one. Why is there more than one?

We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don't have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an "ego").

Sometimes people take psychedelic drugs/meditate and report an out of body experience, oneness with the universe, ego dissolution, etc. This suggests to me that ego is an evolved adaptation rather than a necessity for cognition. A clue is the fact that our ego extends to all parts of our body, even those which aren't necessary for computation (but are necessary for survival & reproduction).

there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself.

The prediction machinery is in code, but this code isn't part of the info whose description length we're trying to minimize, unless we take special action to include it in that info. That's the point I was trying to make previously.

Compression has important similarities to prediction. In compression terms, your argument is essentially that if we use zip to compress its own source code, it will be able to compress its own source code using a very small number of bytes, because it "already knows about itself".

[-]evhub6y10

that suggested to me that there were 2 instances of this info about Predict-O-Matic's decision-making process in the dataset whose description length we're trying to minimize. "De-duplication" only makes sense if there's more than one. Why is there more than one?

ML doesn't minimize the description length of the dataset—I'm not even sure what that might mean—rather, it minimizes the description length of the model. And the model does contain two copies of information about Predict-O-Matic's decision-making process—one in its prediction process and one in its world model.

The prediction machinery is in code, but this code isn't part of the info whose description length we're trying to minimize, unless we take special action to include it in that info. That's the point I was trying to make previously.

Modern predictive models don't have some separate hard-coded piece that does prediction—instead you just train everything. If you consider GPT-2, for example, it's just a bunch of transformers hooked together. The only information that isn't included in the description length of the model is what transformers are, but "what's a transformer" is quite different than "how do I make predictions." All of the information about how the model actually makes its predictions in that sort of a setup is going to be trained.

[-]John_Maxwell6y00

I think maybe what you're getting at is that if we try to get a machine learning model to predict its own predictions (i.e. we give it a bunch of data which consists of labels that it made itself), it will do this very easily. Agreed. But that doesn't imply it's aware of "itself" as an entity. And in some cases the relevant aspect of its internals might not be available as a conceptual building block. For example, a model trained using stochastic gradient descent is not necessarily better at understanding or predicting a process which is very similar to stochastic gradient descent.

Furthermore, suppose that we take the weights for a particular model, mask some of those weights out, use them as the labels y, and try to predict them using the other weights in that layer as features x. The model will perform terribly on this because it's not the task that it was trained for. It doesn't magically have the "self-awareness" necessary to see what's going on.

In order to be crisp about what could happen, your explanation also has to account for what clearly won't happen.

[-]evhub6y10

I think maybe what you're getting at is that if we try to get a machine learning model to predict its own predictions (i.e. we give it a bunch of data which consists of labels that it made itself), it will do this very easily. Agreed. But that doesn't imply it's aware of "itself" as an entity.

No, but it does imply that it has the information about its own prediction process encoded in its weights such that there's no reason it would have to encode that information twice by also re-encoding it as part of its knowledge of the world as well.

Furthermore, suppose that we take the weights for a particular model, mask some of those weights out, use them as the labels y, and try to predict them using the other weights in that layer as features x. The model will perform terribly on this because it's not the task that it was trained for. It doesn't magically have the "self-awareness" necessary to see what's going on.

Sure, but that's not actually the relevant task here. It may not understand its own weights, but it does understand its own predictive process, and thus its own output, such that there's no reason it would encode that information again in its world model.

[-]John_Maxwell6y*10

No, but it does imply that it has the information about its own prediction process encoded in its weights such that there's no reason it would have to encode that information twice by also re-encoding it as part of its knowledge of the world as well.

OK, it sounds like we agree then? Like, the Predict-O-Matic might have an unusually easy time modeling itself in certain ways, but other than that, it doesn't get special treatment because it has no special awareness of itself as an entity?

Edit: Trying to provide an intuition pump for what I mean here--in order to avoid duplicating information, I might assume that something which looks like a stapler behaves the same way as other things I've seen which look like staplers--but that doesn't mean I think all staplers are the same object. It might in some cases be sensible to notice that I keep seeing a stapler lying around and hypothesize that there's just one stapler which keeps getting moved around the office. But that requires that I perceive the stapler as an entity every time I see it, so entities which were previously separate in my head can be merged. Whereas arguendo, my prediction machinery isn't necessarily an entity that I recognize; it's more like the water I'm swimming in in some sense.

[-]evhub6y10

I don't think we do agree, in that I think pressure towards simple models implies that they won't be dualist in the way that you're claiming.

[-]abramdemski6y40

I'm not really sure what you mean when you say "something goes wrong" (in relation to the prize). I've been thinking about all this in a very descriptive way, ie, I want to understand what happens generally, not force a particular outcome. So I'm a little out-of-touch with the "goes wrong" framing at the moment. There are a lot of different things which could happen. Which constitute "going wrong"?

Becoming non-myopic; ie, using strategies which get lower prediction loss long-term rather than on a per-question basis.

(Note this doesn't necessarily mean planning to do so, in an inner-optimizer way.)

Making self-fulfilling prophecies in order to strategically minimize prediction loss on individual questions (while possibly remaining myopic).
Having a tendency for self-fulfilling prophecies at all (not necessarily strategically minimizing loss).
Having a tendency for self-fulfilling prophecies, but not necessarily the ones which society has currently converged to (eg, disrupting existing equilibria about money being valuable because everyone expects things to stay that way).
Strategically minimizing prediction loss in any way other than by giving better answers in an intuitive sense.
Manipulating the world strategically in any way, toward any end.
Catastrophic risk by any means (not necessarily due to strategic manipulation).

In particular, inner misalignment seems like something you aren't including in your "going wrong"? (Since it seems like an easy answer to your challenge.)

I note that the recursive-decomposition type system you describe is very different from most modern ML, and different from the "basically gradient descent" sort of thing I was imagining in the story. (We might naturally suppose that Predict-O-Matic has some "secret sauce" though.)

If you aren't already convinced, here's another explanation for why I don't think the Predict-O-Matic will make self-fulfilling prophecies by default.

In Abram's story, the engineer says: "The answer to a question isn't really separate from the expected observation. So 'probability of observation depending on that prediction' would translate to 'probability of an event given that event', which just has to be one."

In other words, if the Predict-O-Matic knows it will predict P = A, it assigns probability 1 to the proposition that it will predict P = A.

Right, basically by definition. The word 'given' was intended in the Bayesian sense, ie, conditional probability.

I contend that Predict-O-Matic doesn't know it will predict P = A at the relevant time. It would require time travel -- to know whether it will predict P = A, it will have to have made a prediction already, but it's still formulating its prediction as it thinks about what it will predict.

It's quite possible that the Predict-O-Matic has become relatively predictable-by-itself, so that it generally has good (not perfect) guesses about what it is about to predict. I don't mean that it is in an equilibrium with itself; its predictions may be shifting in predictable directions. If these shifts become large enough, or if its predictability goes second-order (it predicts that it'll predict its own output, and thus pre-anticipates the direction of shift recursively) it has to stop knowing its own output in so much detail (it's changing too fast to learn about). But it can possibly know a lot about its output.

I definitely agree with most of the stuff in the 'answering a question by having the answer' section. Whether a system explicitly makes the prediction into a fixed point is a critical question, which will determine which way some of these issues go.

If the system does, then there are explicit 'handles' to optimize the world by selecting which self-fulfilling prophecies to make true. We are effectively forced to deal with the issue (if only by random selection).
If the system doesn't, then we lack such handles, but the system still has to do something in the face of such situations. It may converge to self-fulfilling stuff. It may not, and so, produce 'inconsistent' outputs forever. This will depend on features of the learning algorithm as well as features of the situation it finds itself in.

It seems a bit like you might be equating the second option with "does not produce self-fulfilling prophecies", which I think would be a mistake.

[-]John_Maxwell6y20

Intuitively, things go wrong if you get unexpected, unwanted, potentially catastrophic behavior. Basically, if it's something we'd want to fix before using this thing in production. I think most of your bullet points qualify, but if you give an example which falls under one of those bullet points, yet doesn't seem like it'd be much of a concern in practice (very little catastrophic potential), that might not get a prize.

In particular, inner misalignment seems like something you aren't including in your "going wrong"? (Since it seems like an easy answer to your challenge.)

Thanks for bringing that up. Yes, I am looking specifically for defeaters aimed in the general direction of the points I made in this post. Bringing up generic widely known safety concerns that many designs are potentially susceptible to does not qualify.

I note that the recursive-decomposition type system you describe is very different from most modern ML, and different from the "basically gradient descent" sort of thing I was imagining in the story. (We might naturally suppose that Predict-O-Matic has some "secret sauce" though.)

I think there's potentially an analogy with attention in the context of deep learning, but it's pretty loose.

It seems a bit like you might be equating the second option with "does not produce self-fulfilling prophecies", which I think would be a mistake.

Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn't optimized for being so? Or are you trying to distinguish between "explicit" and "implicit" searches for fixed points? Or are you trying to distinguish between fixed points and self-fulfilling prophecies somehow? (I thought they were basically the same thing.)

[-]abramdemski6y10

Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn't optimized for being so? Or are you trying to distinguish between "explicit" and "implicit" searches for fixed points?

More the second than the first, but I'm also saying that the line between the two is blurry.

For example, suppose there is someone who will often do what predict-o-matic predicts if they can understand how to do it. They often ask it what they are going to do. At first, predict-o-matic predicts them as usual. This modifies their behavior to be somewhat more predictable than it normally would be. Predict-o-matic locks into the patterns (especially the predictions which work the best as suggestions). Behavior gets even more regular. And so on.

You could say that no one is optimizing for fixed-point-ness here, and predict-o-matic is just chancing into it. But effectively, there's an optimization implemented by the pair of the predict-o-matic and the person.

In situations like that, you get into an optimized fixed point over time, even though the learning algorithm itself isn't explicitly searching for that.
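
A toy sketch of this loop, not taken from the thread: the person has a baseline distribution over three actions and, with some probability, simply does whatever the predictor currently rates most likely. The predictor only tracks observed frequencies. The names `person_behaviour` and `run_loop`, the numbers, and the "follow the top prediction" rule are all assumptions made for illustration.

```python
def person_behaviour(belief, baseline, compliance):
    """Action distribution of a person who follows the predictor's top guess
    with probability `compliance`, and otherwise acts from their baseline."""
    top = max(range(len(belief)), key=lambda i: belief[i])
    return [
        (1.0 - compliance) * b + (compliance if i == top else 0.0)
        for i, b in enumerate(baseline)
    ]


def run_loop(baseline, compliance=0.5, learning_rate=0.1, steps=500):
    """The predictor tracks observed behaviour with a running average.
    Nothing here searches for fixed points explicitly."""
    belief = list(baseline)  # at first, it predicts the person "as usual"
    for _ in range(steps):
        behaviour = person_behaviour(belief, baseline, compliance)
        belief = [q + learning_rate * (x - q) for q, x in zip(belief, behaviour)]
    return belief


if __name__ == "__main__":
    baseline = [0.5, 0.3, 0.2]
    final = run_loop(baseline)
    print("baseline:", baseline)
    print("belief after the loop:", [round(q, 3) for q in final])
```

In this sketch the belief ends up matching the behaviour it induces, and that behaviour is more concentrated than the baseline, even though neither the update rule nor the person is looking for a fixed point.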

[-]abramdemski6y30

To highlight the "blurry distinction" more:

In situations like that, you get into an optimized fixed point over time, even though the learning algorithm itself isn't explicitly searching for that.

Note, if the prediction algorithm anticipates this process (perhaps partially), it will "jump ahead", so that convergence to a fixed point happens more within the computation of the predictor (less over steps of real world interaction). This isn't formally the same as searching for fixed points internally (you will get much weaker guarantees out of this haphazard process), but it does mean optimization for fixed point finding is happening within the system under some conditions.

[-]Lukas Finnveden6y*40

If dualism holds for Abram’s prediction AI, the “Predict-O-Matic”, its world model may happen to include this thing called the Predict-O-Matic which seems to make accurate predictions—but it’s not special in any way and isn’t being modeled any differently than anything else in the world. Again, I think this is a pretty reasonable guess for the Predict-O-Matic’s default behavior. I suspect other behavior would require special code which attempts to pinpoint the Predict-O-Matic in its own world model and give it special treatment (an “ego”).

I don't see why we should expect this. We're told that the Predict-O-Matic is being trained with something like sgd, and sgd doesn't really care about whether the model it's implementing is dualist or non-dualist; it just tries to find a model that generates a lot of reward. In particular, this seems wrong to me:

The Predict-O-Matic doesn't care about looking bad, and there's nothing contradictory about it predicting that it won't make the very prediction it makes, or something like that.

If the Predict-O-Matic has a model that makes bad predictions (i.e. looks bad), that model will be selected against. And if it accidentally stumbled upon a model that could correctly think about its own behaviour in a non-dualist fashion, and find fixed points, that model would be selected for (since its predictions come true). So at least in the limit of search and exploration, we should expect sgd to end up with a model that finds fixed points, if we train it in a situation where its predictions affect the future.

If we only train it on data where it can't affect the data that it's evaluated against, and then freeze the model, I agree that it probably won't exhibit this kind of behaviour; is that the scenario that you're thinking about?

[-]John_Maxwell6y10

it just tries to find a model that generates a lot of reward

SGD searches for a set of parameters which minimize a loss function. Selection, not control.

If the Predict-O-Matic has a model that makes bad predictions (i.e. looks bad), that model will be selected against.

Only if that info is included in the dataset that SGD is trying to minimize a loss function with respect to.

And if it accidentally stumbled upon a model that could correctly think about its own behaviour in a non-dualist fashion, and find fixed points, that model would be selected for (since its predictions come true).

Suppose we're running SGD trying to find a model which minimizes the loss over a set of (situation, outcome) pairs. Suppose some of the situations are situations in which the Predict-O-Matic made a prediction, and that prediction turned out to be false. It's conceivable that SGD could learn that the Predict-O-Matic predicting something makes it less likely to happen and use that as a feature. However, this wouldn't be helpful because the Predict-O-Matic doesn't know what prediction it will make at test time. At best it could infer that some of its older predictions will probably end up being false and use that fact to inform the thing it's currently trying to predict.

If we only train it on data where it can't affect the data that it's evaluated against, and then freeze the model, I agree that it probably won't exhibit this kind of behaviour; is that the scenario that you're thinking about?

Not necessarily. The scenario I have in mind is the standard ML scenario where SGD is just trying to find some parameters which minimize a loss function which is supposed to approximate the predictive accuracy of those parameters. Then we use those parameters to make predictions. SGD isn't concerned with future hypothetical rounds of SGD on future hypothetical datasets. In some sense, it's not even concerned with predictive accuracy except insofar as training data happens to generalize to new data.

If you think including historical observations of a Predict-O-Matic (which happens to be 'oneself') making bad (or good) predictions in the Predict-O-Matic's training dataset will cause a catastrophe, that's within the range of scenarios I care about, so please do explain!

By the way, if anyone wants to understand the standard ML scenario more deeply, I recommend this class.

[-]Lukas Finnveden6y*10

I think our disagreement comes from you imagining offline learning, while I'm imagining online learning. If we have a predefined set of (situation, outcome) pairs, then the Predict-O-Matic's predictions obviously can't affect the data that it's evaluated against (the outcome), so I agree that it'll end up pretty dualistic. But if we put a Predict-O-Matic in the real world, let it generate predictions, and then define the loss according to what happens afterwards, a non-dualistic Predict-O-Matic will be selected for over dualistic variants.
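
A minimal sketch of the offline/online distinction, under strong simplifying assumptions that are not from the thread: the prediction is a single number, the hypothetical `world` function maps an announced prediction to the outcome, and the learner does plain gradient descent on squared error. In both cases the gradient treats the outcome as a constant.

```python
def world(prediction):
    """Hypothetical environment: the outcome is pulled toward the announced prediction."""
    return 0.2 + 0.5 * prediction


def fit_offline(outcomes, steps=500, lr=0.1):
    """Gradient descent against a fixed, pre-recorded set of outcomes."""
    p = 0.0
    target = sum(outcomes) / len(outcomes)
    for _ in range(steps):
        p -= lr * 2.0 * (p - target)  # full-batch gradient of mean (p - y)^2
    return p


def fit_online(steps=500, lr=0.1):
    """Same update rule, but each outcome is generated after the prediction is announced."""
    p = 0.0
    for _ in range(steps):
        outcome = world(p)             # the world reacts to the prediction
        p -= lr * 2.0 * (p - outcome)  # gradient still treats the outcome as fixed
    return p


if __name__ == "__main__":
    recorded = [0.1, 0.2, 0.3]  # assumed historical outcomes, unaffected by any prediction
    p_off = fit_offline(recorded)
    p_on = fit_online()
    print("offline:", round(p_off, 4), "-> outcome if announced:", round(world(p_off), 4))
    print("online: ", round(p_on, 4), "-> outcome if announced:", round(world(p_on), 4))
```

In this toy setup the offline learner settles on the historical average, which stops being accurate once it is announced, while the online learner settles where the announced prediction and the resulting outcome coincide. Whether anything similar happens for a realistic model and environment is exactly what is under dispute here.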

If you still disagree with that, what do you think would happen (in the limit of infinite training time) with an algorithm that just made a random change proportional to how wrong it was, at every training step? Thinking about SGD is a bit complicated, since it calculates the gradient while assuming that the data stays constant, but if we use online training on an algorithm that just tries things until something works, I'm pretty confident that it'd end up looking for fixed points.
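
One way to make the "random change proportional to how wrong it was" algorithm concrete, as a sketch only: a single scalar prediction, a hypothetical `world` map from prediction to outcome, and a uniformly random step scaled by the absolute error. None of these specifics come from the thread.

```python
import random


def world(prediction):
    """Hypothetical environment in which the outcome depends on the prediction."""
    return 0.2 + 0.5 * prediction


def random_change_learner(steps=5000, seed=0, start=0.9):
    """No gradient and no explicit search: perturb the prediction at random,
    with a step size proportional to how wrong the last prediction was."""
    rng = random.Random(seed)
    p = start
    for _ in range(steps):
        error = abs(world(p) - p)
        p += rng.uniform(-1.0, 1.0) * error
    return p


if __name__ == "__main__":
    p = random_change_learner()
    print("final prediction:", p)
    print("resulting outcome:", world(p))
```

The only place this process can come to rest is a prediction with zero error, i.e. a fixed point of `world`. In this particular toy map it drifts there over time; that is a property of the example, not a general guarantee.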

[-]John_Maxwell6y10

But if we put a Predict-O-Matic in the real world, let it generate predictions, and then define the loss according to what happens afterwards, a non-dualistic Predict-O-Matic will be selected for over dualistic variants.

Yes, that sounds more like reinforcement learning. It is not the design I'm trying to point at in this post.

If you still disagree with that, what do you think would happen (in the limit of infinite training time) with an algorithm that just made a random change proportional to how wrong it was, at every training step?

That description sounds a lot like SGD. I think you'll need to be crisper for me to see what you're getting at.

[-]Lukas Finnveden6y10

Yes, that sounds more like reinforcement learning. It is not the design I'm trying to point at in this post.

Ok, cool, that explains it. I guess the main difference between RL and online supervised learning is whether the model takes actions that can affect its environment or only makes predictions of fixed data; so it seems plausible that someone training the Predict-O-Matic like that would think they're doing supervised learning, while they're actually closer to RL.

That description sounds a lot like SGD. I think you'll need to be crisper for me to see what you're getting at.

No need, since we already found the point of disagreement. (But if you're curious, the difference is that sgd makes a change in the direction of the gradient, and this one wouldn't.)

[-]John_Maxwell6y10

it seems plausible that someone training the Predict-O-Matic like that would think they're doing supervised learning, while they're actually closer to RL.

How's that?

[-]Lukas Finnveden6y10

Assuming that people don't think about the fact that Predict-O-Matic's predictions can affect reality (which seems like it might have been true early on in the story, although it's admittedly unlikely to be true for too long in the real world), they might decide to train it by letting it make predictions about the future (defining and backpropagating the loss once the future comes about). They might think that this is just like training on predefined data, but now the Predict-O-Matic can change the data that it's evaluated against, so there might be any number of 'correct' answers (rather than exactly 1). Although it's a blurry line, I'd say this makes its output more action-like and less prediction-like, so you could say that it makes the training process a bit more RL-like.

[-]John_Maxwell6y10

I think it depends on internal details of the Predict-O-Matic's prediction process. If it's still using SGD, SGD is not going to play the future forward to see the new feedback mechanism you've described and incorporate it into the loss function which is being minimized. However, it's conceivable that given a dataset about its own past predictions and how they turned out, the Predict-O-Matic might learn to make its predictions "more self-fulfilling" in order to minimize loss on that dataset?

[-]Lukas Finnveden6y10

SGD is not going to play the future forward to see the new feedback mechanism you’ve described and incorporate it into the loss function which is being minimized

My 'new feedback mechanism' is part of the training procedure. It's not going to be good at that by 'playing the future forward', it's going to become good at that by being trained on it.

I suspect we're using SGD in different ways, because everything we've talked about seems like it could be implemented with SGD. Do you agree that letting the Predict-O-Matic predict the future and rewarding it for being right, RL-style, would lead to it finding fixed points? Because you can definitely use SGD to do RL (first Google result).

[-]John_Maxwell6y10

I suspect we're using SGD in different ways, because everything we've talked about seems like it could be implemented with SGD. Do you agree that letting the Predict-O-Matic predict the future and rewarding it for being right, RL-style, would lead to it finding fixed points? Because you can definitely use SGD to do RL (first google result).

Fair enough, I was thinking about supervised learning.

[-]Bunthut6y*30

One possibility is that it's able to find a useful outside view model such as "the Predict-O-Matic has a history of making negative self-fulfilling prophecies". This could lead to the Predict-O-Matic making a negative prophecy ("the Predict-O-Matic will continue to make negative prophecies which result in terrible outcomes"), but this prophecy wouldn't be selected for being self-fulfilling. And we might usefully ask the Predict-O-Matic whether the terrible self-fulfilling prophecies will continue conditional on us taking Action A.

Maybe I misunderstood what you mean by dualism, but I don't think that's true. Say the Predict-O-Matic has an outside view model (of itself) like "The metal box on your desk (the Predict-O-Matic) will make a self-fulfilling prophecy that maximizes the number of paperclips". Then you ask it how likely it is that your digital records will survive for 100 years. It notices that that depends significantly on how much effort you make to secure them. It notices that that significantly depends on what the metal box on your desk tells you. It uses its low-resolution model of what the box says. To work that out, it checks which outputs would be self-fulfilling, and then which of these leads to the most paperclips. The less secure your digital records are, the more you will invest in paper, and the more paperclips you will need. Therefore the metal box will tell you the lowest self-fulfilling probability for your question. Since that number is *self-fulfilling*, it is in fact the correct answer, and the Predict-O-Matic will answer with it.
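
The two-step procedure in this example (find the self-fulfilling outputs, then pick the one the low-resolution model says the box prefers) can be sketched as follows. The `response` function, the grid of candidate answers, and the `paperclip_proxy` score are all hypothetical stand-ins invented for illustration.

```python
def response(announced):
    """Hypothetical: probability the records survive, given the announced probability.
    A high announced number makes the owner invest in security; a low one does not."""
    return 0.9 if announced >= 0.5 else 0.1


def self_fulfilling(candidates, tol=1e-9):
    """Keep the announcements that come true once announced."""
    return [q for q in candidates if abs(response(q) - q) <= tol]


def paperclip_proxy(survival_probability):
    """Hypothetical score: less secure digital records mean more paper and more paperclips."""
    return 1.0 - survival_probability


if __name__ == "__main__":
    grid = [i / 100 for i in range(101)]
    fixed_points = self_fulfilling(grid)
    answer = max(fixed_points, key=paperclip_proxy)
    print("self-fulfilling answers:", fixed_points)
    print("answer the box would give:", answer)
```

Note that the sketch never simulates the box's detailed computation; it only uses the coarse description "self-fulfilling, and best for paperclips", which is the shortcut the comment relies on.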

I think this avoids your argument that

I contend that Predict-O-Matic doesn't know it will predict P = A at the relevant time. It would require time travel -- to know whether it will predict P = A, it will have to have made a prediction already, but it's still formulating its prediction as it thinks about what it will predict.

because it doesn't have to simulate itself in detail to know what the metal box (it) will do. The low-resolution model provides a shortcut around that, but it will be accurate despite the low resolution, because by believing it is simple, it becomes simple.

Can you usefully ask for conditionals? Maybe. The answer to the conditional depends on what worlds you are likely to take Action A in. It might be that in most worlds where you do A, you do it because of a prediction from the metal box, and since we know those maximize paperclips, there's a good chance the action will fail to prevent it in those circumstances. But that might not be the case, for example because it's certain you won't ask the box any more questions between this one and the event it tries to predict.

It might be possible to avoid any problems of this sort by only ever asking questions of the type "Will X happen if I do Y now (with no time to receive new info between hearing the prediction and doing the action)?", because by backwards induction the correct answer will not depend on what you actually do. This doesn't avoid the scenarios in the original story where multiple people act on their Predict-O-Matics, but I suspect these aren't solvable without coordination.

[-]Vanessa Kosoy6y20

Two remarks.

Remark 1: Here's a simple model of self-fulfilling prophecies.

First, we need to decide how Predict-O-Matic outputs its predictions. In principle, it could (i) produce the maximum likelihood outcome, (ii) produce the entire distribution over outcomes, or (iii) sample an outcome from the distribution. But, since Predict-O-Matic is supposed to produce predictions for large-volume data (e.g. the inauguration speech of the next US president, or the film that will win the Oscar in 2048), the most sensible option is (iii). Option (i) can produce an outcome that is maximum likelihood but is extremely atypical (since every individual outcome has very low probability), so it is not very useful. Option (ii) requires somehow producing an exponentially large vector of numbers, so it's infeasible. More sophisticated variants are possible, but I don't think any of them avoids the problem.
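
To make the point about option (i) concrete, here is a small sketch using a stand-in for large-volume data: a sequence of 1000 independent coin flips that land heads with probability 0.9. The coin, the sequence length and the seed are assumptions chosen for illustration. The single most likely sequence is all heads, yet it looks nothing like a typical draw, which has roughly 10% tails.

```python
import math
import random


def log_prob(seq, p_heads):
    """Log-probability of one specific sequence of independent flips."""
    return sum(math.log(p_heads if flip else 1.0 - p_heads) for flip in seq)


def sample_sequence(n, p_heads, rng):
    """Option (iii): draw one sequence from the distribution."""
    return [rng.random() < p_heads for _ in range(n)]


n, p_heads = 1000, 0.9
rng = random.Random(0)

# Option (i): with p_heads > 0.5 the single most likely sequence is all heads.
most_likely = [True] * n
sampled = sample_sequence(n, p_heads, rng)

ml_tails = most_likely.count(False) / n    # exactly 0.0
sampled_tails = sampled.count(False) / n   # near 0.1 for a typical draw

print("tails fraction, most likely sequence:", ml_tails)
print("tails fraction, sampled sequence:", sampled_tails)
print("log-prob of most likely sequence:", log_prob(most_likely, p_heads))
```

Even the most likely sequence has a log-probability of about $1000 \ln 0.9 \approx -105$, which is the "every individual outcome has very low probability" part of the argument.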

If the Predict-O-Matic is a Bayesian inference algorithm, an interesting dynamic will result. On each round, some hypothesis will be sampled out of the current belief state. If this hypothesis is a self-fulfilling prophecy, sampling it will cause its likelihood to go up. We get positive feedback: the higher the probability Predict-O-Matic assigns to the hypothesis, the more often it is sampled, the more evidence in favor of the hypothesis is produced, the higher its probability becomes. So, if it starts out as sufficiently probable a priori, the belief state will converge there.
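
The feedback loop can be sketched with just two hypotheses. Everything specific here is an assumption made for illustration: a hypothesis called `fulfilling` says an event happens with probability 0.9, a hypothesis called `baseline` says 0.5, and the environment makes the event happen with probability 0.9 on rounds where `fulfilling` was sampled and announced, and 0.5 otherwise.

```python
import random

# P(event | hypothesis), as claimed by each hypothesis (illustrative numbers).
LIKELIHOOD = {"fulfilling": 0.9, "baseline": 0.5}


def bayes_update(p_fulfilling, event):
    """Posterior probability of `fulfilling` after observing one round."""
    lf = LIKELIHOOD["fulfilling"] if event else 1.0 - LIKELIHOOD["fulfilling"]
    lb = LIKELIHOOD["baseline"] if event else 1.0 - LIKELIHOOD["baseline"]
    numerator = p_fulfilling * lf
    return numerator / (numerator + (1.0 - p_fulfilling) * lb)


def run(p_fulfilling, rounds, rng):
    """Sample a hypothesis, let the announcement shape the outcome, then update."""
    history = [p_fulfilling]
    for _ in range(rounds):
        sampled = "fulfilling" if rng.random() < p_fulfilling else "baseline"
        # Assumed environment: the event rate depends on which hypothesis was announced.
        event = rng.random() < LIKELIHOOD[sampled]
        p_fulfilling = bayes_update(p_fulfilling, event)
        history.append(p_fulfilling)
    return history


history = run(p_fulfilling=0.7, rounds=2000, rng=random.Random(0))
print("final belief in the self-fulfilling hypothesis:", history[-1])
```

In this particular toy the belief should end up close to 0 or close to 1 rather than staying in between, and the higher the prior on `fulfilling`, the more often a run locks onto it. That is only the toy's behaviour, not a general result about learning algorithms.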

Of course realistic learning algorithms are not Bayesian inference, but they have to approximate Bayesian inference in some sense. At the least, there has to be some large space of hypotheses s.t. if one of them is true, the algorithm will converge there. Any algorithm with this property probably displays the dynamics above.

Now, to the simple model. In this model we have just two outcomes: A and B (so it's not large volume data, but that doesn't matter). On each round a prediction is made, after which some outcome occurs. The true environment works as follows: if prediction "A" is made, on this round A happens with probability 99% and B with probability 1%. If prediction "B" is made, on this round B happens with probability 100%. Of course Predict-O-Matic is not aware that predictions can influence outcomes. Instead, we will assume Predict-O-Matic is doing Bayesian inference with a prior over hypotheses, each of which assumes that the environment is IID. In other words, it is learning a single parameter $p$ which is the probability A will occur on any given round.

Claim: If the prior is s.t. any interval in $p$-space is assigned positive probability, then Predict-O-Matic will converge to predicting B with frequency 1.

Sketch of proof: If Predict-O-Matic converges to predicting B with frequency $\geq f$ for some $f < 1$, then the environment converges to producing outcome B with frequency $f' = 0.01 (1 - f) + f > f$, implying that Predict-O-Matic converges to predicting B with frequency $\geq f'$.
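
The sketch is a bootstrapping argument: any lower bound $f < 1$ can be fed back in to get a strictly better one. Applying the map $n$ times gives $1 - f_n = 0.99^n (1 - f_0)$, so the bound is pushed to 1. The snippet below just iterates the map to show this; the names `improve_bound`, `iterate_bound` and `slip` are mine, and each step is one application of the argument, not one prediction round.

```python
def improve_bound(f, slip=0.01):
    """One application of the map f -> slip * (1 - f) + f from the proof sketch."""
    return slip * (1.0 - f) + f


def iterate_bound(f0, steps, slip=0.01):
    """Apply the map repeatedly, starting from the lower bound f0."""
    f = f0
    for _ in range(steps):
        f = improve_bound(f, slip)
    return f


f_1000 = iterate_bound(0.0, 1000)
closed_form = 1.0 - 0.99 ** 1000   # from 1 - f_n = 0.99**n * (1 - f_0)

print("bound after 1000 applications:", f_1000)
print("closed form:", closed_form)
```

The only fixed point of the map is $f = 1$, which is why no bound below 1 can be the final one.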

Remark 2: Some of the hypotheses in the prior might be intelligent agents in their own right, with their own utility functions. Such an agent can intentionally produce correct predictions to increase its probability in the belief state, until a "treacherous turn" point when it produces a prediction designed to have irreversible consequences in the outside world in favor of the agent. If it is not a self-fulfilling prophecy, this treacherous prediction will cause Predict-O-Matic to update against the agentic hypothesis, but it might be too late. If it is a self-fulfilling prophecy, it will only make this hypothesis even stronger.

Moreover, there is a mechanism that systematically produces such agentic hypotheses. Namely, a sufficiently powerful predictor is likely to run into "simulation hypotheses" i.e. hypotheses that claim the universe is a simulation by some other agent. As Christiano argued before, that opens an attack vector for powerful agents across the multiverse to manipulate Predict-O-Matic into making whatever predictions they want (assuming Predict-O-Matic is sufficiently powerful to guess what predictions those agents would want it to make).
