Recent Discussion

Impact measurement and value-neutrality verification
16 · 4d · 6 min read

Some clarifications I got from Evan (evhub) on MIRIxDiscord:

  1. AI not being value-neutral is one way that the strategy-stealing assumption might be false, and therefore one thing we can work on if we want to make the strategy-stealing assumption true (or true to the extent possible).
  2. It's not clear if "AI not being value-neutral" falls into one of Paul's 11 failure scenarios for strategy-stealing. The closest seems to be failure #1, "AI alignment", but one could also argue that an AI can be aligned but still not value-neutral.
  3. The "neutrality" measure f
... (Read more)
3 · Ofer G. · 3d
Very interesting! Regarding value-neutrality verification: If deceptive alignment occurs, the model might output whatever minimizes the neutrality measure, as an instrumental goal [ETA: and it might not do that when it detects that it is currently not being used for computing the neutrality measure]. In such a case it seems that a successful verification step shouldn't give us much assurance about the behavior of the model.
2 · Evan Hubinger · 3d
Note that the model's output isn't what's relevant for the neutrality measure; it's the algorithm it's internally implementing. That being said, this sort of trickery is still possible if your model is non-myopic, which is why it's important to have some sort of myopia guarantee.
7 · Alex Turner · 4d
I really like this view. An additional frame of interest is: signed neutrality (just remove the absolute value) as a measure of opportunity cost propensity. That is, highly non-neutral policies lead to polarizing opportunity costs. For example, consider a maze in which half your possible destinations lie through the one-way door on the left, and half through the one-way door on the right. All policies which go anywhere are highly "polarizing" / non-neutral.

I agree that this notion of neutrality is also a facet of the "power/impact" phenomenon. However, I'm not sure I follow this part:

"We can think of actions as having objective impact to the extent that they change the distribution over which values have control over which resources—that is, the extent to which they are not value-neutral. Or, phrased another way, actions have objective impact to the extent that they break the strategy-stealing assumption."

Avoiding deactivation is good for almost all goals, so there isn't much stdev under almost any Y? Or maybe you're using "objective impact" in a slightly different sense here? In any case, I think I get what you're pointing at.
3 · Evan Hubinger · 3d
You're right, I think the absolute value might actually be a problem—you want the policy to help/hurt all values relative to no-op equally, not hurt some and help others. I just edited the post to reflect that.

As for the connection between neutrality and objective impact, I think this is related to a confusion that Wei Dai pointed out, which is that I was sort of waffling between two different notions of strategy-stealing, those being:

  1. strategy-stealing relative to all the agents present in the world (i.e. is it possible for your AI to steal the strategies of other agents in the world), and
  2. strategy-stealing relative to a single AI (i.e. if that AI were copied many times and put in service of many different values, would it advantage some over others).

If you believe that most early AGIs will be quite similar in their alignment properties (as I generally do, since I believe that copy-and-paste is quite powerful and will generally be preferred over designing something new), then these two notions of strategy-stealing match up, which was why I was waffling between them. However, conceptually they are quite distinct. In terms of the connection between neutrality and objective impact, I think there I was thinking about strategy-stealing in terms of notion 1, whereas for most of the rest of the post I was thinking about it in terms of notion 2. In terms of notion 1, objective impact is about changing the distribution of resources among all the agents in the world.
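The signed-vs-absolute distinction in this exchange can be made concrete with a toy calculation. This is my own hypothetical formalization (the function names and the use of a standard deviation over signed advantages are assumptions, not necessarily the post's exact measure): non-neutrality as the spread of signed per-value gains relative to no-op, so a policy that helps or hurts every value equally scores zero, while a polarizing policy does not.

```python
import statistics

def non_neutrality(policy_values, noop_values):
    """Spread of signed per-value advantages relative to no-op: 0 means the
    policy helps (or hurts) every value by exactly the same amount.
    Hypothetical sketch; the post's actual measure may differ."""
    advantages = [p - n for p, n in zip(policy_values, noop_values)]
    return statistics.pstdev(advantages)

# A policy that helps every value equally is perfectly neutral:
assert non_neutrality([3, 4, 5], [1, 2, 3]) == 0.0
# A polarizing policy (helps some values, hurts others) is non-neutral:
assert non_neutrality([5, 0], [1, 2]) > 0
```

Keeping the advantages signed (rather than taking absolute values) is what lets this measure distinguish "helps everyone equally" from "helps some and hurts others by the same magnitude," which is the point of Evan's correction above.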
Random Thoughts on Predict-O-Matic
13 · 20h · 9 min read

I'm going to be a bit more explicit about some ideas that appeared in The Parable of Predict-O-Matic. (If you don't want spoilers, read it first. Probably you should read it first anyway.)

[Note: while the ideas here are somewhat better than the ideas in the predict-o-matic story, they're equally rambling, without the crutch of the story to prop them up. As such, I expect readers to be less engaged. Unless you're especially interested in which character's remarks are true (or at least, which ones I stand by), this might be a post to skim; I don't think it has enoug... (Read more)

In a recent post, Wei Dai mentions a similar distinction (italics added by me):

Supervised training—This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing

... (Read more)
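The supervised setup Wei Dai describes (only the automated loss computation ever sees the oracle's output during training) can be sketched as a loop where outcomes are recorded before any prediction is made, so predictions cannot leak into the world and confirm themselves. The one-parameter "oracle" and all names here are illustrative assumptions of mine:

```python
def train(data, lr=0.1, epochs=100):
    """Train a one-parameter 'oracle' (predict w * situation) with SGD.
    The outcomes in `data` were recorded before the oracle ran, and the
    prediction is seen only by this automated loss computation, so there
    is no channel for self-confirming predictions."""
    w = 0.0
    for _ in range(epochs):
        for situation, recorded_outcome in data:
            prediction = w * situation
            # Squared-error loss; its gradient drives the update.
            grad = 2 * (prediction - recorded_outcome) * situation
            w -= lr * grad
    return w

# With pre-recorded outcomes, the oracle simply fits them (here, w -> 2):
w = train([(1.0, 2.0), (2.0, 4.0)])
assert abs(w - 2.0) < 1e-6
```

The safety-relevant property is structural, not about the learning rule: nothing downstream of `prediction` ever feeds back into `recorded_outcome`.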
Technical AGI safety research outside AI
10 · 4h · 3 min read

I think there are many questions whose answers would be useful for technical AGI safety research, but which will probably require expertise outside AI to answer. In this post I list 30 of them, divided into four categories. Feel free to get in touch if you’d like to discuss these questions and why I think they’re important in more detail. I personally think that making progress on the ones in the first category is particularly vital, and plausibly tractable for researchers from a wide range of academic backgrounds.

Studying and understanding safety problems

  1. How strong are the econo
... (Read more)
The Dualist Predict-O-Matic ($100 prize)
4 · 2d · 4 min read

This is a response to Abram's The Parable of Predict-O-Matic, but you probably don't need to read Abram's post to understand mine. While writing this, I thought of a way in which I think things could go wrong with a dualist Predict-O-Matic, which I plan to post in about a week. I'm offering a $100 prize to the first commenter who's able to explain how things might go wrong in a sufficiently crisp way before I make my follow-up post.


Currently, machine learning algorithms are essentially "Cartesian dualists" when it comes to themselves and their environment. (Not a philosophy major -- let

... (Read more)
1 · Abram Demski · 10h
"Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn't optimized for being so? Or are you trying to distinguish between 'explicit' and 'implicit' searches for fixed points?"

More the second than the first, but I'm also saying that the line between the two is blurry. For example, suppose there is someone who will often do what predict-o-matic predicts if they can understand how to do it. They often ask it what they are going to do. At first, predict-o-matic predicts them as usual. This modifies their behavior to be somewhat more predictable than it normally would be. Predict-o-matic locks into the patterns (especially the predictions which work the best as suggestions). Behavior gets even more regular. And so on.

You could say that no one is optimizing for fixed-point-ness here, and predict-o-matic is just chancing into it. But effectively, there's an optimization implemented by the pair of the predict-o-matic and the person. In situations like that, you get into an optimized fixed point over time, even though the learning algorithm itself isn't explicitly searching for that.
1 · John_Maxwell · 14h
"it just tries to find a model that generates a lot of reward"

SGD searches for a set of parameters which minimize a loss function. Selection, not control.

"If the Predict-O-Matic has a model that makes bad predictions (i.e. looks bad), that model will be selected against."

Only if that info is included in the dataset that SGD is trying to minimize a loss function with respect to.

"And if it accidentally stumbled upon a model that could correctly think about its own behaviour in a non-dualist fashion, and find fixed points, that model would be selected for (since its predictions come true)."

Suppose we're running SGD trying to find a model which minimizes the loss over a set of (situation, outcome) pairs. Suppose some of the situations are situations in which the Predict-O-Matic made a prediction, and that prediction turned out to be false. It's conceivable that SGD could learn that the Predict-O-Matic predicting something makes it less likely to happen and use that as a feature. However, this wouldn't be helpful, because the Predict-O-Matic doesn't know what prediction it will make at test time. At best it could infer that some of its older predictions will probably end up being false and use that fact to inform the thing it's currently trying to predict.

"If we only train it on data where it can't affect the data that it's evaluated against, and then freeze the model, I agree that it probably won't exhibit this kind of behaviour; is that the scenario that you're thinking about?"

Not necessarily. The scenario I have in mind is the standard ML scenario where SGD is just trying to find some parameters which minimize a loss function which is supposed to approximate the predictive accuracy of those parameters. Then we use those parameters to make predictions. SGD isn't concerned with future hypothetical rounds of SGD on future hypothetical datasets. In some sense, it's not even concerned with predic

I think our disagreement comes from you imagining offline learning, while I'm imagining online learning. If we have a predefined set of (situation, outcome) pairs, then the Predict-O-Matic's predictions obviously can't affect the data that it's evaluated against (the outcome), so I agree that it'll end up pretty dualistic. But if we put a Predict-O-Matic in the real world, let it generate predictions, and then define the loss according to what happens afterwards, a non-dualistic Predict-O-Matic will be selected for over dualistic v... (Read more)
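The online case in this disagreement can be illustrated with a toy loop (entirely my own sketch, with made-up numbers): the published prediction partially steers the outcome it is then scored against, and an online learner settles into a self-fulfilling fixed point rather than the no-influence base rate.

```python
def run(influence, base=10.0, lr=0.5, steps=200):
    """Online learner whose published prediction steers the outcome it is
    scored against, with weight `influence` (0 = pure offline-style world)."""
    pred = 0.0
    for _ in range(steps):
        outcome = base + influence * pred  # the world partly follows the prediction
        pred += lr * (outcome - pred)      # update toward the observed outcome
    return pred

# With no feedback channel, the learner converges to the base rate:
assert abs(run(influence=0.0) - 10.0) < 1e-6
# With feedback, it converges to the self-fulfilling fixed point
# p* solving p = base + influence * p, i.e. base / (1 - influence) = 20:
assert abs(run(influence=0.5) - 20.0) < 1e-6
```

Nothing in the update rule "searches for" fixed points; the fixed point emerges from the loop between predictor and world, which is the distinction the thread is circling.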

1 · John_Maxwell · 14h
"In particular, I think we should expect ML to be biased towards simple functions such that if there's a simple and obvious compression, then you should expect ML to take it."

Yes, for the most part.

"In particular, having an 'ego' which identifies itself with its model of itself significantly reduces description length by not having to duplicate a bunch of information about its own decision-making process."

I think maybe you're pre-supposing what you're trying to show. Most of the time, when I train a machine learning model on some data, that data isn't data about the ML training algorithm or model itself. This info is usually not part of the dataset whose description length the system is attempting to minimize. A machine learning model doesn't get understanding of or data about its code "for free", in the same way we don't get knowledge of how brains work "for free" despite the fact that we are brains. Humans get self-knowledge in basically the same way we get any other kind of knowledge--by making observations. We aren't expert neuroscientists from birth. Part of what I'm trying to indicate with the "dualist" term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain.
1 · Evan Hubinger · 11h
"Most of the time, when I train a machine learning model on some data, that data isn't data about the ML training algorithm or model itself."

If the data isn't at all about the ML training algorithm, then why would it even build a model of itself in the first place, regardless of whether it was dualist or not?

"A machine learning model doesn't get understanding of or data about its code 'for free', in the same way we don't get knowledge of how brains work 'for free' despite the fact that we are brains."

We might not have good models of brains, but we do have very good models of ourselves, which is the actual analogy here. You don't have to have a good model of your brain to have a good model of yourself, and to identify that model of yourself with your own actions (i.e. the thing you called an "ego").

"Part of what I'm trying to indicate with the 'dualist' term is that this Predict-O-Matic is the same way, i.e. its position with respect to itself is similar to the position of an aspiring neuroscientist with respect to their own brain."

Also, if you think that, then I'm confused about why you think this is a good safety property; human neuroscientists are precisely the sort of highly agentic misaligned mesa-optimizers that you presumably want to avoid when you just want to build a good prediction machine.

I think I didn't fully convey my picture here, so let me try to explain how I think this could happen. Suppose you're training a predictor and the data includes enough information about itself that it has to form some model of itself. Once that's happened (or while it's in the process of happening) there is a massive duplication of information between the part of the model that encodes its prediction machinery and the part that encodes its model of itself. A much simpler model would be one that just uses the same machinery for both, and since ML is biased towards simple models, you should expect it to be shared--which is precisely the thing you were calling an "ego."
Relaxed adversarial training for inner alignment
15 · 1mo · 27 min read

For the Alignment Newsletter:


Previously, Paul Christiano proposed creating an adversary to search for inputs that would make a powerful model behave "unacceptably" and then penalizing the model accordingly. To make the adversary's job easier, Paul relaxed the problem so that it only needed to find a pseudo-input, which can be thought of as a distribution over possible inputs. This post expands on Paul's proposal by first defining a formal unacceptability penalty and then analyzing a number of scenarios in light of this framewor... (Read more)

The Parable of Predict-O-Matic
34 · 4d · 13 min read

I've been thinking more about partial agency. I want to expand on some issues brought up in the comments to my previous post, and on other complications which I've been thinking about. But for now, a more informal parable. (Mainly because this is easier to write than my more technical thoughts.)

This relates to oracle AI and to inner optimizers, but my focus is a little different.


Suppose you are designing a new invention, a predict-o-matic. It is a wonderous machine which will predict everything for us: weather, politics, the newest advances in quantum physics, you name it. The machi... (Read more)

4 · Vanessa Kosoy · 2d
This was extremely entertaining and also had good points. For now, just one question:

"...The intern was arguing that minimizing prediction error would have all kinds of unintended bad effects. Which was crazy enough. The engineer was worse: they were arguing that Predict-O-Matic might maximize prediction error! Some kind of duality principle. Minimizing in one direction means maximizing in the other direction. Whatever that means."

Is this a reference to duality in optimization? If so, I don't understand the formal connection.
5 · Abram Demski · 2d
No, it was a speculative conjecture which I thought of while writing. The idea is that incentivizing agents to lower the error of your predictions (as in a prediction market) looks exactly like incentivizing them to "create" information (find ways of making the world more chaotic), and this is no coincidence. So perhaps there's a more general principle behind it, where trying to incentivize minimization of f(x,y) only through channel x (e.g., only by improving predictions) results in an incentive to maximize f through y, under some additional assumptions. Maybe there is a connection to optimization duality in there. In terms of the fictional canon, I think of it as the engineer trying to convince the boss by simplifying things and making wild but impressive-sounding conjectures. :)
1 · Isnasene · 3d
Don't mind me; just trying to summarize some of the stuff I just processed.

If you're choosing a strategy for predicting the future based on how accurate it turns out to be, a strategy whose output influences the future in ways that make its prediction more likely will outperform a strategy that doesn't (all else being equal). Thus, one might think that the strategy you choose will be the one that most effectively balances a) how accurate its prediction is (unconditioned on the prediction being given) and b) how much the prediction itself improves the accuracy of the prediction (conditioning on the prediction). Because of this, the intern predicts that the world will be made more predictable than it would be normally. In short, you'll tend to choose the prediction strategies that give self-fulfilling predictions when possible over those that don't.

However, choosing the strategy that predicts the future most accurately is also equivalent to throwing away every strategy that doesn't predict the future the best. In the same way that self-fulfilling predictions are good for prediction strategies because they enhance the accuracy of the strategy in question, self-fulfilling predictions that seem generally surprising to outside observers are even better, because they lower the accuracy of competing strategies. The established prediction strategy thus systematically causes the kinds of events in the world that no other method could predict, further establishing itself. Because of this, the engineer predicts that the world will become less predictable than it would be normally. In short, you'll tend to choose the prediction strategies that give self-fulfilling predictions which fulfill in maximally surprising ways relative to the other prediction strategies you are considering.

Oh god...
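The entrenchment dynamic in this summary can be shown with a toy simulation (my own construction, not from the post): candidate strategies are scored on outcomes generated while one of them is deployed and its predictions are announced, so the deployed strategy shapes the very data on which its rivals are judged.

```python
import random
random.seed(0)

OUTCOMES = list(range(10))  # the chaotic, un-influenced world

def rollout(deployed, trials=1000, conformity=0.8):
    """Outcomes generated while `deployed` announces predictions: with
    probability `conformity` the world follows the announcement,
    otherwise it behaves chaotically."""
    data = []
    for _ in range(trials):
        announced = deployed()
        if random.random() < conformity:
            data.append(announced)
        else:
            data.append(random.choice(OUTCOMES))
    return data

def accuracy(strategy, outcomes):
    return sum(strategy() == o for o in outcomes) / len(outcomes)

def fixed():    # a "surprising" but self-fulfilling constant prediction
    return 7

def honest():   # models the un-influenced world faithfully
    return random.choice(OUTCOMES)

outcomes = rollout(fixed)
# The deployed strategy shaped the data it is scored on, so it looks great,
# while the honest strategy looks bad on that same data:
assert accuracy(fixed, outcomes) > 0.7
assert accuracy(honest, outcomes) < 0.2
```

Selection by realized accuracy would keep `fixed` and discard `honest`, even though `honest` is the better model of the world absent the feedback channel; that asymmetry is the "establishes itself" effect in the summary above.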

I'm actually trying to be somewhat agnostic about the right conclusion here. I could have easily added another chapter discussing why the maximizing-surprise idea is not quite right. The moral is that the questions are quite complicated, and thinking vaguely about 'optimization processes' is quite far from adequate to understand this. Furthermore, it'll depend quite a bit on the actual details of a training procedure!
