All of Johannes Treutlein's Comments + Replies

Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.

Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representati... (read more)

2Richard Ngo5d
So I'm imagining the agent doing reasoning like: Misaligned goal --> I should get high reward --> Behavior aligned with reward function and then I'm hypothesizing that the whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false though depending on the details of how terminal and instrumental goals are implemented.) I could also imagine something more like: Misaligned goal --> I should behave in aligned ways --> Aligned behavior and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option. Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: "I should get high reward" and "I should behave in aligned ways", and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I'll have a post on that topic up soon).  

Thanks for your comment!

Regarding 1: I don't think it would be good to simulate superintelligences with our predictive models. Rather, we want to simulate humans to elicit safe capabilities. We talk more about competitiveness of the approach in Section III.

Regarding 3: I agree it might have been good to discuss cyborgism specifically. I think cyborgism is to some degree compatible with careful conditioning. One possible issue when interacting with the model arises when the model is trained on / prompted with its own outputs, or data that has been influence... (read more)

You are right, thanks for the comment! Fixed it now.

I like the idea behind this experiment, but I find it hard to tell from this write-up what is actually going on. I.e., what is exactly the training setup, what is exactly the model, which parts are hard-coded and which parts are learned? Why is it a weirdo janky thing instead of some other standard model or algorithm? It would be good if this was explained more in the post (it is very effortful to try to piece this together by going through the code). Right now I have a hard time making any inferences from the results.

Update: we recently discovered the performative prediction (Perdomo et al., 2020) literature (HT Alex Pan). This is a machine learning setting where we choose a model parameter (e.g., parameters for a neural network) that minimizes expected loss (e.g., classification error). In performative prediction, the distribution over data points can depend on the choice of model parameter. Our setting is thus a special case in which the parameter of interest is a probability distribution, the loss is a scoring function, and data points are discrete outcomes. Most re... (read more)

I think there should be a space both for in-progress research dumps and for more worked out final research reports on the forum. Maybe it would make sense to have separate categories for them or so.

I'm not sure I understand what you mean by a skill-free scoring rule. Can you elaborate what you have in mind?

2Matthew "Vaniver" Gray3mo
Sure, points from a scoring rule come both from 'skill' (whether or not you're accurate in your estimates) and 'calibration' (whether your estimates line up with the underlying propensity). Rather than generating the picture I'm thinking of (sorry, up to something else and so just writing a quick comment), I'll describe it: watch this animation [], and see the implied maximum expected score as a function of p (the forecaster's true belief). For all of the scoring rules, it's a convex function with maxima at 0 and 1. (You can get 1 point on average with a linear rule if p=0, and only 0.5 points on average if p=0.5; for a log rule, it's 0 points and -0.7 points.) But could you come up with a scoring rule where the maximum expected score as a function of p is flat? If true, there's no longer an incentive to have extreme probabilities. But that incentive was doing useful work before, and so this seems likely to break something else--it's probably no longer the case that you're incentivized to say your true belief--or require something like batch statistics (since I think you might be able to get something like this by scoring not individual predictions but sets of them, sorted by p or by whether they were true or false). [This can be done in some contexts with markets, where your reward depends on how close the market was to the truth before, but I think it probably doesn't help here because we're worried about the oracle's ability to affect the underlying reality, which is also an issue with prediction markets!] To be clear, I'm not at all confident this is possible or sensible--it seems likely to me that an adversarial argument goes thru where as oracle I always benefit from knowing which statements are true and which statements are false (even if I then lie about my beliefs to get a good calibration curve or w/e)--but that's not an argument about the scale of the distortions that are possible. 

Thanks for your comment!

Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when is constant, then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show for any strictly proper scoring rule that there is a function such that a dishonest prediction is optimal.

Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to h... (read more)

2Matthew "Vaniver" Gray3mo
Agreed for proper scoring rules, but I'd be a little surprised if it's not possible to make a skill-free scoring rule, and then get a honest prediction result for that. [This runs into other issues--if the scoring rule is skill-free, where does the skill come from?--but I think this can be solved by having oracle-mode and observation-mode, and being able to do honest oracle-mode at all would be nice.]

I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape.

E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. ... (read more)

There is a chance that one can avoid having to solve ontology identification in general if one punts the problem to simulated humans. I.e., it seems one can train the human simulator without solving it, and then use simulated humans to solve the problem. One may have to solve some specific ontology identification problems to make sure one gets an actual human simulator and not e.g. a malign AI simulator. However, this might be easier than solving the problem in full generality.

Minor comment: regarding the RLHF example, one could solve the problem implicitl... (read more)

Great post!

I like that you point out that we'd normally do trial and error, but that this might not work with AI. I think you could possibly make clearer where this fails in your story. You do point out how HLMI might become extremely widespread and how it might replace most human work. Right now it seems to me like you argue essentially that the problem is a large-scale accident that comes from a distribution shift. But this doesn't yet say why we couldn't e.g. just continue trial-and-error and correct the AI once we notice that something is going wrong.&... (read more)

2Leon Lang6mo
Yes, after reflection I think this is correct. I think I had in mind a situation where with deployment, the training of the AI system simply stops, but of course, this need not be the case. So if training continues, then one either needs to argue stronger reasons why the distribution shift leads to a catastrophe (e.g., along the lines you argue) or make the case that the training signal couldn't keep up with the fast pace of the development. The latter would be an outer alignment failure, which I tried to avoid talking about in the text. 

Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well. E.g., solutions to capability generalization vs solutions to goal generalization.)

I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would... (read more)

I like this post and agree that there are different threat models one might categorize broadly under "inner alignment". Before reading this I hadn't reflected on the relationship between them.

Some random thoughts (after an in-person discussion with Erik):

  • For distributional shift and deception, there is a question of what is treated as fixed and what is varied when asking whether a certain agent has a certain property. E.g., I could keep the agent constant but put it into a new environment, and ask whether it is still aligned. Or I could keep the environmen
... (read more)
2Erik Jenner6mo
Thanks for the comments! I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.

Great post!

Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline? Of course, competitiveness is an issue in general, but the more factored cognition or IDA based approaches seem more realistic to me.

Alternatively, we can try to be clever and “import” research from the future repeatedly. For instance we can first ask our model to produce r

... (read more)
2Adam Jermyn8mo
Thanks! I'm not sure. My sense is that generative models have a huge lead in terms of general capabilities over ~everything else, and that seems to be where the most effort is going today. So unless something changes there I expect generative models to be the state of the art when we hit x-risk territory. That said, it's totally possible that the x-risk-causing generative model happens before the model that can simulate thousands of years of history. I'm not confident in this either way. One thing that gives me hope in favor of simulating long histories is that to some extent it's "just" a matter of more compute, and if we get promising results simulating short spans of history it might not be hard to justify a lot of spending on simulating longer stretches. And there's a bright spot there too: simulating longer times likely scales sub-linearly with amount of history simulated. If you have a dynamics model then simulating for twice as long costs double the compute. If you've got a more clever model that knows how to take shortcuts/compress the dynamics you can probably do better. I'm pretty concerned about this. I said a bit about this in the "No Fixed Points" section, but basically I think you have to do something to avoid fixed points, otherwise you get all sorts of world-ending optimization pressures. If you do that, you're not allowed any recursion where the model simulates itself, and then you get stuck with the problem of how to introduce future research into the past without making a malicious AGI the most likely explanation...

These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models.

Your post seems to be focused more on pointing out a missing piece in the literature rather than asking for a solution to the specific problem (which I believe is a valuable contribution). Regardless, here is roughly how I would understand “what they mean”:

Let  be the task space,  the output space,  the model space,  our base objective, and  the mesa objective of the model for input . Assume that there exists some map  mapping internal objects to outputs by the model,... (read more)

1Johannes Treutlein8mo
These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models [].

Thank you!

It does seem like simulating text generated by using similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get “contaminated” at some point, and models might seize to be helpful without updating them on the newest research.

In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models' outputs before reasoning about superrationality, so it would turn things into a version of Newcomb's problem with transpa... (read more)

2Adam Jermyn8mo
Oh interesting. I think this still runs into the issue that you'll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue). I was imagining that we train the model to predict e.g. tomorrow's newspaper given today's. The fact that it's not just a stream of text but comes with time-stamps (e.g. this was written X hours later) feels important for making it simulate actual histories.

Thanks for your comment! I agree that we probably won't be able to get a textbook from the future just by prompting a language model trained on human-generated texts.

As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, one could make it work. I do think sometimes it would be beneficial for a model to attain superhuman reasoning skills, even if it is only modeling human-written text. Though of course, this might still not happen in practice.

Overall ... (read more)

Would you count issues with malign priors etc. also as issues with myopia? Maybe I'm missing something about what myopia is supposed to mean and be useful for, but these issues seem to have a similar spirit of making an agent do stuff that is motivated by concerns about things happening at different times, in different locations, etc.

E.g., a bad agent could simulate 1000 copies of the LCDT agent and reward it for a particular action favored by the bad agent. Then depending on the anthropic beliefs of the LCDT agent, it might behave so as to maximize this r... (read more)

If someone had a strategy that took two years, they would have to over-bid in the first year, taking a loss. But then they have to under-bid on the second year if they're going to make a profit, and--"

"And they get undercut, because someone figures them out."

I think one could imagine scenarios where the first trader can use their influence in the first year to make sure they are not undercut in the second year, analogous to the prediction market example. For instance, the trader could install some kind of encryption in the software that this company use... (read more)

I find this particularly curious since naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization more likely and thus also deceptive behavior? Maybe the argument is that somehow weight sharing leaves less wiggle room for obscuring one's reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still be encoding deceptive reasoning in an uninterpretable way?

1Johannes Treutlein9mo
I find this particularly curious since naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization more likely and thus also deceptive behavior? Maybe the argument is that somehow weight sharing leaves less wiggle room for obscuring one's reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still be encoding deceptive reasoning in an uninterpretable way?

Wolfgang Spohn develops the concept of a "dependency equilibrium" based on a similar notion of evidential best response (Spohn 2007, 2010). A joint probability distribution is a dependency equilibrium if all actions of all players that have positive probability are evidential best responses. In case there are actions with zero probability, one evaluates a sequence of joint probability distributions such that and for all actions and . Using your notation of a probability matrix and a utility matrix, the expected utili

... (read more)

Thanks for your answer! This "gain" approach seems quite similar to what Wedgwood (2013) has proposed as "Benchmark Theory", which behaves like CDT in cases with, but more like EDT in cases without causally dominant actions. My hunch would be that one might be able to construct a series of thought-experiments in which such a theory violates transitivity of preference, as demonstrated by Ahmed (2012).

I don't understand how you arrive at a gain of 0 for not smoking as a smoke-lover in my example. I would think the gain for not smoking is higher:

... (read more)
0Abram Demski6y
Ah, you're right. So gain doesn't achieve as much as I thought it did. Thanks for the references, though. I think the idea is also similar in spirit to a proposal of Jeffrey's in him book The Logic of Decision; he presents an evidential theory, but is as troubled by cooperating in prisoner's dilemma and one-boxing in Newcomb's problem as other decision theorists. So, he suggests that a rational agent should prefer actions such that, having updated on probably taking that action rather than another, you still prefer that action. (I don't remember what he proposed for cases when no such action is available.) This has a similar structure of first updating on a potential action and then checking how alternatives look from that position.

From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT

I agree with this.

It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.

I’d also be interested in finding such a problem.

I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a

... (read more)
I think that in that case, the agent shouldn't smoke, and CDT is right, although there is side-channel information that can be used to come to the conclusion that the agent should smoke. Here's a reframing of the provided payoff matrix that makes this argument clearer. (also, your problem as stated should have 0 utility for a nonsmoker imagining the situation where they smoke and get killed) Let's say that there is a kingdom which contains two types of people, good people and evil people, and a person doesn't necessarily know which type they are. There is a magical sword enchanted with a heavenly aura, and if a good person wields the sword, it will guide them do heroic things, for +10 utility (according to a good person) and 0 utility (according to a bad person). However, if an evil person wields the sword, it will afflict them for the rest of their life with extreme itchiness, for -100 utility (according to everyone). good person's utility estimates: * takes sword * I'm good: 10 * I'm evil: -90 * don't take sword: 0 evil person's utility estimates: * takes sword * I'm good: 0 * I'm evil: -100 * don't take sword: 0 As you can clearly see, this is the exact same payoff matrix as the previous example. However, now it's clear that if a (secretly good) CDT agent believes that most of society is evil, then it's a bad idea to pick up the sword, because the agent is probably evil (according to the info they have) and will be tormented with itchiness for the rest of their life, and if it believes that most of society is good, then it's a good idea to pick up the sword. Further, this situation is intuitively clear enough to argue that CDT just straight-up gets the right answer in this case. A human (with some degree of introspective power) in this case, could correctly reason "oh hey I just got a little warm fuzzy feeling upon thinking of the hypothetical where I wield the sword and it doesn't curse me. This is evidence that I'm good,
0Abram Demski6y
Excellent example. It seems to me, intuitively, that we should be able to get both the CDT feature of not thinking we can control our utility function through our actions and the EDT feature of taking the information into account. Here's a somewhat contrived decision theory which I think captures both effects. It only makes sense for binary decisions. First, for each action you compute the posterior probability of the causal parents for each decision. So, depending on details of the setup, smoking tells you that you're likely to be a smoke-lover, and refusing to smoke tells you that you're more likely to be a non-smoke-lover. Then, for each action, you take the action with best "gain": the amount better you do in comparison to the other action keeping the parent probabilities the same: Gain(a)=E(U|a)−E(U|a,do(¯a)) (E(U|a,do(¯a)) stands for the expectation on utility which you get by first Bayes-conditioning on a, then causal-conditioning on its opposite.) The idea is that you only want to compare each action to the relevant alternative. If you were to smoke, it means that you're probably a smoker; you will likely be killed, but the relevant alternative is one where you're also killed. In my scenario, the gain of smoking is +10. On the other hand, if you decide not to smoke, you're probably not a smoker. That means the relevant alternative is smoking without being killed. In my scenario, the smoke-lover computes the gain of this action as -10. Therefore, the smoke-lover smokes. (This only really shows the consistency of an equilibrium where the smoke-lover smokes -- my argument contains unjustified assumption that smoking is good evidence for being a smoke lover and refusing to smoke is good evidence for not being one, which is only justified in a circular way by the conclusion.) In your scenario, the smoke-lover computes the gain of smoking at +10, and the gain of not smoking at 0. So, again, the smoke-lover smokes. The solution seems too ad-hoc to really

Thanks for the link! What I don't understand is how this works in the context of empirical and logical uncertainty. Also, it's unclear to me how this approach relates to Bayesian conditioning. E.g. if the sentence "if a holds, than o holds" is true, doesn't this also mean that P(o|a)=1? In that sense, proof-based UDT would just be an elaborate specification of how to assign these conditional probabilities "from the viewpoint of the original position", so with updatelessness, and in the context of full logical inference and knowledge of ... (read more)

1Vladimir Slepnev6y
To me, proof-based UDT is a simple framework that includes probabilistic/Bayesian reasoning as a special case. For example, if the world is deterministic except for a single coinflip, you specify a preference ordering on pairs of outcomes of two deterministic worlds. Fairness or non-fairness of the coinflip will be encoded into the ordering, so the decision can be based on completely deterministic reasoning. All probabilistic situations can be recast in this way. That's what UDT folks mean by "probability as caring". It's really cool that UDT lets you encode any setup with probability, prediction, precommitment etc. into a few (complicated and self-referential) sentences in PA [] or GL [] that are guaranteed to have truth values. And since GL is decidable, you can even write a program that will solve all such problems for you.

Thanks a lot for your elaborate reply!

(So I'm not even sure what CDT is supposed to do here, since it's not clear that the bet is really on the past state of the world and not on truth of a proposition about the future state of the world.)

Hmm, good point. The truth of the proposition is evaluated on basis of Alice's action, which she can causally influence. But we could think of a Newcomblike scenario in which someone made a perfect prediction a 100 years ago and put down a note about what state the world was in at that time. Now instead of checking Al... (read more)

That's what I was trying to do with the Coin Flip Creation :) My guess: once you specify the Smoking Lesion and make it unambiguous, it ceases to be an argument against EDT.

What exactly do you think we need to specify in the Smoking Lesion?

I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor.

Fair point :) What I meant was that for every world history, there is only one causal influence I could possibly h... (read more)

I agree with points 1) and 2). Regarding point 3), that's interesting! Do you think one could also prove that if you don't smoke, you can't (or are less likely to) have the gene in the Smoking Lesion? (See also my response to Vladimir Nesov's comment.)

1Vladimir Slepnev6y
I can only give a clear-cut answer if you reformulate the smoking lesion problem in terms of Omega and specify the UDT agent's egoism or altruism :-)

The point of decision theories is not that they let you reach from beyond the Matrix and change reality in violation of physics; it's that you predictably act in ways that optimize for various criteria.

I agree with this. But I would argue that causal counterfactuals somehow assume that we can "reach from beyond the Matrix and change reality in violation of physics". They work by comparing what would happen if we detached our “action node” from its ancestor nodes and manipulated it in different ways. So causal thinking in some way seems to viol... (read more)

1Matthew "Vaniver" Gray6y
I agree there's a point here that lots of decision theories / models of agents / etc. are dualistic instead of naturalistic, but I think that's orthogonal to EDT vs. CDT vs. LDT; all of them assume that you could decide to take any of the actions that are available to you. I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor. This depends on the formulation of Newcomb's problem. If it says "Omega predicts you with 99% accuracy" or "Omega always predicts you correctly" (because, say, Omega is Laplace's Demon), then Omega knew that you would learn about decision theory in the way that you did, and there's still a logical dependence between the you looking at the boxes in reality and the you looking at the boxes in Omega's imagination. (This assumes that the 99% fact is known of you in particular, rather than 99% accuracy being something true of humans in general; this gets rid of the case that 99% of the time people's decision theories don't change, but 1% of the time they do, and you might be in that camp.) If instead the formulation is "Omega observed the you of 10 years ago, and was able to determine whether or not you then would have one-boxed or two-boxed on traditional Newcomb's with perfect accuracy. The boxes just showed up now, and you have to decide whether to take one or both," then the logical dependence is shattered, and two-boxing becomes the correct move. If instead the formulation is "Omega observed the you of 10 years ago, and was able to determine whether or not you then would have one-boxed or two-boxed on this version of Newcomb's with perfect accuracy. The bo

Thanks for your comment! I find your line of reasoning in the ASP problem and the Coin Flip Creation plausible. So your point is that, in both cases, by choosing a decision algorithm, one also gets to choose where this algorithm is being instantiated? I would say that in the CFC, choosing the right action is sufficient, while in the ASP you also have to choose the whole UDP program so as to be instantiated in a beneficial way (similar to the distinction of how TDT iterates over acts and UDT iterates over policies).

Would you agree that the Coin Flip Creatio... (read more)

1Vladimir Nesov6y
To clarify, it's the algorithm itself that chooses how it behaves. So I'm not talking about how algorithm's instantiation depends on the way programmer chooses to write it, instead I'm talking about how algorithm's instantiation depends on the choices that the algorithm itself makes, where we are talking about a particular algorithm that's already written. Less mysteriously, the idea of algorithm's decisions influencing things describes a step in the algorithm, it's how the algorithm operates, by figuring out something we could call "how algorithm's decisions influence outcomes". The algorithm then takes that thing and does further computations that depend on it.