I had a conversation with Nate about different possible goal systems for agents, and I think some people will be interested in reading this summary.

Goal specification

I started by stating my skepticism about approaches to goal specification that rely on inspecting an AI's world model and identifying some concept in it (e.g. paperclips) to specify the goal in terms of. To me, this seems fairly doomed: it is difficult to imagine a kind of language for describing concepts, such that I could specify some concept I cared about (e.g. paperclips) in this language, and I could trust a system to correctly carry out a goal specified in terms of this concept. Even if we had a nicer theory of multi-level models, it still seems unlikely that this theory would match human concepts well enough that it would be possible to specify things we care about in this theory. See also Paul's comment on this subject and his post on unsupervised learning.

Nate responded that it seems like humans can learn a concept from fairly few examples. To the extent that we expect AIs to learn "natural categories", and we expect to be able to point at natural categories with a few examples or views of the concept, this might work.

Nate argued that corrigibility might be a natural concept, and one that is useful for specifying some proxy for what we care about. This is partially due to introspection on the concept of corrigibility ("knowing that you're flawed and that the goal you were given is not an accurate reflection of your purpose"), and partially due to the fact that superintelligences might want to build corrigible subagents.

This didn't seem completely implausible to me, but it didn't seem very likely that this would end up saving the goal-directed approach. Then we started getting into the details of alternative proposals that specify goals in terms of short-term predictions (specifically, human-imitation and other act-based approaches).

I argued that there's an important advantage to systems whose goals are grounded in short-term predictions: you can use a scheme like this to do something useful if you have a mixture of good and bad predictors, by testing these predictors against reality. There is no analogous way of testing e.g. good and bad paperclip concepts against reality, to see which one actually represents paperclips. Nate agreed that this is an advantage for grounding goals in prediction. In particular, he agreed that specifying goals in terms of human predictions will likely be the best idea for the first powerful AGIs, although he's less pessimistic than me about other approaches.
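As a rough illustration of what "testing predictors against reality" could look like, here is a minimal sketch of a Bayesian mixture over predictors. Everything in it is an assumption made for illustration: the names `update_weights`, `mixture_prediction`, `good_predictor`, `bad_predictor`, and `reality` are hypothetical, and the toy environment (an outcome that alternates every step) is not something from the conversation.

```python
def update_weights(weights, predictions, outcome):
    """Bayesian update: scale each predictor's weight by the probability
    it assigned to the outcome that actually happened, then renormalize."""
    likelihoods = [p if outcome else 1.0 - p for p in predictions]
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mixture_prediction(weights, predictions):
    """Probability the weighted mixture assigns to the outcome being True."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical toy setup: the outcome is True on even steps, False on odd ones.
def reality(t):
    return t % 2 == 0


def good_predictor(t):
    # Assumed to track the pattern, with some residual uncertainty.
    return 0.9 if t % 2 == 0 else 0.1


def bad_predictor(t):
    # Assumed to be uninformative.
    return 0.5


def run(steps=20):
    """Score both predictors against reality and return the final weights."""
    predictors = [good_predictor, bad_predictor]
    weights = [0.5, 0.5]
    for t in range(steps):
        predictions = [predict(t) for predict in predictors]
        weights = update_weights(weights, predictions, reality(t))
    return weights
```

After a few observations, nearly all of the weight ends up on the predictor that matches reality. The point of the sketch is only that short-term predictions come with a feedback signal of this kind; there is no corresponding `reality(t)` to score a candidate paperclip concept against.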

Nate pointed out some problems with systems based on powerful predictors. If a predictor can predict a system containing consequentialists (e.g. a human in a room), then it is using some kind of consequentialist machinery internally to make these predictions. For example, it might be modelling the human as an approximately rational consequentialist agent. This presents some problems. If the predictor simulates consequentialist agents in enough detail, then these agents might try to break out of the system. Presumably, we would want to know that these consequentialists are safe. It's possible that the scheme for handling predictors works for preventing these consequentialists from gaining much power, but a "defense in depth" approach would involve understanding these consequentialists better. Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place.

In particular, at least one of the consequentialists in the world model must represent a human for the predictor to make accurate predictions of humans. It's substantially easier to specify a class of models that contains a good approximation to a human (which might be all you need for human-prediction approaches) than to specify a good approximation to a human, but it still seems difficult either way. It's possible that a better understanding of consequentialism will lead to better models for human-prediction (although at the moment, this seems like a fairly weak reason to study consequentialism to me).

Logic optimizers

We also talked about the idea of a "logic optimizer". This is a hypothetical agent that is given a description of the environment it is in (as a computer program) and optimizes this environment according to some easily-defined objective (similar to modal UDT). One target might be a "naturalized AIXI", which in some sense does this job almost as well as any simple Turing machine. This should be an asymptotic solution that works well in an environment larger than it, as both it and the environment become very large.

I was skeptical that this research path gives us what we want. The things we actually care about can't be expressed easily in terms of physics or logic. Nate predicted that, if he understood how to build a naturalized AIXI, then this would make some other things less confusing. He would have more ideas for what to do after finding this: perhaps making the system more efficient, or extending it to optimize higher-level aspects of physics/logic.

It seems to me that the place where you would actually use a logic optimizer is not to optimize real-world physics, but to optimize the internal organization of the AI. Since the AI's internal organization is defined as a computer program, it is fairly easy to specify goals related to the internal organization in a format suitable for a logic optimizer (e.g. specifying the goal of maximizing a given mathematical function). This seems identical to the idea of "platonic goals". It's possible that the insights from understanding logic optimizers might generalize to more real-world goals, but I find internal organization to be the most compelling concrete application.

Paul has also written about using consequentialism for the internal organization of an AI system. He argues that, when you're using consequentialism to e.g. optimize a mathematical function, even very bad theoretical targets for what this means seem fine. I partially agree with this: it seems like there is much more error tolerance for badly optimizing a mathematical function, versus badly optimizing the universe. In particular, if you have a set of function optimizers that contains a good function optimizer, then you can easily combine these function optimizers into a single good function optimizer (just take the argmax over their outputs). The main danger is if all of your best "function optimizers" actually care about the real world, because you didn't know how to build one that only cares about the internal objective.
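The "argmax over their outputs" construction can be sketched in a few lines. This is a toy under stated assumptions: `combined_optimize`, `objective`, `grid_search`, `always_zero`, and `far_off` are hypothetical names, and each optimizer is assumed to be a plain function that takes the objective and returns a candidate input.

```python
def combined_optimize(optimizers, objective):
    """Run every candidate optimizer and keep whichever proposal
    scores highest under the objective (argmax over their outputs)."""
    proposals = [optimize(objective) for optimize in optimizers]
    return max(proposals, key=objective)


# Hypothetical objective: a concave function maximized at x = 3.
def objective(x):
    return -(x - 3.0) ** 2


# Hypothetical optimizers of varying quality.
def grid_search(f):
    # Searches the grid -10.0, -9.9, ..., 10.0 and returns the best point.
    return max((i / 10 for i in range(-100, 101)), key=f)


def always_zero(f):
    # Ignores the objective entirely.
    return 0.0


def far_off(f):
    # Returns a badly wrong answer.
    return 100.0
```

Calling `combined_optimize([always_zero, far_off, grid_search], objective)` returns the grid search's proposal, so the combination is as good as the best member of the set. Note that the argmax only filters the proposals by their score; it does nothing about side effects of running an optimizer that cares about the real world, which is the danger described above.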

Paul is skeptical that a better theoretical formulation of rational agency would actually help to design more effective and understandable internal optimizers (e.g. function optimizers). It seems likely that we'll be stuck with analyzing the algorithms that end up working, rather than designing algorithms according to theoretical targets.

I talked to Nate about this and he was more optimistic about getting useful internal optimizers if we know how to solve logic optimization problems using a hypercomputer (in an asymptotic way that works when the agent is smaller than the environment). He was skeptical about ways of "solving" the problem without being able to accomplish this seemingly easier goal.

I'm not sure what to think about how useful theory is. The most obvious parallel is to look at formalisms like Solomonoff induction and AIXI, and see if those have helped to make current machine learning systems more principled. I don't have a great idea of what most important AI researchers think of AIXI, but I think it's helped me to understand what some machine learning systems are actually doing. Some of the people who worked with these theoretical formalisms (Juergen Schmidhuber, Shane Legg, perhaps others?) went on to make advances in deep learning, which seems like an example of using a principled theory to understand a less-principled algorithm better. It's important to disentangle "understanding AIXI helped these people make deep learning advances" from "more competent researchers are more drawn to AIXI", but I would still guess that studying AIXI helped them. Another problem with this analogy is that, if naturalized AIXI is the right paradigm in a way that AIXI isn't, then it is more likely to yield practical algorithms than AIXI is.

Roughly, if naturalized AIXI is a comparable theoretical advance to Solomonoff induction/AIXI (which seems likely), then I am somewhat optimistic about it making future AI systems more principled.

Conclusion and research priorities

My concrete takeaways are:

  1. Specifying real-world goals in a way that doesn't reduce to short-term human prediction doesn't seem promising for now. New insights might make this problem look easier, but this doesn't seem very likely to me.
  2. To the extent that we expect powerful systems to need to use consequentialist reasoning to organize their internals, and to the extent that we can make theoretical progress on the problem, it seems worth working on a "naturalized AIXI". It looks like a long shot, but it seems reasonable to at least gather information about how easy it is for us to make progress on it by trying to solve it.

In the near future, I think I'll split my time between (a) work related to act-based systems (roughly following Paul's recommended research agenda), and (b) work related to logic optimizers, with emphasis on using these for the internal organization of the AI (rather than goals related to the real world). Possibly, some work will be relevant to both of these projects. I'll probably change my research priorities if any of a few things happens:

  1. goals related to the external world start seeming less doomed to me
  2. the act-based approach starts seeming more doomed to me
  3. the "naturalized AIXI" approach starts seeming more or less useful/tractable
  4. I find useful things to do that don't seem relevant to either of these two projects

"Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place."

I've had this conversation with Nate before, and I don't understand why I should think it's true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?

Here is my understanding of the argument:

(Warning, very long post, partly thinking out loud. But I endorse the summary. I would be most interested in Eliezer's response.)

  • Something vaguely "consequentialist" is an important part of how humans reason about hard cognitive problems of all kinds (e.g. we must decide what cognitive strategy to use, what to focus our attention on, what topics to explore and which to ignore).
  • It's not clear what prediction problems require this kind of consequentialism and what kinds of prediction problems can be solved directly by a brute force search for predictors. (I think Ilya has suggested that the cutoff is something like "anything a human can do in 100ms, you can train directly.")
  • However, the behavior of an intelligent agent is in some sense a "universal" example of a hard-to-predict-without-consequentialism phenomenon.
  • If someone claims to have a solution that "just" requires a predictor, then they haven't necessarily reduced the complexity of the problem, given that a good predictor depends on something consequentialist. If the predictor only needs to apply in some domain, then maybe the domain is easy and you can attack it more directly. But if that domain includes predicting intelligent agents, then it's obviously not easy.
  • Actually building an agent that solves these hard prediction problems will probably require building some kind of consequentialism. So it offers just as much opportunity to kill yourself as the general AI problem.
  • And if you don't explicitly build in consequentialism, then you've just made the situation even worse. There is still probably consequentialism somewhere inside your model, you just don't even understand how it works because it was produced by a brute force search.

I think that this argument is mostly right. I also think that many thoughtful ML researchers would agree with the substantive claims, though they might disagree about language. We aren't going to be able to directly train a simple model to solve all of the cognitive problems a human can solve, but there is a real hope that we could train a simple model to control computational machinery in a way that solves hard cognitive problems. And those policies will be "consequentialist" in the sense that their behavior is optimized to achieve a desired consequence. (An NTM is a simple mostly theoretical example of this; there are more practical instantiations as well, and moreover I think it is clear that you can't actually use full differentiability forever and at least some of the system is going to have to be trained by RL.)

I get off the boat once we start drawing inferences about what AI control research should look like---at this point I think Eliezer's argument becomes quite weak.

If Eliezer or Nate were to lay out a precise argument I think it would be easy to find the precise point where I object. Unfortunately no one is really in the position to be making precise arguments, so everything is going to be a bit blurrier. But here are some of the observations that seem to lead me to a very different conclusion:


Many decision theories, priors, etc. are reflectively consistent. Eliezer imagines an agent which uses the "right" settings in the long run because it started out with the right settings (or some as-yet-unknown framework for handling its uncertainty) and so stuck with them. I imagine an agent which uses the "right" settings in the long run because it defers to humans, and which may in the short term use incorrect decision theory/priors/etc. This is a central advantage of the act-based approach, and in my view no one has really offered a strong response.

The most natural response would be that using a wrong decision theory/prior/etc. in the short term would lead to bad outcomes, even if one had appropriate deference to humans. The strongest version of this argument goes something like "we've encountered some surprises in the past, like the simulation argument, blackmail by future superintelligences, some weird stuff with aliens etc., Pascal's mugging, and it's hard to know that we won't encounter more surprises unless we figure out many of these philosophical issues."

I think this argument has some merit, but these issues seem to be completely orthogonal to the development of AI (humans might mess these things up about as well as an act-based AI), and so they should be evaluated separately. I think they look a lot less urgent than AI control---I think the only way you end up with MIRI's level of interest is if you see our decisions about AI as involving a long-term commitment.


I think that Eliezer at least does not yet understand, or has not yet thought deeply about, the situation where we use RL to train agents how to think. He repeatedly makes remarks about how AI control research targeted at deep learning will not generalize to extremely powerful AI systems, while consistently avoiding engagement with the most plausible scenario where deep learning is a central ingredient of powerful AI.


There appears to be a serious methodological disagreement about how AI control research should work.

For existing RL systems, the alignment problem is open. Without solving this problem, it is hard to see how we could build an aligned system which used existing techniques in any substantive way.

Future AI systems may involve new AI techniques that present new difficulties.

I think that we should first resolve, or try our best to resolve, the difficulties posed by existing techniques---whether or not we believe that new techniques will emerge. Once we resolve that problem, we can think about how new techniques will complicate the alignment problem, and try to produce new solutions that will scale to accommodate a wider range of future developments.

Part of my view is that it is much easier to work on problems for which we have a concrete model. Another part is that our work on the alignment problem matters radically more if AI is developed soon. There are a bunch of other issues at play; I discuss a subset here.

I think that Eliezer's view is something like: we know that future techniques will introduce some qualitatively new difficulties, and those are most likely to be the real big ones. If we understand how to handle those difficulties, then we will be in a radically better position with respect to value alignment. And if we don't, we are screwed. So we should focus on those difficulties.

Eliezer also believes that the alignment problem is most likely to be very difficult or impossible for systems of the kind that we currently build, such that some new AI techniques are necessary before anyone can build an aligned AI, and such that it is particularly futile to try to solve the alignment problem for existing techniques.

Thanks, Paul -- I missed this response earlier, and I think you've pointed out some of the major disagreements here.

I agree that there's something somewhat consequentialist going on during all kinds of complex computation. I'm skeptical that we need better decision theory to do this reliably -- are there reasons or intuition-pumps you know of that have a bearing on this?

I mentioned two (which I don't find persuasive):

  1. Different decision theories / priors / etc. are reflectively consistent, so you may want to make sure to choose the right ones the first time. (I think that the act-based approach basically avoids this.)
  2. We have encountered some surprising possible failure modes, like blackmail by distant superintelligences, and might be concerned that we will run into new surprises if we don't understand consequentialism well.

I guess there is one more:

  1. If we want to understand what our agents are doing, we need to have a pretty good understanding of how effective decision-making ought to work. Otherwise algorithms whose consequentialism we understand will tend to be beaten out by algorithms whose consequentialism we don't understand. This may make alignment way harder.

Here's the argument as I understand it (paraphrasing Nate). If we have a system predict a human making plans, then we need some story for why it can do this effectively. One story is that, like Solomonoff induction, it's learning a physical model of the human and simulating the human this way. However, in practice, this is unlikely to be the reason an actual prediction engine predicts humans well (it's too computationally difficult).

So we need some other argument for why the predictor might work. Here's one argument: perhaps it's looking at a human making plans, figuring out what humans are planning towards, and using its own planning capabilities (towards the same goal) to predict what plans the human will make. But it seems like, to be confident that this will work, you need to have some understanding of how the predictor's planning capabilities work. In particular, humans trying to study correct planning run into some theoretical problems including decision theory, and it seems like a system would need to answer some of these same questions in order to predict humans well.

I'm not sure what to think of this argument. Paul's current proposal contains reinforcement learning agents who plan towards an objective defined by a more powerful agent, so it is leaning on the reinforcement learner's ability to plan towards desirable goals. Rather than understand how the reinforcement learner works internally, Paul proposes giving the reinforcement learner a good enough objective (defined by the powerful agent) such that optimizing this objective is equivalent to optimizing what the humans want. This raises some problems, so probably some additional ingredient is necessary. I suspect I'll have better opinions on this after thinking about the informed oversight problem some more.

I also asked Nate about the analogy between computer vision and learning to predict a human making plans. It seems like computer vision is an easier problem for a few reasons: it doesn't require serial thought (so it can be done by e.g. a neural network with a fixed number of layers), humans solve the problem using something similar to neural networks anyway, and planning towards the wrong goal is much more dangerous than recognizing objects incorrectly.

Thanks, Jessica. This argument still doesn't seem right to me -- let me try to explain why.

It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, NTMs, etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the world, and it seems like most of my behaviors are approximately predictable using a model of me that falls far short of modeling my full brain. This makes me think that an AI won't need to have hand-made planning faculties to learn to predict planners (human or otherwise), any more than it'll need weather faculties to predict weather or physics faculties to predict physical systems. Does that make sense?

(I think the analogy to computer vision points toward the learnability of planning; humans use neural nets to plan, after all!)

It seems like an important part of how humans make plans is that we use some computations to decide what other computations are worth performing. Roughly, we use shallow pattern recognition on a question to determine what strategy to use to think further thoughts, and after thinking those thoughts use shallow pattern recognition to figure out what thought to have after that, eventually leading to answering the question. (I expect the brain's actual algorithm to be much more complicated than this simple model, but to share some aspects of it).

A system predicting what a human would do would presumably also have to figure out which further thoughts are worth thinking, upon being asked to predict how a human answers a question. For example, if I'm answering a complex math question that I have to break into parts to solve it, then for the system to predict my (presumably correct) answer, it might also break the problem into pieces and solve each piece. If it's bad at determining which thoughts are worth thinking to predict the human's answer (e.g. it chooses to break the problem into unhelpful pieces), then it will think thoughts that are not very useful for predicting the answer, so it will not be very effective without a huge amount of hardware. I think this is clear when the human is thinking for a long time (e.g. 2 weeks) and less clear for much shorter time periods (e.g. 1 minute, which you might be able to do with shallow pattern recognition in some cases?).

At the point where the system is able to figure out what thoughts to think in order to predict the human well, its planning to determine which thoughts to think looks at least as competent as a human's planning to answer the question, without necessarily using similar intermediate steps in the plan.

It seems like ordinary neural nets can't decide what to think about (they can only recognize shallow patterns), and perhaps NTMs can. But if an NTM could predict how I answer some questions well (because it's able to plan out what thoughts to think), I would be scared to ask it to predict my answer to future questions. It seems to be a competent planner, and not one that internally looks like my own thinking or anything I could easily understand. I see the internal approval-direction approach as trying to make systems whose internal planning looks more like planning understood by humans (by supervising the intermediate steps of planning); without internal supervision, we would be running a system capable of making complex plans in a way humans do not understand, which seems dangerous. As an example of a potential problem (not necessarily the most likely one), perhaps the system is very good at planning towards objectives but mediocre at figuring out what objective humans are planning towards, so it predicts plans well during training but occasionally outputs plans optimized for the wrong objective during testing.

It seems likely that very competent physics or weather predictions would also require at least some primitive form of planning what thoughts to think (e.g. maybe the system decides to actually simulate the clouds in an important region). But it seems like you can get a decent performance on these tasks with only primitive planning, whereas I don't expect this for e.g. predicting a human doing novel research over >1-hour timescales.

Did this help explain the argument better? (I still haven't thought about this argument enough to be that confident that it goes through, but I don't see any obvious problems at the moment).

I agree with paragraphs 1, 2, and 3. To recap, the question we're discussing is "do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?"

A couple of notes on paragraph 4:

  • I'm not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don't include built-in planning faculties.
  • You are bringing up understandability of an NTM-based human-decision-predictor. I think that's a fine thing to talk about, but it's different from the question we were talking about.
  • You're also bringing up the danger of consequentialist hypotheses hijacking the overall system. This is fine to talk about as well, but it is also different from the question we were talking about.

In paragraph 5, you seem to be proposing that to make any competent predictor, we'll need to understand planning. This is a broader assertion, and the argument in favor of it is different from the original argument ("predicting planners requires planning faculties so that you can emulate the planner" vs "predicting anything requires some amount of prioritization and decision-making"). In these cases, I'm more skeptical that a deep theoretical understanding of decision-making is important, but I'm open to talking about it -- it just seems different from the original question.

Overall, I feel like this response is out-of-scope for the current question -- does that make sense, or do I seem off-base?

Regarding paragraph 4:

I see more now what you're saying about NTMs. In some sense NTMs don't have "built-in" planning capabilities; to the extent that they plan well, it's because they learned that transition functions that make plans work better to predict some things. I think it's likely that you can get planning capabilities in this manner, without actually understanding how the planning works internally. So it seems like there isn't actually disagreement on this point (sorry for misinterpreting the question). The more controversial point is that you need to understand planning to train safe predictors of humans making plans.

I don't think I was bringing up consequentialist hypotheses hijacking the system in this paragraph. I was noting the danger of having a system (which is in some sense just trying to predict humans well) output a plan it thinks a human would produce after thinking a very long time, given that it is good at predicting plans toward an objective but bad at predicting the humans' objective.

Regarding paragraph 5: I was trying to say that you probably only need primitive planning abilities for a lot of prediction tasks, in some cases ones we already understand today. For example, you might use a deep neural net for deciding which weather simulations are worth running, and reinforce the deep neural net on the extent to which running the weather simulation changed the system's accuracy. This is probably sufficient for a lot of applications.

Thanks Jessica -- sorry I misunderstood about hijacking. A couple of questions:

  • Is there a difference between "safe" and "accurate" predictors? I'm now thinking that you're worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.

  • My feeling is that today's understanding of planning -- if I run this computation, I will get the result, and if I run it again, I'll get the same one -- is sufficient for harder prediction tasks. Are there particular aspects of planning that we don't yet understand well that you expect to be important for planning computation during prediction?

  • A very accurate predictor will be safe. A predictor that is somewhat accurate but not very accurate could be unsafe. So yes, I'm concerned that with a realistic amount of computing resources, NTMs might make dangerous partially-accurate predictions, even though they would make safe accurate predictions with a very large amount of computing resources. This seems like it will be true if the NTM is predicting the human's actions by trying to infer the human's goal and then outputting a plan towards this goal, though perhaps there are other strategies for efficiently predicting a human. (I think some of the things I said previously were confusing -- I said that it seems likely that an NTM can learn to plan well unsafely, which seems confusing since it can only be unsafe by making bad predictions. As an extreme example, perhaps the NTM essentially implements a consequentialist utility maximizer that decides what predictions to output; these predictions will be correct sometimes and incorrect whenever it is in the consequentialist utility maximizer's interest).

  • It seems like current understanding of planning is already running into bottlenecks -- e.g. see the people working on attention for neural networks. If the NTM is predicting a human making plans by inferring the human's goal and planning towards this goal, then there needs to be some story for what e.g. decision theory and logical uncertainty it's using in making these plans. For it to be the right decision theory, it must have this decision theory in its hypothesis space somewhere. In situations where the problem of planning out what computations to do to predict a human is as complicated as the human's actual planning, and the human's planning involves complex decision theory (e.g. the human is writing a paper on decision theory), this might be a problem. So you might need to understand some amount of decision theory / logical uncertainty to make this predictor.

(note that I'm not completely sold on this argument; I'm attempting to steelman it)

Thanks Jessica, I think we're on similar pages -- I'm also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.

If the NTMs get to look at the predictions of the other NTMs when making their own predictions (there’s probably a fixed-point way to do this), then maybe there’s one out there that copies one of the versions of 3 but makes adjustments for 3’s bad decision theory.

Why not say "If X is a model using a bad decision theory, there is a closely related model X' that uses a better decision theory and makes better predictions. So once we have some examples that distinguish the two cases, we will use X' rather than X."

Sometimes this kind of argument doesn't work and you can get tighter guarantees by considering the space of modifications (by coincidence this exact situation arises here), but I don't see why this case in particular would bring up that issue.

Suppose there are $N$ binary dimensions that predictors can vary on. Then we'd need $2^N$ predictors to cover every possibility. On the other hand, we would only need to consider $N$ possible modifications to a predictor. Of course, if the dimensions factor that nicely, then you can probably make enough assumptions about the hypothesis class that you can learn from the experts efficiently.
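
To make the counting concrete, here is a small sketch, assuming a predictor can be represented as a tuple of binary design choices (a hypothetical encoding chosen only for illustration).

```python
from itertools import product


def all_predictors(n):
    """Every predictor as a tuple of n binary design choices (2**n of them)."""
    return list(product([0, 1], repeat=n))


def single_flip_modifications(predictor):
    """The n predictors reachable by flipping exactly one dimension."""
    return [
        predictor[:i] + (1 - predictor[i],) + predictor[i + 1:]
        for i in range(len(predictor))
    ]


n = 10
num_predictors = len(all_predictors(n))                        # exponential in n
num_modifications = len(single_flip_modifications((0,) * n))   # linear in n
```

Covering the space with whole predictors costs exponentially many experts, while covering it with single-dimension modifications costs only linearly many.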

Overall it seems nicer to have a guarantee of the form "if there is a predictable bias in the predictions, then the system will correct this bias" rather than "if there is a strictly better predictor than a bad predictor, then the system will listen to the good predictor", since it allows capabilities to be distributed among predictors instead of needing to be concentrated in a single predictor. But maybe things work anyway for the reason you gave.

(The discussion seems to apply without modification to any predictor.)

It seems like "gets the wrong decision theory" is a really mild failure mode. If you can't cope with that, there is no way you are going to cope with actually malignant failures.

Maybe the designer wasn't counting on dealing with malignant failures at all, and this is an extra reminder that there can be subtle errors that don't manifest most of the time. But I don't think it's much of an argument for understanding philosophy in particular.

I agree with this.

My plan for solving value loading is:

  1. Formalize the meaning of "a value system for a physicalist agent" and "a physical agent with value system X."
  2. Use this formalism to specify a procedure that transforms a given physicalist agent into its (a priori unknown) value system. The agent can be specified either as a black box (observations of the agent) or a white box (source code of the agent) or some combination of both. It seems way too risky to rely on solving whole brain emulation as a precondition to value loading, so the ultimate solution is probably not going to be entirely white box (also white box approaches suffer from something I call the "Jekyll and Hyde" problem).

Obviously this approach directly relies on solving "naturalized AIXI" which IMO means formalising the updateless intelligence metric (with some substantial corrections like using matrix counterfactuals instead of averaging over subjective universe hypotheses and possibly incorporating these ideas in some way I don't yet understand) in the language of optimal predictors.

On the other hand, concerns about consequentialist agents breaking out can be mitigated using Christiano's cryptographic box.

I doubt that this plan will work (see this and this). The problem looks way too difficult; it requires not only modelling rational behavior, but also modelling human imperfections (I expect this to require a huge amount of work that looks more like psychology research than decision theory). I'd consider it if other approaches looked less promising, though.

One of the reasons this looks so hard is that there is no way to tell if your procedure for extracting the value system of an agent is wrong, since there is no feedback. You pretty much have to solve cognitive science on the first go (including solving decision theory and finding the "one true prior", both physical and logical). This is not required for short-term human predictions, which you can test against reality (so things are probably fine if your prior for human behavior is only somewhat wrong). While of course there could be flaws in the procedure for using these predictions (e.g. human-imitation or approval-direction), designing a reliable such procedure looks like a much easier problem to me than solving cognitive science.

It's possible that there's some more conservative solution to the easy goal inference problem that doesn't require committing to e.g. a particular prior or a particular model for human irrationality (maybe just a class of such priors/models), yet still gives you some amount of utility (in a minimax fashion). This seems promising in comparison to any solution that requires committing to such things in advance. It's plausible that research related to defining "a physicalist agent with value system X" would help here, though I'm not too optimistic.

The homomorphic encryption idea seems useful for being safe even when there are flaws in the virtual machine you run AIs in.

I dispute the claim that we necessarily need psychology research in order to solve goal inference. If you can specify the success criteria of such research then you can make these criteria part of your specification of the goal. This way you effectively delegate the psychology research to the AI.

Imagine we found a species of intelligent aliens, call them Zerg. Zerg have their own values and their own cognitive biases, dissimilar from humans. Suppose you decide to understand their values, given that you have easy access to them and they are fully cooperative in this endeavour. It seems likely that after gathering enough evidence and investing enough time analyzing it (imagine thousands of brilliant people working on it for many generations) you will eventually come to a satisfactory understanding.

Suppose you can repeat the procedure with another alien species and a third one. The entire process is now not dissimilar from studying nature and discovering physical law. But we believe that the process of discovering physical law can be mathematically specified (in the spirit of Solomonoff induction) and automated. Therefore, it seems likely that the same is true for the process of discovering the unknown value system of a given agent.

What is the alternative possibility? That it is impossible for us (humans) to understand Zerg values even in principle. But what would it even mean? That Zerg values are unknowable for someone who is not Zerg and human values are unknowable for someone who is not human? While I don't have a way of completely disproving this possibility, it seems implausible. For one thing, it smacks of Cartesian dualism. For another, it implies there is no way for me to understand the value system of another human, not even within a reasonable approximation (since all we need to build FAI is a reasonable approximation, a model of values sufficiently precise to create Eutopia instead of Dystopia). This feels a lot like assuming that you are the only person who has consciousness: it is not clear how to entirely disprove it, but it seems like something you should assign a very low prior probability.

Paul writes that "...this approach rests on an optimistic assumption: that it’s possible to model a human as an imperfect rational agent, and to extract the real values which the human is imperfectly optimizing." However I don't understand how it's possible to have a coherent worldview without this assumption. If there are no real values which I am imperfectly optimizing, what does it mean for any choice I make to be right or wrong? In what sense is life better than death or a future of humans living fulfilling lives better than a future of paperclip factories? When you say "some amount of utility", what is the utility you are talking about? Yes, it is difficult to specify the precise difference between "bias" and "true preference". However if there is no real difference, why do we call it bias? It is one thing to say we don't know the precise answer at the moment (but only a vague intuitive idea). It's quite another to say the answer doesn't exist even in principle (which amounts to moral nihilism). But if the answer does exist, who can be better at finding it than a superintelligent AI?

I agree that the testing problem is very hard. One thing we can do is use a superintelligent math oracle to compare the value of formally specified scenarios with respect to the utility function under test. However this will only work for scenarios which are sufficiently simple, which leaves lots of room for perverse instantiation. On the other hand I think that given a utility function for which there are principled reasons to believe it is correct, tests on simple scenarios are much stronger evidence than similar tests for heuristic AI. Another approach is trying to formally specify the concept of "manipulation" and use this to produce "unmanipulative" answers to questions about the utility function. However I have only rudimentary ideas about how to do it.

Paul's approach seems to be that instead of thinking about the consequences of a plan we should be thinking about the process producing and/or approving the plan. However I think it is questionable that the process of generating / approving plans has a more fundamental role than the consequences of a plan. The fundamental reason building a paperclip maximizer is bad is that this plan has the consequence of killing everyone, and it makes no difference how the plan was produced or who approved it. That said, I'm in favor of exploring all approaches that are at least somewhat promising.

Optimizing short-term instrumental preferences seems to me very dangerous. It's akin to giving an ape access to a control panel for nuclear missiles: humans are notoriously unreliable. I think this might be an acceptable method of blocking UFAI for a certain period but is unlikely to lead to good results in the long run.

Optimizing human approval is prone to marketing worlds. It seems less dangerous than physicalist AI in the sense that it doesn't create incentives to take over the world, but it might produce some kind of a hyper-efficient memetic virus.

Another angle of attack is using AGI to predict the results of very long human deliberation, in the style of Paul's indirect normativity. This approach has some failure modes e.g. it might happen that it's computationally infeasible to produce exact predictions and it's hard to specify what a valid "approximate prediction" means without solving goal specification in a different way. Nevertheless it seems definitely worth exploring.

Never mind the aliens: if I look at another human or even myself, I don't think my behavior is well-represented by any map from states of affairs to "how much I like that state of affairs."

I don't even really believe that I can extract information about my preferences beyond what you could get by just asking me questions. Such questions can contain a lot of information "about" my preferences---there are many things I really don't like, many things I do like, many things I would consider bad even if I liked them at the time, many explicit dependences of my preferences on unknown physical facts, and so on. But they don't at all define a utility function.

It seems like we are somehow talking about different things. Here are some very simple models on which humans are not imperfectly maximizing fixed preferences: humans are composed of interacting processes with different preferences, have time-varying preferences, have preferences which depend on the space of available options, have conflicting preferences-about-their-preferences that will evolve in a path-dependent way upon reflection.
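
One of these simple models can be checked mechanically. The sketch below assumes a toy setting with three options and brute-forces every strict ranking to see whether any fixed preference order explains a set of observed choices. The function name `rationalizing_rankings` and the example menus are hypothetical.

```python
from itertools import permutations


def rationalizing_rankings(options, choices):
    """Return every strict ranking (best first) consistent with the choices.

    choices maps a menu (frozenset of options) to the option picked from it.
    """
    consistent = []
    for ranking in permutations(options):
        position = {o: i for i, o in enumerate(ranking)}
        # A ranking explains the data if each pick is the top-ranked menu item.
        if all(min(menu, key=position.get) == picked
               for menu, picked in choices.items()):
            consistent.append(ranking)
    return consistent


# Preferences that depend on the available options: adding "c" flips a vs. b.
menu_dependent = {
    frozenset({"a", "b"}): "a",
    frozenset({"a", "b", "c"}): "b",
}
# Choices that a fixed ranking can explain.
stable = {
    frozenset({"a", "b"}): "a",
    frozenset({"a", "b", "c"}): "a",
}
```

For `menu_dependent` no ranking survives, so there are no fixed preferences for an inference procedure to recover, even though each individual choice may be perfectly sensible.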

I expect the real situation to be much messier than any of those simple models, such that the correct answer to "what do people want" is much less well-defined, but even in these simple models it sounds to me like you are saying that there is no sense in which decisions are right or wrong. I doubt you believe that, and so it seems like we must somehow be talking past each other.

The comparison to Solomonoff induction seems strange---I thought we were already taking for granted the ability to do powerful inference, and the question is "what then?" To what problem do you apply these powerful inference abilities?

Are you expressing optimism that the appropriate problem statement will become clear as we have more powerful AI techniques? Optimism that it will become clear if we think about it more? Something else? It certainly doesn't seem clear yet. I don't think that AI progress will make it much easier. So, if this sort of thing is going to work, then it seems like we should be able to make headway now. Of course this is what led me to my proposed formalization of indirect normativity.

Obviously a plan is mostly evaluated in terms of its consequences. We approve of plans that have consequences we like.

I assume it is infeasible to produce exact predictions about the outcome of very extensive deliberation. If we defined U as the output of such a process, we would build a system which maximizes U(a) given its uncertainty about U---not one which tries to predict the behavior of U. To me, this seems to capture the appropriate notion of "approximate prediction."
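
A toy illustration of the distinction, assuming the system's uncertainty about U is a finite list of weighted hypotheses (the action names and payoffs are made up for the example).

```python
def best_action(actions, hypotheses):
    """Pick the action with the highest expected utility.

    hypotheses: list of (probability, utility_function) pairs describing
    the system's uncertainty about what U will turn out to be.
    """
    def expected_utility(action):
        return sum(p * u(action) for p, u in hypotheses)

    return max(actions, key=expected_utility)


actions = ["safe", "gamble"]
hypotheses = [
    (0.6, lambda a: {"safe": 0.7, "gamble": 1.0}[a]),
    (0.4, lambda a: {"safe": 0.6, "gamble": -1.0}[a]),
]

chosen = best_action(actions, hypotheses)
# Contrast: first predict the single most likely U, then maximize that guess.
most_likely_u = max(hypotheses, key=lambda h: h[0])[1]
chosen_by_point_prediction = max(actions, key=most_likely_u)
```

Here the expected-utility maximizer picks "safe" while the predict-then-maximize system picks "gamble", because the latter ignores the 40% hypothesis under which gambling is very bad.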

The question whether a behavior is well-represented by a utility function is moot since it's not clear what is meant by "well-represented." The real question is whether a utility function is a good prescriptive model.

The behaviors which are incompatible with maximizing a utility function are commonly known as "cognitive biases." The very name suggests that these behaviors are considered undesirable, at least from the perspective of their consequences. That is, humans don't necessarily want to be modified into ideal rationalists because certain properties of the mind are terminal values by themselves, but the consequences of biased behavior are bad, ceteris paribus.

Therefore it seems reasonable to assume that the correct model of good vs. bad decisions would be a model that compares the decisions in terms of the expected value of some utility function. This is the utility function I want to extract.

The next question is whether this utility function can be inferred from human behavior. I think that most likely it can because otherwise we are forced into the unsavory philosophical position that values exist on some plane outside the physical universe.

Finally, there is the question of how to infer the utility function from human behavior. Here I don't have a definite answer (although I have a general idea), but it seems that the answer will look like the decomposition of human behavior into rational and irrational components. After all, the usual response to observing cognitive bias in your own behavior is not "I don't know what I want" but "it was stupid of me to do X, I should do Y instead."

The comparison to Solomonoff induction was not meant to suggest "just apply Solomonoff induction." It was meant as a meta-level analogy. The process of inferring physical law from observation was a vaguely defined intuitive process, but Solomonoff induction provided a rigorous mathematical model (although it is not naturalized). Similarly the idea of "human values" is a vague intuitive concept but I expect to be able to rigorously define it in the future.

I don't necessarily say that the specification of the utility function will be perfectly crisp. Solomonoff induction is also not perfectly crisp since it relies on a choice of model of computation. But all "non-contrived" models lead to similar priors; similarly, I expect all "non-contrived" choices in the definition of the utility extraction process to lead to similar functions (similar enough to get Eutopia from any superintelligent maximizer in this class).

The optimism I'm expressing is that I'm going to have a rigorous definition of the updateless intelligence metric soon enough. This brings us close to the objective since we can ask "for which utility function $U$ does the given agent have high intelligence?" Of course there are a lot of details here that are unclear (do we have a prior over $U$s, do we need to introduce other variables besides $U$, like variables representing the agent's prior, etc.) but I expect them to become clear in the future.

Regarding indirect normativity, I agree this is a viable approach but it relies on having a theory of physicalist agents in the context of which we are looking for a utility function. So, this might be an alternative to inverse optimization of the intelligence metric but it is not an alternative to creating a theory of physicalist agents.

I agree that there exist utility functions such that we'd be happy if they were maximized. The question is just how to construct one. I agree that observing human behavior is the way to go (after all, what else would you do?)

I get off the boat only when we start talking about the details of the inference procedure, e.g. if we propose to construct such a utility function directly by solving some inverse problem on human behavior. I say "directly" because I agree that solving inverse planning problems may be a useful ingredient.

As a special case, I am skeptical of finding a satisfactory utility function $U$ by maximizing the agent's intelligence with respect to $U$.

I may understand this proposal better if we discuss some concrete cases. For example, suppose that I apply it to a committee consisting of several humans. In this setting, the "right" answer seems to be the result of some bargaining process amongst the committee members. Do you expect to extract an appropriate solution concept by asking "what preferences are the committee most effectively satisfying"?

If so, I don't yet see how that would work. If not, what kind of answer do you expect your procedure to produce in cases like this one, and why think that this answer will be right?

The core of my disagreement with Jessica was about the importance of creating a theory of physicalist agents (without which it is meaningless to talk about utility functions that apply to physical reality). So, if I understood you correctly we don't have a cardinal disagreement here, especially if you allow that "solving inverse planning problems may be a useful ingredient."

Now, regarding the committee.

Consider an agent $A$ playing a game in extensive form. Let $O$ be the set of outcomes (leaves of the game tree) and $U : O \to \mathbb{R}$ be the utility function. Assuming $A$ is perfectly rational, it will choose a policy $\pi$ s.t. the probability distribution $\mu_\pi$ on $O$ induced by $\pi$ maximizes $\mathbb{E}_{\mu_\pi}[U]$. In reality we are interested in $A$ which are not perfectly rational but I am going to ignore this because I'm only addressing the "committee" aspect (which will soon follow). Denote by $\Delta(O)$ the space of probability distributions on $O$ and suppose we know the convex set $M \subseteq \Delta(O)$ of distributions that policies can induce (it is convex because I assume we can mix policies). Then there is a convex cone $C(M, \mu_\pi)$ of utility functions which yield $\pi$ as an optimal policy (i.e. $C(M, \mu_\pi) = \{V : \mathbb{E}_{\mu_\pi}[V] \geq \mathbb{E}_{\nu}[V] \text{ for all } \nu \in M\}$) and $U \in C(M, \mu_\pi)$. Therefore we can imagine a "utility inference process" $F$ s.t. for all $M$ and $\mu \in M$, $F(M, \mu)$ lies in the interior of $C(M, \mu)$ (for example $F(M, \mu)$ might be some kind of centroid of $C(M, \mu)$). Fixing $O$ and $U$, generically $C(M, \mu_\pi)$ becomes more narrow as the game becomes more complex, so for sufficiently complex games $F(M, \mu_\pi)$ will be a good approximation of $U$.

Now let's suppose the game is played by a collection of agents $A_1, \dots, A_n$ with respective utility functions $U_1, \dots, U_n$. The agents are cooperative so they choose a mutual policy $\pi$ s.t. $\mu_\pi$ is Pareto efficient. It is easy to see that in this case there are $\alpha_1, \dots, \alpha_n \geq 0$ s.t. $\sum_i \alpha_i = 1$ and $\sum_i \alpha_i U_i \in C(M, \mu_\pi)$. So $F(M, \mu_\pi)$ is a utility function that explains the behavior of the committee and that is approximately a convex linear combination of the members' utility functions.

If you optimize $\sum_i \alpha_i U_i$ on some $M' \supseteq M$ (the larger set of mixed outcomes accessible to the superintelligence) you get a Pareto efficient result so you can think of it as a "bargaining process." On the other hand, Nash bargaining in $M$ doesn't translate to Nash bargaining in $M'$. I'm not sure whether the latter is a serious flaw in the approach.
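
The committee claim can be sanity-checked numerically. The sketch below assumes a two-member committee and summarizes each pure joint policy by the pair of expected utilities it gives the members. Since a linear function on a convex hull is maximized at a vertex, it is enough to compare against the pure policies. The function name `supporting_weights` and the payoff numbers are hypothetical.

```python
def supporting_weights(points, target, steps=100):
    """Grid-search the weights under which `target` is an optimal policy.

    points: (u1, u2) expected-utility pairs of the pure joint policies.
    Returns every alpha on the grid such that `target` maximizes
    alpha * u1 + (1 - alpha) * u2 over all points.
    """
    found = []
    for k in range(steps + 1):
        alpha = k / steps

        def score(p):
            return alpha * p[0] + (1 - alpha) * p[1]

        # Small tolerance so exact ties are not lost to rounding.
        if all(score(target) >= score(p) - 1e-12 for p in points):
            found.append(alpha)
    return found


policies = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.3, 0.3)]
compromise_weights = supporting_weights(policies, (0.6, 0.6))
dominated_weights = supporting_weights(policies, (0.3, 0.3))
```

The Pareto efficient compromise (0.6, 0.6) is optimal for a whole interval of weights, which is the two-member analogue of the cone of explaining utility functions, while the dominated policy (0.3, 0.3) is optimal for none.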

I agree that rational agents will eventually converge to behavior that is optimal according to some utility function, under suitable consistency conditions. Assuming that convergence has already occurred is dodging the difficulty I'm trying to point at.

My approach to indirect normativity is to allow the process of reflective deliberation to continue long enough that we expect it to have converged adequately. This doesn't require any knowledge about the structure of the agents, just the ability to make predictions about what will happen as they reflect. (Reflection would include, for example, whatever deliberations the committee engages in in order to arrive at an effective compromise policy.)

You seem to be offering an alternative, which you hope will dodge some of the difficulties of my formulation of indirect normativity by leveraging something about the nature of rational behavior. If I am making observations of the committee's performance, and the committee (like almost any real committee) has not converged to an efficient compromise policy, then it seems like your solution concept is supposed to recover this bargaining solution from the committee's actual decisions. But it isn't clear to me how that could work, except by inferring how the committee would reflect and then running that reflective process.

I am, of course, not convinced that we need a theory of physicalist agents (since if we run this reflective process then we might as well directly extract preferences over actions, folding the evaluation of consequences into the implicit predictions about these preferences over actions). It would be nice to have a better theoretical understanding. But that's why I'm not working on the problem.

I think that inverse optimization done right will implicitly recover behavior under long deliberation without the need to specify a particular deliberation environment in detail.

Consider a single imperfectly rational agent $A$ (I think that in light of my previous illustration there is no difference of principle between a single imperfectly rational agent and an imperfectly rational committee). I model the irrationality of the agent as the result of bounded computing power + a certain complex choice of logical prior. Specifically, consider again a game in extensive form but suppose the set of outcomes $O$ represents a set of "impossible possible worlds" and $O_0 \subseteq O$, where $O_0$ is the space of "possible worlds." Each node in the game tree is controlled either by $A$ or by Nature. There are two probability distributions over Nature's actions: the probability distribution $p$ assumed by $A$, which includes $A$'s logical uncertainty, and the "true" (logically omniscient) probability distribution $p_0$. I assume that $p$ dominates $p_0$ and that $p_0$ is s.t. regardless of $A$'s action only outcomes in $O_0$ receive non-zero probability. Then, the same utility inference process $F$ as before gives us a utility function $U : O \to \mathbb{R}$ which can be restricted to $O_0$ to yield the "true" utility function $U_0$. This means that $U_0$ will be predominantly influenced by game scenarios in which $A$ deliberates a lot and updates its logical knowledge since other game scenarios are likely to yield outcomes outside of $O_0$ (likely according to $p$).

As a simple example, suppose $O = \{0, 1\}^3$ where, given $(x, y, z) \in O$, $x = 0$ (respectively 1) means the physics of the universe favors conventional medicine over homeopathy (respectively homeopathy over conventional medicine), $y = 0$ (respectively 1) means a double-blind controlled experiment came out in favor of conventional medicine (respectively homeopathy) and $z = 0$ (respectively 1) means $A$ decided to take a conventional pill for its illness (respectively a homeopathic pill). $O_0 = \{(x, y, z) \in O : x = 0\}$. $U$ rewards outcomes in which $z = x$.

According to $p$, there is a 90% probability that $x = 1$. $A$ is given a direct observation of $y$ and then the game branches (according to Nature's whim): in one branch $A$ has no time to deliberate and in a different branch it has time to deliberate and make the logical observation that either $x = y$ or $x \neq y$ (decided by Nature). In the no-deliberation branch $x$ and $y$ are $p$-independent variables. In the deliberation branch $x$ and $y$ are $p$-coupled according to the logical observation made. On the other hand $p_0$ assigns probability 1 to the logical observation $x = y$ and to $x = 0$ in either branch (of course in a more realistic model $x = 0$ wouldn't be a logical necessity, only something likely w.r.t. a simplicity prior).

In this example, $A$ will choose the homeopathy pill in the no-deliberation branch and either the homeopathy or the conventional pill in the deliberation branch, depending on the logical observation it makes (of course according to $p_0$ it always chooses the conventional pill in the deliberation branch). For an appropriate choice of $p$, $A$ will be much more $p$-likely to choose a homeopathy pill. Nevertheless, the extracted $U_0$ will prefer the conventional pill.
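
The agent's side of this example can be traced in a few lines. This is only a sketch of one reading of the setup: the agent wants the pill that matches the physics, starts with a 90% prior in favor of homeopathy, observes the experiment, and may or may not get to make a logical observation linking the experiment to the physics. The function name `choose_pill` and the coupling labels are hypothetical.

```python
def choose_pill(prior_homeopathy, experiment, coupling=None):
    """Return 1 for the homeopathic pill, 0 for the conventional one.

    experiment: 0 if the trial favored conventional medicine, 1 otherwise.
    coupling: None (no time to deliberate), "same" (deliberation shows the
    physics agrees with the experiment) or "opposite" (it disagrees).
    """
    if coupling is None:
        # Experiment and physics are independent under the agent's prior,
        # so the observation carries no information.
        belief = prior_homeopathy
    elif coupling == "same":
        belief = 1.0 if experiment == 1 else 0.0
    elif coupling == "opposite":
        belief = 0.0 if experiment == 1 else 1.0
    else:
        raise ValueError("unknown coupling: %r" % (coupling,))
    # The agent is rewarded when the pill matches the physics.
    return 1 if belief > 0.5 else 0


no_deliberation = choose_pill(0.9, experiment=0)
deliberation_same = choose_pill(0.9, experiment=0, coupling="same")
deliberation_opposite = choose_pill(0.9, experiment=0, coupling="opposite")
```

Only the branch in which the agent deliberates and learns that the experiment tracks the physics produces the conventional pill, which is the branch the "true" distribution treats as certain.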

One question you may ask is how do I know this type of model is a good fit for human rationality. Obviously at this point I don't have a complete answer but note that optimal predictors are defined by their asymptotic behavior which suggests that choosing a logical prior can yield arbitrary behavior in the short-deliberation mode. This leads to the converse objection that simultaneously looking for a utility function and a logical prior can lead to overfitting. Again, I don't have an impeccable argument but I think that the correct formalism will strike a good balance between underfitting and overfitting because assigning human values to humans seems natural from the perspective of an external observer even when this observer never directly observes extremely long deliberation or has an external reason to care about such a scenario (which is the point I was trying to make with the Zerg example).

On the other hand, you made me realize that using hypothetical long deliberation for defining preferences over an action space (e.g. preferences over the set of possible programs for a physicalist AGI) might be a viable approach, although not without its own difficulties. I updated towards "it is possible to solve FAI without a theory of physicalist agents."

EDIT: See my reply to Wei Dai for what I think is a strong argument why we can't do without a theory of physicalist agents after all.

Suppose I'm studying Zerg values, and I find that many Zergs think their values depend on the solutions of various philosophical problems (the degree and nature of these dependencies are themselves open problems). It seems that the following are some potentially feasible ways for me to proceed:

  1. Wait for the Zerg to solve the problems themselves and tell me the solutions.
  2. Work out (or have the Zerg work out) an algorithm for philosophical reasoning or a formal definition of what it means for a solution to a philosophical problem to be correct, then apply computing power.
  3. If my own philosophical abilities are sufficiently similar or superior to the Zerg's, apply them.
  4. Learn to do philosophy from the Zerg through imitation and/or other applicable forms of learning, then help them solve the philosophical problems.
  5. Upload some Zergs into a computer and simulate them until they solve the problems.

Can you explain how your approach would solve this problem? Which of the above (if any) is the closest analogy to how your approach would solve this problem?

I think 5 is the closest analogy (in which respect it is similar to Paul's approach).

Philosophical problems are questions about concepts that don't have clear-cut semantics, but the agent implicitly uses some rules to reason about these questions (I think this metaphilosophical perspective is essentially the same as Wittgenstein's "language games"). Therefore, given a model of the agent, it is possible to deduce the rules and derive the answer.

For example, suppose the Zerg consider rubes sacred and bleggs abhorrent. The Zerg consider it indisputable that red cubes are rubes and blue egg-shaped objects are bleggs. The status of blue cubes is controversial: some Zerg argue that they are rubes, some that they are bleggs, some that they are neither and many are simply unsure. However, it is a logical fact that given sufficient time to deliberate, the Zerg would arrive at the conclusion that blue cubes are rubes.

This situation is structurally very similar to the example with homeopathic vs. conventional pills. It seems natural to model the Zerg as having logical uncertainty about the sentence "blue cubes are rubes" which means that if you can predict that given enough time they would update towards this sentence, you should assign positive Zerg-value to blue cubes.

By "Paul’s approach" you meant Paul's "indirect normativity" approach, rather than his current "act-based" approach, right?

Can you compare your approach and Paul's "indirect normativity" approach in more detail? In Paul's approach, the FAI programmer specifies a utility function U which explicitly incorporates a simulated human in a virtual environment. I can certainly see how 5 is the closest analogy for his approach. In your approach, if I understand correctly, U is to be inferred by an algorithm using observations of human behavior as input. Are you expecting this U to also incorporate a simulated human? How do you expect your U to be different from Paul's U? For example are the simulations at different levels of detail? Or is the main difference that you expect your approach to also produce a logical prior that will cause the FAI to behave in a reasonable way while it lacks enough computing power to reason deeply about U (e.g., while it can't yet compute the outcome of the simulated human's philosophical deliberations)?

Yeah, I was referring to indirect normativity.

The main advantage of my approach over Paul's is in terms of the projected feasibility of creating an AGI with such a utility function.

Paul considers a specific simulation that consists of a human in a virtual environment in which the human deliberates over a very long time, eventually producing a formal specification of a utility function using some kind of output channel the virtual environment provides ("virtual terminal"). From the AGI point of view the entire simulation is an arbitrary (possibly randomized) program that produces a string of bits that it interprets as the specification of a utility function, a program so costly that it cannot be executed (but presumably it is still able to reason about it).

This presupposes a way to formally specify a human. Such a way may either be constructed by hand (which requires a lot of neuroscience) or produced by the AGI from observations (although this requires care to avoid vulnerability to simulation warfare). In this respect the two approaches are quite similar.

My approach doesn't assume either a virtual environment or a language in which the human specifies the utility function. Instead, the utility function is inferred directly from the specification of a human. Essentially the AGI is looking for the utility function $U$ for which it would be most reasonable to say (in some specific formal sense) that "the given program is an intelligent agent with utility function $U$."

In order to find the right utility function, we need to first understand what type of object is "a utility function for an AGI." This means both understanding the domain of the utility function and the properties the utility function must have. The first question is non-trivial mostly for physicalist agents, the second question is non-trivial for "logic optimizers" as well.

I expect that in order for it to be feasible to construct an AGI with given utility function, the utility function must belong to some complexity class. Presumably an agent should be able to reason about a utility function it cannot evaluate using logical uncertainty. Let's examine this using the optimal predictor framework. Any function has an optimal predictor, however in general this predictor requires advice as expensive as the function itself. Thus even though agents with utility functions of arbitrary computational complexity are physically permissible, it is not feasible to construct such an agent given reasonable computational resources (i.e. the impediment is not the computational resources of the agent but the computational resources of the process constructing the agent, even though for self-improving agents it might be essentially the same). On the other hand, we know that some complexity classes (most notably the class of generatable problems) admit uniform (adviceless) optimal predictors. This suggests that any feasible utility function has to be in such a class.

Strictly speaking, complexity classes are defined by asymptotic behavior of resource requirements. So a problem that can be solved within time is still in the class . In particular, we can consider utility functions with good asymptotic behavior and terrible "practical" behavior. I think that in this case we will be able to create an agent that can unboundedly self-improve in theory but in practice it will require infeasible time for it to grow from seed to anything useful.

The complexity class of Paul's utility function depends on the language in which the simulated human specifies their output. However, regardless of the language, there is an enormous additive constant associated with running the simulation itself. If there is a uniform optimal predictor for e.g. then the price we pay is only a polynomial in . However, this seems unlikely. If there was some a priori known structure in the simulation that can be exploited, there would be a chance of using an optimal predictor for some smaller class. However, it doesn't seem that Paul's construction provides any such structure.

On the other hand my approach seems to supply a lot of structure, and specifically there is a good chance that it is generatable. Generatable functions are functions the graph of which can be sampled, which is more or less the same as functions admitting supervised learning (indeed my optimal predictor is essentially a primitive "universal supervised learning" algorithm). From this perspective, my approach consists of first training a utility inference algorithm on agents with utility functions randomly sampled from some prior and then applying this algorithm to a human.

  • As usual, I imagine the human outputting a function V : actions → [0,1], rather than trying to resolve "what kind of object is a utility function?" in advance. I am not yet convinced that there is any problem with this.
  • I agree that such an abstractly-defined utility function is not implementable using any known techniques.
  • Simple/generatable functions are indeed easier to optimize, but I think we should just talk directly about why they are easier to optimize. They are easier to optimize because you can explicitly compute them in order to provide feedback.
  • I'm still not convinced that there is a working formulation along the lines that you propose. But if there is, then it seems clear that the instance for actual humans will be much too complex for us to actually compute with. It will also have other huge differences. For example, a successful agent should use information from the world around it in order to reason about humans, but this will not be the case for the other instances you can sample. So overall it seems to me like your approach will still be relying on relatively strong assumptions about transfer learning.
  • Once we are already relying on such strong assumptions about transfer learning, it seems more promising to go straight for training an algorithm for maximizing arbitrary symbolically defined goals.
  • As usual, I am skeptical about this whole approach, and think that it should be at best plan B if we can't get something with more direct feedback to work.
  • In particular, it seems to me that you won't be able to build an efficient agent out of your approach unless either you solve this problem, or else you get a lucky break with respect to which techniques prove to be efficient. Do you have a different sense, or are you just more willing to tolerate inefficiency?

"As usual, I imagine the human outputting a function V : actions → [0,1], rather than trying to resolve 'what kind of object is a utility function?' in advance."

Yeah, I realize this. This is why I pointed out that the domain problem is irrelevant for logic optimizers but the complexity class problem is still very relevant.

"I agree that such an abstractly-defined utility function is not implementable using any known techniques... it seems more promising to go straight for training an algorithm for maximizing arbitrary symbolically defined goals"

I strongly suspect this is infeasible for fundamental complexity-theoretic reasons.

"Simple/generatable functions are indeed easier to optimize, but I think we should just talk directly about why they are easier to optimize. They are easier to optimize because you can explicitly compute them in order to provide feedback."

Depends on the meaning of "compute." Generatable functions allow efficiently sampling pairs s.t. . This is very different from being able to efficiently evaluate for a given . It seems likely that the evaluation of a generatable function can only be done in exponential time in general.
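To make the distinction concrete, here is a minimal sketch of a function whose graph is easy to sample but which is only evaluated here by exhaustive search. The choice of the discrete logarithm modulo a small prime, and the names `P`, `G`, `sample_pair` and `evaluate_by_search`, are illustrative assumptions of this sketch and not part of the optimal predictor framework. At this toy size brute force is cheap; the point is only the asymmetry between the two procedures.

```python
import random

# Toy parameters (illustrative assumptions): a small prime and a primitive root.
P = 101
G = 2  # 2 generates the multiplicative group modulo 101

def sample_pair(rng):
    """Sample (x, y) with f(x) = y, where f is the discrete log base G mod P.

    Sampling is cheap: pick the answer y first, then compute x = G^y mod P.
    """
    y = rng.randrange(P - 1)
    x = pow(G, y, P)
    return x, y

def evaluate_by_search(x):
    """Evaluate f(x) for a given x by exhaustive search over exponents."""
    for y in range(P - 1):
        if pow(G, y, P) == x:
            return y
    raise ValueError("x is not in the group generated by G")

rng = random.Random(0)
pairs = [sample_pair(rng) for _ in range(5)]
```

Each call to `sample_pair` costs one modular exponentiation, while `evaluate_by_search` loops over the whole exponent range, which grows exponentially in the bit length of `P`.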

"...it seems clear that the instance for actual humans will be much too complex for us to actually compute with."

I'm not sure I understand you. Perhaps you assume that we provide the AGI an inefficient specification of a human that it cannot actually run even for a relatively short timespan. Instead, either we provide it an efficient whole brain emulation or we provide observations and let the AGI build models itself. Of course this is very prone to mind crime but I think that your approach isn't immune either.

"...a successful agent should use information from the world around it in order to reason about humans, but this will not be the case for the other instances you can sample."

You mean that since humans are adapted towards their current environment it is important to understand this context in order to understand human values? I believe this can be addressed in my framework. Imagine that the supervised learning procedure samples not just agents but environment + agent couples with the agent adapted to the environment to some extent. Feed the algorithm with observations of the agent and the environment. After training, feed the algorithm with observations of humans and the real environment.

"So overall it seems to me like your approach will still be relying on relatively strong assumptions about transfer learning."

I don't think so. An optimal predictor produces estimates which are as good as any efficient algorithm can produce, so if this problem indeed admits an optimal predictor (e.g. because it is generatable) then my agent will succeed at value inference as much as anything can succeed at it.

"In particular, it seems to me that you won’t be able to build an efficient agent out of your approach unless either you solve this [informed oversight] problem..."

The solution to the informed oversight problem is easy if is generatable. In such cases you cannot construct a satisfactory but you can still provide with correct input-output pairs. In particular it works in your toy model because it's easy to produce triplets of the form and add them to the training set. Actually it works exactly for the same reason your is able to exploit the naive approach.

"In such cases you cannot construct a satisfactory but you can still provide with correct input-output pairs."

I agree that you can train a predictor to make an honest effort to predict , but that is not good enough to train a policy to make an honest effort to maximize .

  • You can train a model to predict from . The trained model will in some sense be optimal, but it won't necessarily be accurate. That is, there need not be any efficient algorithm to distinguish from with . (Assuming one-way functions exist.)
  • The real question is how we train a policy . If we have a sufficiently good predictor of then we can use its predictions as a training signal. But in general there does not exist any sufficiently good predictor. So what do we do now?

You are right! It is insufficient for to be generatable. What we need is the existence of an efficient randomized algorithm s.t. when executed produces a sample of and then waits for receiving input , after receiving which it gives an estimate of . More formally produces a pair where is a program and . Such a can be used directly to train a maximizer.
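A minimal sketch of a generator with this interface may help: it emits a sample first and returns a program that can score any later response, because the program retains the hidden state used to produce the sample. The toy utility (closeness of the response to a hidden exponent) and the names `sequential_generator`, `evaluate`, `P` and `G` are assumptions made for illustration only.

```python
import random

P = 101  # small prime (toy assumption)
G = 2    # primitive root modulo P (toy assumption)

def sequential_generator(rng):
    """Return (x, evaluate): a sample x and a program scoring responses to x."""
    secret = rng.randrange(P - 1)   # hidden state, never shown to the policy
    x = pow(G, secret, P)           # the sample that is published first

    def evaluate(y):
        """Score a response y in range(P - 1); returns a value in [0, 1]."""
        return 1.0 - abs(y - secret) / (P - 2)

    return x, evaluate

x, evaluate = sequential_generator(random.Random(1))
score = evaluate(0)  # feedback for the (arbitrary) response y = 0
```

A policy under training would see only `x`, propose `y`, and receive `evaluate(y)` as its training signal. Scoring `y` without the retained `secret` would require recovering the exponent from `x`.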

Luckily my approach is within this class (at least superficially; but this lends hope for success after a deeper analysis). When you sample a utility function, a world and an agent and "render" a sequence of observations, you should be able to evaluate the output of the algorithm under training according to the selected utility function.

Actually, I think that if we consider only deterministic maximization policies then an optimal predictor for wrt a bounded-Solomonoff-type measure is sufficient to get an optimal maximization policy. In this case we can do maximization using Levin's universal search . A significantly better maximization policy cannot exist since it would allow us to improve our estimates of and/or .

Of course under standard assumptions about derandomization the best deterministic policy is about as good as the best random policy (on average), and in particular requiring is generatable wrt a suitable measure is sufficient. Randomness only gives a significant advantage in adversarial setups: your toy model of informed oversight is essentially an adversarial setup between and .

Also, obviously Levin search is grossly inefficient in practice (like the optimal predictor which is basically a variant of Levin search) but this model suggests that applying a more practical learning algorithm would give satisfactory results.

Also, I think it's possible to use probabilistic policies which produce computable distributions (since for such a policy outputs with low and high probabilities are always distinguishable).

Yes, this is what you need. I agree that your setup nearly has this property, in that you get out a complete description of the utility function which can be used to evaluate an action.

But my main concern is that even with a complete description of the utility function in hand, you still can't efficiently evaluate it on possible actions, at least not without some further assumptions about the utility function (which I don't think hold).

That is, I am concerned that even once you know the utility function, it will have the same character as the U in my toy model (because of logical uncertainty).

As a complexity-theoretically clean example, suppose that I am really unhappy if an output has a steganographically encoded message. Some policies will predictably produce steganographically encoded messages and therefore should receive low utility, but I can't actually distinguish these messages from random messages.

This example is a bit unsatisfying, because steganographically encoding a message is harder than generating a random message. The novelty example in my post is more satisfying in that sense, but it introduces a much smaller complexity gap. It would be interesting to find a plausible example that has the best of both worlds. If there is no example, it would be very interesting to understand why there isn't.

As I said, I expect it's not meaningful to consider agents with arbitrary utility functions. As a working hypothesis, we can assume the utility function should be "sequentially generatable" i.e. have the property I previously described. The nice thing is that if we sample sequentially generatable utility functions then the total utility function (which includes utility inference in its definition) is also sequentially generatable.

To see this, consider a simple setup in which our agent is exposed to observations of "teacher" agent after which receives input and produces output . The total utility function has the following "sequential generator" . samples a sequentially generatable utility function with sequential generator . It then constructs an agent with utility function and runs the agent to generate . It uses to generate . After receiving , it uses to compute an estimate of which thus serves as an estimate of .

It is possible that there is a larger complexity class of admissible utility functions (although it's possible that there isn't), but in that case I would hope that the utility inference procedure, or some variant thereof, preserves this class as well.

I still agree it would be interesting to construct a complexity-theoretic parallel of your toy model for informed oversight. The following is my attempt to do it (which I'm not sure really works).

Consider a directed graph where each -valent vertex is labeled by a tensor of rank with rational coefficients (you can also use other coefficients e.g. a finite field) where each covariant index of the tensor corresponds to an incoming edge and each contravariant index to an outgoing edge. It is thus possible to contract the tensors according to the graph structure obtaining a number .

There is a natural concept of isomorphism for graphs as above s.t. implies . That is, we consider regular graph isomorphisms where relabeling the edges transforms the tensors in the obvious way (it is also possible to consider a more general type of isomorphism where we are also allowed to do a "base change" at each edge, but it is not compatible with the following).

Given such a graph , it is possible to apply a one-way function to each coefficient of each tensor (otherwise preserving all structure), getting a new graph . Obviously implies .

Suppose the input is a as above and the output is another such graph . when but not and otherwise.

Clearly given it's easy to construct a random isomorphic . However it might be hard to check for this kind of isomorphism (although regular graph isomorphism is claimed to be solvable in quasipolynomial time and might easily be solvable in polynomial time to the best of our knowledge). It is also likely to be hard to construct a non-isomorphic with .

"I don’t think so. An optimal predictor produces estimates which are as good as any efficient algorithm can produce, so if this problem indeed admits an optimal predictor (e.g. because it is generatable) then my agent will succeed at value inference as much as anything can succeed at it."

This is true asymptotically. The same argument can be applied to training an algorithm for pursuing symbolically defined goals, even using conventional techniques. The problem is that quantitatively these arguments have no teeth in the applications we care about, unless we can train on instances as complex as the one we actually care about.

I don't know if we disagree here. We probably reach different conclusions because of our disagreement about whether you can generate instances of value inferences that are as complex as the real world.

"I strongly suspect this is infeasible for fundamental complexity-theoretic reasons."

Humans can be motivated to effectively optimize arbitrary abstractly defined goals. And while pursuing such goals they can behave instrumentally efficiently, for example they can fight for their own survival or to acquire resources just as effectively as you or I.

Perhaps you can't get the particular kind of guarantee you want in these cases, but it seems like we have a proof of concept (at least from a complexity-theoretic perspective).

ETA: Practically speaking I agree that a symbolic specification of preferences does not solve the problem, since such a specification isn't suitable for training. The important point was that I don't believe your approach dodges this particular difficulty.

"Humans can be motivated to effectively optimize arbitrary abstractly defined goals."

I disagree. Humans can optimize only very special goals of this type. Our intuition about formal mathematical statements is calibrated using formal proofs. This means we can only estimate the likelihood of statements with short proofs / refutations.

To make it formal, consider a formal theory and an algorithm that, given a certain amount of time to work, randomly constructs a formal proof / refutation of some sentence (i.e. it starts with the axioms and applies a random* sequence of inference rules). produces sentences together with a bit which says the sentence is provable or refutable. This means defines a word ensemble (with parameter ) w.r.t. which the partial function is generatable. In particular, we can construct an optimal predictor for this function. However, this optimal predictor will only be well-behaved on sentences that are relatively likely to be produced by i.e. sentences with short proofs / refutations. On arbitrary sentences it will do very poorly.

*Perhaps it's in some sense better to think of that randomly samples a program which then controls the application of inference rules, but it's not essential for my point.

I agree that you almost certainly can't get an optimal predictor. For similar reasons, you can't train a supervised learner using any obvious approach. This is the reason that I am pessimistic about this kind of "abstract" goal.

That said, I'm not as pessimistic as you are.

Suppose that I define a very elaborate reflective process, which would be prohibitively complex to simulate and whose behavior is probably not constrained in any meaningful way by any short proofs.

I think that a human can in fact try to maximize the output of such a reflective process, "to the best of their abilities." And this seems good enough for value alignment.

It's not important that we actually achieve optimality except on shorter-term instrumentally important problems such as gathering resources (for which we can in fact expect the abstractly motivated algorithm to converge to optimality).

"Depends on the meaning of “compute.”"

I am referring to generatability throughout.

"Perhaps you assume that we provide the AGI an inefficient specification of a human that it cannot actually run even for a relatively short timespan. Instead, either we provide it an efficient whole brain emulation or we provide observations and let the AGI build models itself."

In order to train such an AI, we need to produce data of the form (observations of agents acting in an environment, corresponding learned utility function). In order for the human example to actually be in distribution, you would need for these agents + environments to be as sophisticated as humans acting in the real world.

So here is my impression of what you are proposing:

  • Sample some preferences
  • Sample a world including an agent imperfectly pursuing those preferences. This world should be as complex as our world.
  • Simulate some observations of that world
  • Evaluate some proposed actions according to the original preferences.
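The four steps above can be sketched as a data-generating procedure. Everything here is a deliberately tiny stand-in: the "world" is collapsed into an agent choosing among four actions with Boltzmann noise, and the names `sample_preferences`, `simulate_observations` and `make_training_example` are hypothetical. The sketch shows only the shape of the pipeline; it does nothing to address the requirement that the sampled world be as complex as ours.

```python
import math
import random

NUM_ACTIONS = 4  # toy world: an action is just an index (assumption)

def sample_preferences(rng):
    """Step 1: sample a utility in [0, 1) for each action."""
    return [rng.random() for _ in range(NUM_ACTIONS)]

def simulate_observations(utility, rng, steps=20, temperature=0.2):
    """Steps 2-3: an agent imperfectly pursuing the preferences.

    The agent picks actions with probability proportional to
    exp(utility / temperature), so better actions are favored but not certain.
    """
    weights = [math.exp(u / temperature) for u in utility]
    return rng.choices(range(NUM_ACTIONS), weights=weights, k=steps)

def make_training_example(rng):
    """Produce (observations, proposed action, utility of that action)."""
    utility = sample_preferences(rng)
    observations = simulate_observations(utility, rng)
    proposed_action = rng.randrange(NUM_ACTIONS)
    label = utility[proposed_action]  # Step 4: score under the true preferences
    return observations, proposed_action, label

observations, proposed_action, label = make_training_example(random.Random(0))
```

A utility inference algorithm would be trained to predict `label` from `observations` and `proposed_action`, and then be applied to observations of humans.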

This is what seems implausible to me. Perhaps the clearest illustration of the difficulty:

I strongly suspect that we will build powerful AI systems (for which the alignment problem must be resolved) before we are able to simulate anything vaguely human-like, even setting aside the inverse problem (of simulating actual humans) altogether.

This seems pretty safe to me, given that a simulation of a vaguely human-like intelligence would itself constitute such a powerful AI system and there is no reason to think the two capabilities would arise simultaneously.

But in this case, it doesn't seem like we can implement the proposed training process.

I think it's reasonable to assume that any agent that meaningfully comprehends human values has fairly good models of humans. The agents on which we train don't have to be as complex as a low-level emulation of the human brain, they only have to be as complex as the model humans the AGI is going to imagine. Similarly, the worlds on which we train don't have to be as complex as a low-level emulation of physics, they only have to be as complex as the models that the model humans have for the real world.

"The agents on which we train [...] only have to be as complex as the model humans the AGI is going to imagine."

This doesn't seem right. The agents on which we train must look indistinguishable from (a distribution including) humans, from the agent's standpoint. That is a much higher bar, even complexity-theoretically!

For example, the AI can tell that humans can solve Sudoku puzzles, even if it can't figure out how to solve Sudoku in a human-like way. The same holds for any aspect of any problem which can be more easily verified than implemented. So it seems that an AI would need most human abilities in order to fool itself.

My concern about this approach is closely related to my concerns about imitation learning. The situation is in some ways better and in some ways worse for your proposal---for example, you don't get to do meeting halfway, but you do get to use very inefficient algorithms. Note that the imitation learning approach also only needs to produce behavior that is indistinguishable from human as far as the agent can tell, so the condition is really quite similar.

But at any rate, I'm more interested in the informed oversight problem than this difficulty. If we could handle informed oversight, then I would be feeling very good about bootstrapped approval-direction (e.g. ALBA). Also note that if we could solve the informed oversight problem, then bootstrapped imitation learning would also work fine even if imitation learning is inefficient, for the reasons described here.

And if you can avoid the informed oversight problem I would really like to understand how. It looks to me like an essential step in getting an optimal policy in your sense of optimality, for the same reasons that I think it might be an essential part of getting any efficient aligned AI.

Consider an environment which consists of bitstrings followed by where is an invertible one-way function. If you apply my version of bounded Solomonoff induction to this environment, the model that will emerge is a model that generates , computes and outputs followed by . Thus appears in the internal state of the model before even though appears before in the temporal sequence (and even though physical causality might flow from to in some sense).

Analogously, I think that in the Sudoku example the model human will have mental access to a "cheat sheet" of the Sudoku puzzle generated by the environment which is not visible to the AGI.

Some problems have this special structure, but it doesn't seem like all of them do.

For a really straightforward example, consider a random SAT instance. What kind of cheat sheet is nature going to generate, without explicitly planting a solution? (Planted solutions look different from organic solutions to random instances.)

It might be that most human abilities are just as hard to verify as to execute, or else permit some kind of cheat sheet. (Or more generally, permit a cheat sheet for the part of the problem that is easier to verify than execute.) But at face value I don't see why this would be.

It's not clear how you can tell that a magic box solves random SAT instances. Some SAT instances are unsolvable so that the magic box will have to occasionally output but it's not clear how to distinguish between instances that are genuinely unsolvable and instances that you cannot solve yourself.

I guess there might be a distributional search problem s.t.

(i) A solution can be verified in polynomial time.

(ii) Every instance has a solution.

(iii) There is no way to efficiently generate valid input-output pairs obeying a distribution which is indistinguishable from the right distribution.

I see no obvious reason such a problem cannot exist but I also don't have an example. It would be interesting to find one.

In any case, it is not clear how the agent can learn to identify models like this. You can only train a predictor on models which admit sampling. On the other hand, if there is some kind of transfer learning that allows generalizing to some class of models which don't admit sampling then it's plausible there is also transfer learning that allows generalizing utility inference from sampleable agents to agents in the larger class. Also, it might be possible to train utility inference on some sort of data produced by the predictor instead of on raw observations s.t. it's possible to sample the distribution of this intermediate data corresponding to complex agents.

But it's hard to demonstrate something concrete without having some model of how the hypothetical "superpredictor" works. As far as I know, it might be impossible to effectively use unsampleable models.

The point was that we may be able to train an agent to do what we want, even in cases where we can't effectively build a predictor.

Re: your example. You can do amplification to get exponentially close to certainty (choose instances that are satisfiable with 2/3 probability, and then consider the problem "solve at least half of these 1000 instances"). If you really want every instance to have a solution, then you can probably generate the instances pseudorandomly from a small enough seed and do a union bound.
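As a rough check on the amplification numbers, the probability that fewer than half of the 1000 instances are satisfiable can be computed exactly. This sketch assumes the instances are satisfiable independently, each with probability exactly 2/3, which the comment above does not state; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb, exp

N = 1000            # number of instances in the amplified problem
Q = Fraction(2, 3)  # assumed probability that a single instance is satisfiable
NEED = N // 2       # "solve at least half of these instances"

# Exact probability that fewer than NEED instances are satisfiable,
# i.e. that the amplified problem has no solution at all.
failure = sum(comb(N, k) * Q**k * (1 - Q)**(N - k) for k in range(NEED))

# Hoeffding's inequality gives a closed-form upper bound on the same tail.
hoeffding = exp(-2 * (float(Q) * N - NEED) ** 2 / N)

print(float(failure), hoeffding)
```

The Hoeffding bound is exp(-N/18), so the failure probability shrinks exponentially in the number of instances, which is the sense in which amplification gets "exponentially close to certainty."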

By "predictor" I don't mean something that produces exact predictions, I mean something that produces probabilistic predictions of given quantities. Maybe we should call it "inductor" to avoid conflation with optimal predictors (even though the concepts are closely related). As I said before, I think that an agent has to have a reasonable model of humans to follow human values. Moreover an agent that doesn't have a reasonable model of humans is probably much less dangerous since it won't be able to manipulate humans (although I guess the risk is still non-negligible).

The question is what kind of inductors are complexity-theoretically feasible and what class of models do these inductors correspond to. Bounded Solomonoff induction using works on the class of samplable models. In machine learning language, inductors using samplable models are feasible since it is possible to train the inductors by sampling random such models (i.e. by sampling the bounded Solomonoff ensemble). On the other hand it's not clear what broader classes of models are admissible if any.

That said, it seems plausible that if it's feasible to construct inductors for a broader class, my procedure will remain efficient.

Model 1: The "superinductor" works by finding an efficient transformation of the input sequence and a good sampleable model for . E.g. contains the results of substituting the observed SAT-solutions into the observed SAT-instances. In this model, we can apply my procedure by running utility inference on instead of .

Model 2: The superinductor works via an optimal predictor for SAT*. I think that it should be relatively straightforward to show that given an optimal predictor for SAT + assuming the ability to design AGIs for target utility functions relative to a SAT oracle, there is an optimal predictor for the total utility function my utility inference procedure defines (after including external agents that run over a SAT oracle). Therefore it is possible to maximize the latter.

*A -optimal predictor for SAT cannot exist unless all sparse problems in NP have efficient heuristic algorithms in some sense, which is unlikely. On the other hand, there is no reason that I know why a -optimal predictor for SAT cannot exist.

I mostly agree with what Paul has said. I probably don't have a cardinal disagreement with you either; I think creating a theory of values for computationally bounded agents is useful. I just expect that you can't directly turn your definition into a way of inferring human values robustly; that will require a lot more work. It's almost certainly an important component to a full solution to value learning, and might also be useful in partial solutions.