by only considering the branches of reality that are consistent with our knowledge
I know that, in the branch of reality which actually happened, Omega predicted my counterfactual behaviour. I know that my current behaviour is heavily correlated with my counterfactual behaviour. So I know that I can logically cause Omega to give me $10,000. This seems exactly equivalent to Newcomb's problem, where I can also logically cause Omega to give me a lot of money.
So if by "considering [other branches of reality]" you mean "taking predicted counterfactuals into account"...
I don't see why the Counterfactual Prisoner's Dilemma persuades you to pay in the Counterfactual Mugging case. In the counterfactual prisoner's dilemma, I pay because that action logically causes Omega to give me $10,000 in the real world (via influencing the counterfactual). This doesn't require shifting the locus of evaluation to policies, as long as we have a good theory of which actions are correlated with which other actions (e.g. paying in heads-world and paying in tails-world).
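To spell out the payoff logic, here is a toy sketch (the $10,000 figure is from the discussion above; the $100 cost and the exact setup are my assumptions about the standard version of the dilemma):

```python
# Counterfactual Prisoner's Dilemma, as I understand it: Omega flips a fair
# coin, asks you to pay in whichever branch actually occurs, and rewards you
# in that branch iff it predicts you would have paid in the *other* branch.
COST, REWARD = 100, 10_000   # COST is an assumed figure

def branch_payoff(pay_here: bool, pay_other: bool) -> int:
    """Payoff in one branch, given your action here and your (predicted)
    action in the counterfactual branch."""
    return (REWARD if pay_other else 0) - (COST if pay_here else 0)

# A policy is (pay_if_heads, pay_if_tails); expected value averages over
# the fair coin. Paying in both branches "logically causes" the reward.
for heads, tails in [(True, True), (True, False), (False, True), (False, False)]:
    ev = 0.5 * branch_payoff(heads, tails) + 0.5 * branch_payoff(tails, heads)
    print((heads, tails), ev)
# (True, True)   -> 9900.0: paying nets $9,900 whichever way the coin lands
# (False, False) ->    0.0: never paying forgoes the reward entirely
```

On this accounting, paying is worthwhile purely because of the correlation between your actual and counterfactual behaviour, with no need to shift the locus of evaluation to policies.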
In the counterfactual mugging, by contrast, the whole point is that paying...
Thanks for writing this post, Katja; I'm very glad to see more engagement with these arguments. However, I don't think the post addresses my main concern about the original coherence arguments for goal-directedness, which I'd frame as follows:
There's some intuitive conception of goal-directedness, which is worrying in the context of AI. The old coherence arguments implicitly used the concept of EU-maximisation as a way of understanding goal-directedness. But Rohin demonstrated that the most straightforward conception of EU-maximisation (which I'll call behavioural...
I personally found this post valuable and thought-provoking. Sure, there's plenty that it doesn't cover, but it's already pretty long, so that seems perfectly reasonable.
I particularly dislike your criticism of it as strawmanish. Perhaps that would be fair if the analogy between RL and evolution were a standard principle in ML. Instead, it's a vague idea that is often left implicit, or else formulated in idiosyncratic ways. So posts like this one have to do double duty in both outlining and explaining the mainstream viewpoint (often a major task in its own right...
there’s a “solving the problem twice” issue. As mentioned above, in Case 5 we need both the outer and the inner algorithm to be able to do open-ended construction of an ever-better understanding of the world—i.e., we need to solve the core problem of AGI twice with two totally different algorithms! (The first is a human-programmed learning algorithm, perhaps SGD, while the second is an incomprehensible-to-humans learning algorithm. The first stores information in weights, while the second stores information in activations, assuming a GPT-like architecture.)
It seems totally plausible to give AI systems an external memory that they can read from / write to, and then you learn linear algebra without editing weights but with editing memory. Alternatively, you could have a recurrent neural net with a really big hidden state, and then that hidden state could be the equivalent of what you're calling "synapses".
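To make the first option concrete, here's a minimal sketch of the external-memory idea (all names and details are my own illustration, loosely in the spirit of memory-augmented networks such as Neural Turing Machines): the trained weights stay frozen, and new knowledge is stored via attention-style reads and writes to a separate memory matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SLOTS, D = 128, 16                 # memory: 128 slots of 16-dim vectors
memory = np.zeros((N_SLOTS, D))      # external memory, edited at deploy time

W_query = rng.normal(size=(D, D))    # frozen weights, fixed after training
W_write = rng.normal(size=(D, D))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(x, memory):
    """One interaction step: attend over memory, then write the input back."""
    attn = softmax(memory @ (x @ W_query))         # (N_SLOTS,) read weights
    read = attn @ memory                           # content retrieved from memory
    memory = memory + np.outer(attn, x @ W_write)  # store new information
    return read, memory

# "Learning linear algebra without editing weights": repeated exposure to
# facts accumulates them in `memory`, while W_query / W_write never change.
fact = rng.normal(size=D)
_, memory = step(fact, memory)
read, memory = step(fact, memory)    # second query now reads back stored content
```

The point is just that "where new knowledge lives" need not be the weights; a big recurrent hidden state plays the same role, with the update rule learned rather than hand-written.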
I agree with Steve that it seems really weird to have these two parallel systems of knowledge encoding the same types of things. If an AGI learned the skill of speaking English during training, but then learn...
Nice post. The one thing I'm confused about is:
Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).
It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objecti...
Great post, and I'm glad to see the argument outlined in this way. One big disagreement, though:
the Judge box will house a relatively simple algorithm written by humans
I expect that, in this scenario, the Judge box would house a neural network which is still pretty complicated, but which has been trained primarily to recognise patterns, and therefore doesn't need "motivations" of its own.
This doesn't rebut all your arguments for risk, but it does reframe them somewhat. I'd be curious to hear about how likely you think my version of the judge is, and why.
Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?
Broadly speaking, I think our disagreement here is closely related to one we've discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won't pursue this further.
Above you say:
Now, the basic problem: our agent’s utility function is mostly a function of latent variables. ... Those latent variables:
- May not correspond to any particular variables in the AI’s world-model and/or the physical world
- May not be estimated by the agent at all (because lazy evaluation)
- May not be determined by the agent’s observed data
… and of course the agent’s model might just not be very good, in terms of predictive power.
And you also discuss how:
Human "values" are defined within the context of humans' world-models, and don't necessarily make sense...
The question then is, what would it mean for such an AI to pursue our values?
Why isn't the answer just that the AI should:
1. Figure out what concepts we have;
2. Adjust those concepts in ways that we'd reflectively endorse;
3. Use those concepts?
The idea that almost none of the things we care about could be adjusted to fit into a more accurate worldview seems like a very strongly skeptical hypothesis. Tables (or happiness) don't need to be "real in a reductionist sense" for me to want more of them.
I agree with all the things you said. But you defined the pointer problem as: "what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model?" In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concepts in English sentences.
The problem of determining how to construct a feedback signal which refers to those variables...
I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world. I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now.
It seems likely that an AGI will understand very well what I mean when I use English words to describe things, and also what a more intelligent version of me with more coherent concepts would want those words to actually refer to. Why does this not imply that the pointers problem will be solved...
I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just are.
If I were to put my objection another way: I usually interpret "robust" to mean something like "stable under perturbations". But the perturbation of "change the environment, and then see what the new optimal policy is" is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent's inputs, or its state, and seeing whether it still behaved instrumentally.
A more accurate description might be something like "ubiquitous instrumentality"? But this isn't a very aesthetically pleasing name.
Can you elaborate? 'Robust' seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.
The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you're trying to do the former, but because "robust" modifies "instrumentality", the latter is a more natural interpretation.
For example, if I said "life on earth is...
Yepp, this is a good point. I agree that there won't be a sharp distinction, and that ML systems will continue to do online learning throughout deployment. Maybe I should edit the post to point this out. But three reasons why I think the training/deployment distinction is still underrated:
The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won't be fine by default.
I'm happy to wrap up this conversation in general, but it's worth noting before I do that I still strongly disagree with this comment. We've identified a couple of interesting facts about goals, like "unbounded large-scale final goals lead to convergent instrumental goals", but we have nowhere near a good enough understanding of the space of goal-like behaviour to say that everything apart from a "very small region...
I agree with the two questions you've identified as the core issues, although I'd slightly rephrase the former. It's hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I'd rephrase the first option you mention as "feeling pretty confident that something that generalises from 1 week to 1 year won't become misaligned enough to cause disasters". This point seems ...
1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.
2. We've even tried hard to imagine goals that aren't of this sort, and so far we haven't come up with anything. Things that seem promising, like "Place this strawberry on that plate, then do nothing else" actually don't work when you unpack the details.
Okay, this is where we disagree. I think what "unpacking the details" actually gives you is something...
I disagree that we have no good justification for making the "vast majority" claim.
Can you point me to the sources which provide this justification? Your analogy seems to only be relevant conditional on this claim.
My point is that in the context in which the classic arguments appeared, they were useful evidence that updated people in the direction of "Huh, AI could be really dangerous", and people were totally right to update in that direction on the basis of these arguments
They were right to update in that direction, but that doesn't mean that they were right...
Re counterfactual impact: the biggest shift came from talking to Nate at BAGI, after which I wrote this post on disentangling arguments about AI risk, in which I identified the "target loading problem". This seems roughly equivalent to inner alignment, but was meant to avoid the difficulties of defining an "inner optimiser". At some subsequent point I changed my mind and decided it was better to focus on inner optimisers - I think this was probably catalysed by your paper, or by conversations with Vlad which were downstream of the paper. I think the paper ...
Ah, cool; I like the way you express it in the short form! I've been looking into the concept of structuralism in evolutionary biology, which is the belief that evolution is strongly guided by "structural design principles". You might find the analogy interesting.
One quibble: in your comment on my previous post, you distinguished between optimal policies and the policies that we're actually likely to train. But this isn't a component of my distinction - in both cases I'm talking about policies which actually arise from training. My point is that there are...
Saying "vast majority" seems straightfowardly misleading. Bostrom just says "a wide range"; it's a huge leap from there to "vast majority", which we have no good justification for making. In particular, by doing so you're dismissing bounded goals. And if you're talking about a "state of ignorance" about AI, then you have little reason to override the priors we have from previous technological development, like "we build things that do what we want".
On your analogy, see the last part of my reply to Adam below. The process of building things intrinsically pi...
Thanks for the feedback! Some responses:
This looks like off-line training to me. That's not a problem per se, but it also means that you have an implicit hypothesis that the AGI will be model-based; otherwise, it would have trouble adapting its behavior after getting new information.
I don't really know what "model-based" means in the context of AGI. Any sufficiently intelligent system will model the world somehow, even if it's not trained in a way that distinguishes between a "model" and a "policy". (E.g. humans weren't.)
On the other hand, the instrumental...
If you're right about the motivations for the classic theses, then it seems like there's been too big a jump from "other people are wrong" to "arguments for AI risk are right". Establishing the possibility of something is very far from establishing that it's a "default outcome".
A couple of clarifications:
Type 2: Feedback which we use to decide whether to deploy the trained agent.
Let's also include feedback which we can use to decide whether to stop deploying an agent; the central example in my head is an agent which has been deployed for some time before we discover that it's doing bad things.
Relatedly, another argument for type 1 !~ type 2 which seems important to me: type 2 feedback can look at long time horizons, which I expect to be very useful. (Maybe you included this in the cost estimate, but idk how to translate between...
Kinda, but I think both of these approaches are incomplete. In practice finding a definition and studying examples of it need to be interwoven, and you'll have a gradual process where you start with a tentative definition, identify examples and counterexamples, adjust the definition, and so on. And insofar as our examples should focus on things which are actually possible to build (rather than weird thought experiments like Blockhead or the Chinese room) then it seems like what I'm proposing has aspects of both of the approaches you suggest.
My guess is that...
Hmm, okay, I think there's still some sort of disagreement here, but it doesn't seem particularly important. I agree that my distinction doesn't sufficiently capture the middle ground of interpretability analysis (although the intentional stance doesn't make use of that, so I think my argument still applies against it).
Hmmm, it doesn't seem like these two approaches are actually that distinct. Consider: in the forward approach, which intuitions about goal-directedness are you using? If you're only using intuitions about human goal-directedness, then you'll probably miss out on a bunch of important ideas. Whereas if you're using intuitions about extreme cases, like superintelligences, then this is not so different to the backwards approach.
Meanwhile, I agree that the backward approach will fail if we try to find "the fundamental property that the forward approach is trying...
Cool, thanks for the clarifications. To be clear, overall I'm much more sympathetic to the argument as I currently understand it, than when I originally thought you were trying to draw a distinction between "new forms of reasoning honed by trial-and-error" in part 1 (which I interpreted as talking about systems lacking sufficiently good models of the world to find solutions in any other way than trial and error) and "systems that have a detailed understanding of the world" in part 2.
Let me try to sum up the disagreement. The key questions are:
To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?
In most of the cases you've discussed, trying to do tasks over much longer time horizons involves...
I feel like a very natural version of "follow instructions" is "Do things that the instruction-giver would rate highly." (Which is the generalization I'm talking about.) I don't think any of the arguments about "long horizon versions of tasks are different from short versions" tell us anything about which of these generalizations would be learnt (since they are both equally alien over long horizons).
Other versions like "Follow instructions (without regards to what the training process cares about)" seem quite likely to perform significantly worse on ... (read more)
Really, the only issue for our purposes with this definition is that it focuses on how goal-directedness emerges, instead of what it entails for a system. Hence it gives less of a handle to predict the behavior of a system than Dennett’s intentional stance for example.
Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven't observed yet, given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given its...
In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?
If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", because...
We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.
I think this is the key point...
suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?"
I'm less excited about this, and more excited about candidate training processes or candidate paradigms of AI research (for example, solutions to embedded agency). I expect that there will be a large cluster of techniques which produce safe AGIs, we just need to find them - which may be difficult, but hopefully less difficult with Debate involved.
I think I agree with all of this. In fact, this argument is one reason why I think Debate could be valuable, because it will hopefully increase the maximum complexity of arguments that humans can reliably evaluate.
This eventually fails at some point, but hopefully it fails after the point at which we can use Debate to solve alignment in a more scalable way. (I don't have particularly strong intuitions about whether this hope is justified, though.)
If arguments had no meaning but to argue other people into things, if they were being subject only to neutral selection or genetic drift or mere conformism, there really wouldn't be any reason for "the kind of arguments humans can be swayed by" to work to build a spaceship. We'd just end up with some arbitrary set of rules fixed in place.
I agree with this. My position is not that explicit reasoning is arbitrary, but that it developed via an adversarial process where arguers would try to convince listeners of things, and then listeners would try to di...
DP: (sigh...) OK. I'm still never going to design an artificial intelligence to have uncertain observations. It just doesn't seem like something you do on purpose.
What makes you think that having certain observations is possible for an AI?
Oops, yes, this seems correct. I'll edit mine accordingly.
A few things that I found helpful in reading this post:
This gives us a summary something like:
We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past...
This all seems straightforwardly correct, so I've changed the line in question accordingly. Thanks for the correction :)
One caveat: technical work to address #8 currently involves either preventing AGIs from being misaligned in ways that lead them to make threats, or preventing AGIs from being aligned in ways which make them susceptible to threats. The former seems to qualify as an aspect of the "alignment problem", the latter not so much. I should have used the former as an example in my original reply to you, rather than using the latter.
I'd say that each of #5-#8 changes the parts of "AI alignment" that you focus on. For example, you may be confident that your AI system is not optimising against you, without being confident that 1000 copies of your AI system working together won't be optimising against you. Or you might be confident that your AI system won't do anything dangerous in almost all situations, but no longer confident once you realise that threats are adversarially selected to be extreme.
Whether you count these shifts as "moving beyond the standard paradigm" depends, I guess, on...
Cool, glad to hear it. I'd clarify the summary slightly: I think all safety techniques should include at least a rough intuition for why they'll work in the scaled-up version, even when current work on them only applies them to simple AIs. (Perhaps this was implicit in your summary already, I'm not sure.)
One source of our disagreement: I would describe evolution as a type of local search. The difference is that it's local with respect to the parameters of a whole population, rather than an individual agent. So this does introduce some disanalogies, but not particularly significant ones (to my mind). I don't think it would make much difference to my heuristic if we imagined that humans had evolved via gradient descent over our genes instead.
In other words, I like the heuristic of backchaining to local search, and I think of it as a subset of my heuristic. T...
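As a toy illustration of the "evolution as population-level local search" framing (entirely my own construction, not anyone's proposed algorithm): both loops below take small local steps on the same landscape, one via mutation and selection over a population, the other via gradients on an individual's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(params):
    """Toy stand-in for reproductive success: higher is better."""
    return -np.sum(params ** 2)

# Evolution as local search over a *population's* parameters:
# mutate, select, repeat. Each generation only explores a small
# neighbourhood of the current gene pool.
population = rng.normal(size=(32, 10))
for _ in range(100):
    mutants = population + 0.05 * rng.normal(size=population.shape)
    candidates = np.vstack([population, mutants])
    scores = np.array([fitness(p) for p in candidates])
    population = candidates[np.argsort(scores)[-32:]]  # keep the fittest 32

# Gradient descent over an *individual's* parameters: also a local
# search, just guided by explicit gradients instead of mutation
# plus selection.
params = rng.normal(size=10)
for _ in range(100):
    grad = 2 * params            # gradient of the loss -fitness(params)
    params -= 0.05 * grad
```

The disanalogy mentioned above is visible here: the evolutionary step is local with respect to the whole population's parameters, not any single agent's.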
A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.
I think this is useful for framing my core concerns about current safety research:
I think that, because culture is eventually very useful for fitness, you can either think of the problem as evolution not optimising for culture, or evolution optimising for fitness badly. And these are roughly equivalent ways of thinking about it, just different framings. Paul notes this duality in his original post:
If we step back from skills and instead look at outcomes we could say: “Evolution is always optimizing for fitness, and humans have now taken over the world.” On this perspective, I’m making a claim about the limits of evolution. First, evolution...
Hmm, let's see. So the question I'm trying to ask here is: do other species lack proto-culture mainly because of an evolutionary oversight, or because proto-culture is not very useful until you're close to human-level in other respects? In other words, is the discontinuity we've observed mainly because evolution took a weird path through the landscape of possible minds, or because the landscape is inherently quite discontinuous with respect to usefulness? I interpret Paul as claiming the former.
But if the former is true, then we should expect that there are...
So I think Debate is probably the best example of something that makes a lot of sense when applied to humans, to the point where they're doing human experiments on it already.
But this heuristic is actually a reason why I'm pretty pessimistic about most safety research directions.
I don't think that even philosophers take the "genie" terminology very seriously. I think the more general lesson is something like: it's particularly important to spend your weirdness points wisely when you want others to copy you, because they may be less willing to spend weirdness points.
The problem is that principle F elides the difference between facts which are logically caused by your decision, and facts which aren't. For example, in Parfit's hitchhiker...