All of Richard_Ngo's Comments + Replies

Inner Alignment: Explain like I'm 12 Edition

The images in this post seem to be broken.

Rafael Harth (6d): Thanks. It's because directupload often has server issues. I was supposed to rehost all images from my posts to a more reliable host, but apparently forgot this one. I'll fix it in a couple of hours.
Thoughts on gradient hacking

What mechanism would ensure that these two logic pieces only fire at the same time? Whatever it is, I expect that mechanism to be changed in response to failures.

Ofer Givoli (15d): The two pieces of logic can use the same activation values as their input. For example, suppose they both (independently) cause failure if a certain activation value is above some threshold. (In which case each piece of logic "ruins" a different critical activation value.)
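To make the mechanism in Ofer's reply concrete, here is a minimal illustrative sketch (the variable names, threshold, and zeroing behaviour are my own assumptions, not details from the original exchange): both checks read the same activation, so they necessarily fire together, and each one ruins a different critical value, so removing either check alone leaves the failure behaviour intact.

```python
def redundant_failure_checks(shared_activation, critical_a, critical_b, threshold=5.0):
    """Toy sketch of two redundant failure-inducing checks for gradient hacking.

    Both checks read the same activation, so they trigger under exactly the same
    condition; each one zeroes out a different critical activation, so removing
    either check on its own does not stop the model from failing when triggered.
    """
    triggered = shared_activation > threshold
    corrupted_a = 0.0 if triggered else critical_a  # check 1 ruins critical value A
    corrupted_b = 0.0 if triggered else critical_b  # check 2 ruins critical value B
    return corrupted_a, corrupted_b

# When the trigger condition holds, both critical values are ruined at once.
print(redundant_failure_checks(7.0, 1.3, -0.8))  # -> (0.0, 0.0)
```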
Thoughts on gradient hacking

I discuss the possibility of it going in some other direction when I say "The two most salient options to me". But the bit of Evan's post that this contradicts is:

Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there.

Formalizing Objections against Surrogate Goals

Interesting report :) One quibble:

For one, our AIs can only use “things like SPI” if we actually formalize the approach

I don't see why this is the case. If it's possible for humans to start using things like SPI without a formalisation, why couldn't AIs too? (I agree it's more likely that we can get them to do so if we formalise it, though.)

Vojtech Kovarik (2mo): Thanks for pointing this out :-). Indeed, my original formulation is false; I agree with the "more likely to work if we formalise it" formulation.
Frequent arguments about alignment

Whether this is a point for the advocate or the skeptic depends on whether advances in RL from human feedback unlock other alignment work more than they unlock other capabilities work. I think there's room for reasonable disagreement on this question, although I favour the former.

Frequent arguments about alignment

Skeptic: It seems to me that the distinction between "alignment" and "misalignment" has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: "AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)". Now people are using the word in sense 2: "AIs not quite doing what we want them to do". But when our current AIs aren't doing quite what we want them to do, is that mainly evidence that future, more general... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

These aren't complicated or borderline cases; they are central examples of what we are trying to avert with alignment research.

I'm wondering if the disagreement over the centrality of this example is downstream from a disagreement about how easy the "alignment check-ins" that Critch talks about are. If they are the sort of thing that can be done successfully in a couple of days by a single team of humans, then I share Critch's intuition that the system in question starts off only slightly misaligned. By contrast, if they require a significant proportion of ... (read more)

Samuel Dylan Martin (2mo): Perhaps this is a crux in this debate: if you think the 'agent-agnostic perspective' is useful, you also think a relatively steady state of 'AI Safety via Constant Vigilance' is possible. This would be a situation where systems that aren't significantly inner misaligned (otherwise they'd have no incentive to care about governing systems, feedback or other incentives) but are somewhat outer misaligned (so they are honestly and accurately aiming to maximise some complicated measure of profitability or approval, not directly aiming to do what we want them to do) can be kept in check by reducing competitive pressures, building the right institutions and monitoring systems, and ensuring we have a high degree of oversight.

Paul thinks that it's basically always easier to just go in and fix the original cause of the misalignment, while Andrew thinks that there are at least some circumstances where it's more realistic to build better oversight and institutions to reduce said competitive pressures, and the agent-agnostic perspective is useful for the latter of these projects, which is why he endorses it.

I think that this scenario of Safety via Constant Vigilance is worth investigating. I take Paul's later failure story [https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=GvnDcxYxg9QznBobv] to be a counterexample to such a thing being possible, as it's a case where this solution was attempted and worked for a little while before catastrophically failing. This also means that the practical difference between the RAAP 1a-d failure stories and Paul's story just comes down to whether there is an 'out' in the form of safety by vigilance.
Challenge: know everything that the best go bot knows about go

I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).

Formal Inner Alignment, Prospectus

Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?

I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.

I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about fo... (read more)

Abram Demski (5mo): To me, the post as written seems like enough to spell out my optimism... there are multiple directions for formal work which seem under-explored to me. Well, I suppose I didn't focus on explaining why things seem under-explored. Hopefully the writeup-to-come will make that clear.
Formal Inner Alignment, Prospectus

I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise; otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis.

Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisatio... (read more)

I agree with much of this. I over-sold the "absence of negative story" story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, "mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?" -- and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the excep... (read more)

Challenge: know everything that the best go bot knows about go

The trained AlphaZero model knows lots of things about Go, in a comparable way to how a dog knows lots of things about running.

But the algorithm that gives rise to that model can know arbitrarily few things. (After all, the laws of physics gave rise to us, but they know nothing at all.)

DanielFilan (5mo): Ah, understood. I think this is basically covered by talking about what the go bot knows at various points in time, a la this comment [https://www.lesswrong.com/posts/m5frrcYTSH6ENjsc9/challenge-know-everything-that-the-best-go-bot-knows-about?commentId=vXTiM5bJmrFdqHpKR] - it seems pretty sensible to me to talk about knowledge as a property of the actual computation rather than the algorithm as a whole. But from your response there it seems that you think that this sense isn't really well-defined.
Challenge: know everything that the best go bot knows about go

I'd say that this is too simple and programmatic to be usefully described as a mental model. The amount of structure encoded in the computer program you describe is very small, compared with the amount of structure encoded in the neural networks themselves. (I agree that you can have arbitrarily simple models of very simple phenomena, but those aren't the types of models I'm interested in here. I care about models which have some level of flexibility and generality, otherwise you can come up with dumb counterexamples like rocks "knowing" the laws of physic... (read more)

Agency in Conway’s Game of Life

I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs

There's at least one important difference: some of these are intelligent, and some of these aren't.

It does seem plausible that the category boundary you're describing is an interesting one. But when you indicate in your comment below that you see the "AI hypothesis" and the "life hypothesis" as very similar, that mainly suggests that you're using a highly nonstandard definition of AI, which I expect will lead to confusion.

Alex Flint (5mo): Well surely if I built a robot that was able to gather resources and reproduce itself as effectively as either a bacterium or a tree, I would be entirely justified in calling it an "AI". I would certainly have no problem using that terminology for such a construction at any mainstream robotics conference, even if it performed no useful function beyond self-reproduction. Of course we wouldn't call an actual tree or an actual bacterium an "AI" because they are not artificial.
Agency in Conway’s Game of Life

It feels like this post pulls a sleight of hand. You suggest that it's hard to solve the control problem because of the randomness of the starting conditions. But this is exactly the reason why it's also difficult to construct an AI with a stable implementation. If you can do the latter, then you can probably also create a much simpler system which creates the smiley face.

Similarly, in the real world, there's a lot of randomness which makes it hard to carry out tasks. But there are a huge number of strategies for achieving things in the world which don't r... (read more)

Alex Flint (5mo): Well yes, I do think that trees and bacteria exhibit this phenomenon of starting out small and growing in impact. The scope of their impact is limited in our universe by the spatial separation between planets, and by the presence of even more powerful world-reshapers in their vicinity, such as humans. But on this view of "which entities are reshaping the whole cosmos around here?", I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs. I do think there is a fundamental difference in kind between those entities and rocks, armchairs, microwave ovens, the Opportunity Mars rover, and current Waymo autonomous cars, since these objects just don't have this property of starting out small and eventually reshaping the matter and energy in large regions.

(Surely it's not that it's difficult to build an AI inside Life because of the randomness of the starting conditions -- it's difficult to build an AI inside Life because writing full-AGI software is a difficult design problem, right?)
AprilSR (5mo): I think the stuff about the supernovas addresses this: a central point is that the "AI" must be capable of generating an arbitrary world state within some bounds.
Challenge: know everything that the best go bot knows about go

The human knows the rules and the win condition. The optimisation algorithm doesn't, for the same reason that evolution doesn't "know" what dying is: neither is the kind of entity to which you should ascribe knowledge.

DanielFilan (5mo): Suppose you have a computer program that gets two neural networks, simulates a game of go between them, determines the winner, and uses the outcome to modify the neural networks. It seems to me that this program has a model of the 'go world', i.e. a simulator, and from that model you can fairly easily extract the rules and winning condition. Do you think that this is a model but not a mental model, or that it's too exact to count as a model, or something else?
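For concreteness, here is a minimal sketch of the kind of outer program Daniel describes (every name and the toy "game" below are illustrative assumptions of mine): the rules and win condition live inside the simulator component of the training loop, which is what makes it tempting to say they can be extracted from it.

```python
import random

def self_play_step(net_a, net_b, simulate_game, update):
    """One iteration of the loop described above: simulate a game between two
    networks, determine the winner, and use the outcome to modify both networks.
    Whatever encodes the rules and the win condition lives in `simulate_game`."""
    winner = simulate_game(net_a, net_b)
    update(net_a, reward=1.0 if winner == "a" else -1.0)
    update(net_b, reward=1.0 if winner == "b" else -1.0)
    return winner

# Stub components, purely for illustration (a real simulator would enforce the
# rules of go and score the final position):
def simulate_game(a, b):
    return "a" if random.random() < a["skill"] / (a["skill"] + b["skill"]) else "b"

def update(net, reward):
    net["skill"] = max(0.01, net["skill"] + 0.05 * reward)

net_a, net_b = {"skill": 1.0}, {"skill": 1.0}
for _ in range(100):
    self_play_step(net_a, net_b, simulate_game, update)
```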
Challenge: know everything that the best go bot knows about go

it's not obvious to me that this is a realistic target

Perhaps I should instead have said: it'd be good to explain to people why this might be a useful/realistic target. Because if you need propositions that cover all the instincts, then it seems like you're basically asking for people to revive GOFAI.

(I'm being unusually critical of your post because it seems that a number of safety research agendas lately have become very reliant on highly optimistic expectations about progress on interpretability, so I want to make sure that people are forced to defend that assumption rather than starting an information cascade.)

DanielFilan (5mo): OK, the parenthetical helped me understand where you're coming from. I think a re-write of this post should (in part) make clear that I think a massive heroic effort would be necessary to make this happen, but sometimes massive heroic efforts work, and I have no special private info that makes it seem more plausible than it looks a priori.
Challenge: know everything that the best go bot knows about go

As an additional reason for the importance of tabooing "know", note that I disagree with all three of your claims about what the model "knows" in this comment and its parent.

(The definition of "know" I'm using is something like "knowing X means possessing a mental model which corresponds fairly well to reality, from which X can be fairly easily extracted".)

DanielFilan (5mo): In the parent, is your objection that the trained AlphaZero-like model plausibly knows nothing at all?
DanielFilan (5mo): On that definition, how does one train an AlphaZero-like algorithm without knowing the rules of the game and win condition?
Challenge: know everything that the best go bot knows about go

I think you've now pushed the word "know" to a point where it's not very well-defined; I'd encourage you to try to restate the original post while tabooing that word.

This seems particularly valuable because there are some versions of "know" for which the goal of knowing everything a complex model knows seems wildly unmanageable (for example, trying to convert a human athlete's ingrained instincts into a set of propositions). So before people start trying to do what you suggested, it'd be good to explain why it's actually a realistic target.

DanielFilan (5mo): Hmmm. It does seem like I should probably rewrite this post. But to clarify things in the meantime:
  • it's not obvious to me that this is a realistic target, and I'd be surprised if it took fewer than 10 person-years to achieve.
  • I do think the knowledge should 'cover' all the athlete's ingrained instincts in your example, but I think the propositions are allowed to look like "it's a good idea to do x in case y".
Gradations of Inner Alignment Obstacles

I used to define "agent" as "both a searcher and a controller"

Oh, I really like this definition. Even if it's too restrictive, it seems like it gets at something important.

I'm not sure what you meant by "more compressed".

Sorry, that was quite opaque. I guess what I mean is that evolution is an optimiser but isn't an agent, and in part this has to do with how it's a very distributed process with no clear boundary around it. Whereas when you have the same problem being solved in a single human brain, then that compression makes it easier to point to the huma... (read more)

Gradations of Inner Alignment Obstacles

To me it sounds like you're describing (some version of) agency, and so the most natural term to use would be mesa-agent.

I'm a bit confused about the relationship between "optimiser" and "agent", but I tend to think of the latter as more compressed, and so insofar as we're talking about policies it seems like "agent" is appropriate. Also, mesa-optimiser is taken already (under a definition which assumes that optimisation is equivalent to some kind of internal search).

Abram Demski (6mo): I'm not sure what you meant by "more compressed". I used to define "agent" as "both a searcher and a controller", i.e. something which uses an internal selection/search of some kind to accomplish an external control task. This might be too restrictive, though.
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

What's your specific critique of this? I think it's an interesting and insightful point.

Alex Turner (6mo): LeCun claims too much. It's true that the case of animals like orangutans points to a class of cognitive architectures which seemingly don't prioritize power-seeking. It's true that this is some evidence against power-seeking behavior being common amongst relevant cognitive architectures. However, it doesn't show that instrumental subgoals are much weaker drives of behavior than hardwired objectives.

One reading of this "drives of behavior" claim is that it has to be tautological; by definition, instrumental subgoals are always in service of the (hardwired) objective. I assume that LeCun is instead discussing "all else equal, will statistical instrumental tendencies ('instrumental convergence') be more predictive of AI behavior than its specific objective function?".

But "instrumental subgoals are much weaker drives of behavior than hardwired objectives" is not the only possible explanation of "the lack of domination behavior in non-social animals"! Maybe the orangutans aren't robust to scale. Maybe orangutans do implement non-power-seeking cognition, but maybe their cognitive architecture will be hard or unlikely for us to reproduce in a machine; maybe the distribution of TAI cognitive architectures we should expect is far different from what orangutans are like.

I do agree that there's a very good point in the neighborhood of the quoted argument. My steelman of this would be: (This is loose for a different reason, in that it presupposes a single relevant axis of variation between humans and orangutans. Is a personal computer more like a human, or more like an orangutan? But set that aside for the moment.)

I think he's overselling the evidence. However, on reflection, I wouldn't pick out the point for such strong ridicule.
Coherence arguments imply a force for goal-directed behavior

My internal model of you is that you believe this approach would not be enough because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn't so much have to be defined on these internal concepts itself as to rely on some assumption about these internal concepts.

Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn... (read more)

Adam Shimi (6mo): My first intuition is that I expect mapping internal concepts to mathematical formalisms to be easier when the end goal is deconfusion and making sense of behaviors, compared to actually improving capabilities. But I'd have to think about it some more. Thanks at least for an interesting test to try to apply to my attempt.

Okay, do you mean that you agree with my paragraph but what you are really arguing about is that utility functions don't care about the low-level/internals of the system, and that's why they're bad abstractions? (That's how I understand your liver and health points example).
Coherence arguments imply a force for goal-directed behavior

Wouldn't these coherence arguments be pretty awesome? Wouldn't this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?

Insofar as such a system could practically help doctors prioritise, that would be great. (This seems analogous to how utilities are used in economics.)

But if doctors use this concept to figure out how to treat patients, or use it when designing prostheses for their patients, then I expect things to go badly. If you take HP as a guiding principle - for exampl... (read more)

Adam Shimi (6mo): I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are bad abstractions, but that they are often used as if every utility function is as meaningful as any other. Here the meaningfulness comes from thinking about cognition and what following such a utility function would entail. There's a pretty intuitive sense in which a utility function that encodes exactly a trajectory and nothing else, for a complex enough setting, doesn't look like a goal.

A difference between us, I think, is that I expect that we can add structure that restricts the set of utility functions we consider (structure that comes from thinking among other things about cognition) such that maximizing the expected utility for such a constrained utility function would actually capture most if not all of the aspects of goal-directedness that matter to us.

My internal model of you is that you believe this approach would not be enough because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn't so much have to be defined on these internal concepts itself as to rely on some assumption about these internal concepts. So either adapting the state space and action space, or going for fixed spaces but mapping/equivalence classes/metrics on them that encode the relevant assumptions about cognition.
Gradations of Inner Alignment Obstacles

Do you think that's a problem?

I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.

I guess the mesa- prefix helps point towards the fact that we're talking about policies, not policies + optimisers.

Probably my preferred terminology would be:

  • Instead of mesa-controller, "competent policy".
  • And then we can say that competent policies sometimes implement search or learning (or both, or possibly neither).
  • And when
... (read more)
Abram Demski (6mo): Thinking about this more, I think maybe what I really want it to mean is: competent policies which are non-myopic in some sense. A truly myopic Q&A system doesn't feel much like a controller / inner optimizer (even if it is misaligned, it's not steering the world in a bad direction, because it's totally myopic). I'm not sure what sense of "myopia" I want to use, though.
Gradations of Inner Alignment Obstacles

Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.

I'm confused about what wouldn't qualify as a mesa-controller. In practice, is this not synonymous with "capable"?

Also, why include "misaligned" in this definition? If mesa-controller turns out to be a useful concept, then I'd want to talk about both aligned and misaligned mesa-controllers.

Abram Demski (6mo): Right, agreed, I'll consider editing. Do you think that's a problem?
The Counterfactual Prisoner's Dilemma

Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

The problem is that principle F elides the difference between facts which are logically caused by your decision, and facts which aren't. For example, in Parfit's hi... (read more)

1Chris_Leong6mo"The problem is that principle F elides" - Yeah, I was noting that principle F doesn't actually get us there and I'd have to assume a principle of independence as well. I'm still trying to think that through.
The Counterfactual Prisoner's Dilemma

by only considering the branches of reality that are consistent with our knowledge

I know that, in the branch of reality which actually happened, Omega predicted my counterfactual behaviour. I know that my current behaviour is heavily correlated with my counterfactual behaviour. So I know that I can logically cause Omega to give me $10,000. This seems exactly equivalent to Newcomb's problem, where I can also logically cause Omega to give me a lot of money.

So if by "considering [other branches of reality]" you mean "taking predicted counterfactuals into acco... (read more)

Chris_Leong (6mo): Hmm... that's a fascinating argument. I've been having trouble figuring out how to respond to you, so I'm thinking that I need to make my argument more precise and then perhaps that'll help us understand the situation.

Let's start from the objection I've heard against Counterfactual Mugging. Someone might say, well I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

Now let's consider Counterfactual Prisoner's Dilemma. If the coin comes up HEADS, then principle F tells us that the counterfactuals need to have the COIN coming up HEADS as well. However, it doesn't tell us how to handle the impact of the agent's policy if they had seen TAILS. I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked.

You justify your construction by noting that the agent can figure out that it will make the same decision in both the HEADS and TAILS case. In contrast, my tendency is to exclude information about our decision-making procedures. So, if you knew you were a utility maximiser, this would typically exclude all but one counterfactual and prevent us saying choice A is better than choice B. Similarly, my tendency here is to suggest that we should be erasing the agent's self-knowledge of how it decides so that we can imagine the possibility of the agent choosing PAY/NOT PAY or NOT PAY/PAY. But I still feel somewhat confused about this situation.
The Counterfactual Prisoner's Dilemma

I don't see why the Counterfactual Prisoner's Dilemma persuades you to pay in the Counterfactual Mugging case. In the counterfactual prisoner's dilemma, I pay because that action logically causes Omega to give me $10,000 in the real world (via influencing the counterfactual). This doesn't require shifting the locus of evaluation to policies, as long as we have a good theory of which actions are correlated with which other actions (e.g. paying in heads-world and paying in tails-world).

In the counterfactual mugging, by contrast, the whole point is that payin... (read more)
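To spell out the "logically causes" reasoning above as a quick calculation (a sketch under my own reading of the setup: the payment costs some amount c, the prize is the R = $10,000 mentioned in this thread, and Omega's prediction of the counterfactual action is accurate):

\[
u_{\text{heads}}(a_H, a_T) = R\,\mathbf{1}[a_T = \text{PAY}] - c\,\mathbf{1}[a_H = \text{PAY}],
\qquad
u_{\text{tails}}(a_H, a_T) = R\,\mathbf{1}[a_H = \text{PAY}] - c\,\mathbf{1}[a_T = \text{PAY}].
\]

If the action taken in the actual branch is perfectly correlated with the counterfactual action (so a_H = a_T), then the only attainable outcomes are R - c (always pay) or 0 (never pay), whichever way the coin lands; so paying wins whenever R > c, without ever shifting the locus of evaluation from actions to policies.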

Chris_Leong (7mo): You're correct that paying in Counterfactual Prisoner's Dilemma doesn't necessarily commit you to paying in Counterfactual Mugging. However, it does appear to provide a counter-example to the claim that we ought to adopt the principle of making decisions by only considering the branches of reality that are consistent with our knowledge, as this would result in us refusing to pay in Counterfactual Prisoner's Dilemma regardless of the coin flip result. (Interestingly enough, blackmail problems seem to demonstrate that this principle is flawed as well.) This seems to suggest that we need to consider policies rather than completely separate decisions for each possible branch of reality. And while, as I already noted, this doesn't get us all the way, it does make the argument for paying much more compelling by defeating the strongest objection.
Coherence arguments imply a force for goal-directed behavior

Thanks for writing this post, Katja; I'm very glad to see more engagement with these arguments. However, I don't think the post addresses my main concern about the original coherence arguments for goal-directedness, which I'd frame as follows:

There's some intuitive conception of goal-directedness, which is worrying in the context of AI. The old coherence arguments implicitly used the concept of EU-maximisation as a way of understanding goal-directedness. But Rohin demonstrated that the most straightforward conception of EU-maximisation (which I'll call beh... (read more)

Daniel Kokotajlo (6mo): I love your health points analogy. Extending it, imagine that someone came up with "coherence arguments" that showed that for a rational doctor doing triage on patients, and/or for a group deciding who should do a risky thing that might result in damage, the optimal strategy involves a construct called "health points" such that:
  • Each person at any given time has some number of health points.
  • Whenever someone reaches 0 health points, they (very probably) die.
  • Similar afflictions/disasters tend to cause similar amounts of decrease in health points, e.g. a bullet in the thigh causes me to lose 5 hp and you to lose 5 hp and Katja to lose 5 hp.
Wouldn't these coherence arguments be pretty awesome? Wouldn't this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation? This is so despite the fact that someone could come along and say "Well, these coherence arguments assume a concept (our intuitive concept) of 'damage'; they don't tell us what 'damage' means. (Ditto for concepts like 'die' and 'person' and 'similar'.)" That would be true, and it would still be a good idea to do further deconfusion research along those lines, but it wouldn't detract much from the epistemic victory the coherence arguments won.
Against evolution as an analogy for how humans will create AGI

I personally found this post valuable and thought-provoking. Sure, there's plenty that it doesn't cover, but it's already pretty long, so that seems perfectly reasonable.

I particularly dislike your criticism of it as strawmanish. Perhaps that would be fair if the analogy between RL and evolution were a standard principle in ML. Instead, it's a vague idea that is often left implicit, or else formulated in idiosyncratic ways. So posts like this one have to do double duty in both outlining and explaining the mainstream viewpoint (often a major task in its o... (read more)

Against evolution as an analogy for how humans will create AGI

there’s a “solving the problem twice” issue. As mentioned above, in Case 5 we need both the outer and the inner algorithm to be able to do open-ended construction of an ever-better understanding of the world—i.e., we need to solve the core problem of AGI twice with two totally different algorithms! (The first is a human-programmed learning algorithm, perhaps SGD, while the second is an incomprehensible-to-humans learning algorithm. The first stores information in weights, while the second stores information in activations, assuming a GPT-like architecture.)

... (read more)
Steve Byrnes (7mo): Thanks for cross-posting this! Sorry I didn't get around to responding originally. :-)

For what it's worth, I figure that the neocortex has some number (dozens to hundreds, maybe 180 like your link says, I dunno) of subregions that do a task vaguely like "predict data X from context Y", with different X & Y & hyperparameters in different subregions. So some design work is obviously required to make those connections. (Some taste of what that might look like in more detail is maybe Randall O'Reilly's vision-learning model [https://arxiv.org/abs/1709.04654].) I figure this is vaguely analogous to figuring out what convolution kernel sizes and strides you need in a ConvNet, and that specifying all this is maybe hundreds or low thousands but not millions of bits of information. (I don't really know right now, I'm just guessing.)

Where will those bits of information come from? I figure, some combination of:
  • automated neural architecture search
  • and/or people looking at the neuroanatomy literature and trying to copy ideas
  • and/or when the working principles of the algorithm are better understood, maybe people can just guess what architectures are reasonable, just like somebody invented U-Nets [https://en.wikipedia.org/wiki/U-Net] by presumably just sitting and thinking about what's a reasonable architecture for image segmentation, followed by some trial-and-error tweaking.
  • and/or some kind of dynamic architecture that searches for learnable relationships and makes those connections on the fly … I imagine a computer would be able to do that to a much greater extent than a brain (where signals travel slowly, new long-range high-bandwidth connections are expensive, etc.)

If I understand your comment correctly, we might actually agree on the plausibility of the brute force "automated neural architecture search" / meta-learning case. …Except for the terminology! I'm not calling it "evolution analogy" because the final learning algorithm is mai
Against evolution as an analogy for how humans will create AGI

It seems totally plausible to give AI systems an external memory that they can write to / read from, and then you learn linear algebra without editing weights but with editing memory. Alternatively, you could have a recurrent neural net with a really big hidden state, and then that hidden state could be the equivalent of what you're calling "synapses".

I agree with Steve that it seems really weird to have these two parallel systems of knowledge encoding the same types of things. If an AGI learned the skill of speaking English during training, but then learn... (read more)
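As a toy sketch of the "two parallel systems of knowledge" at issue here (the architecture, sizes, and retrieval rule below are my own illustrative assumptions, not anything proposed in the thread): general skills sit in the weights, while deployment-time knowledge gets written into an external memory with no weight updates, which is exactly the kind of split that seems strange to maintain indefinitely.

```python
import torch
import torch.nn as nn

class MemoryAugmentedAgent(nn.Module):
    """Toy sketch of two parallel knowledge stores: weights vs. external memory."""

    def __init__(self, d_in: int = 32, d_hidden: int = 64, memory_slots: int = 128):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)  # knowledge acquired during training lives here
        self.register_buffer("memory", torch.zeros(memory_slots, d_hidden))  # deployment-time knowledge
        self.ptr = 0

    def write(self, observation: torch.Tensor) -> None:
        # Deployment-time "learning": store an encoded observation; weights stay untouched.
        with torch.no_grad():
            self.memory[self.ptr % self.memory.shape[0]] = self.encoder(observation)
        self.ptr += 1

    def read(self, query: torch.Tensor) -> torch.Tensor:
        # Retrieve the stored entry most similar to the encoded query.
        with torch.no_grad():
            scores = self.memory @ self.encoder(query)
        return self.memory[scores.argmax()]

agent = MemoryAugmentedAgent()
agent.write(torch.randn(32))
recalled = agent.read(torch.randn(32))
```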

Rohin Shah (7mo): Idk, this just sounds plausible to me. I think the hope is that the weights encode more general reasoning abilities, and most of the "facts" or "background knowledge" gets moved into memory, but that won't happen for everything and plausibly there will be this strange separation between the two. But like, sure, that doesn't seem crazy. I do expect we reconsolidate into weights through some outer algorithm like gradient descent (and that may not require any human input). If you want to count that as "autonomously editing its weights", then fine, though I'm not sure how this influences any downstream disagreement.

Similar dynamics in humans:
  1. Children are apparently better at learning languages than adults; it seems like adults are using some different process to learn languages (though probably not as different as editing memory vs. editing weights)
  2. One theory of sleep is that it is consolidating the experiences of the day into synapses, suggesting that any within-day learning is not relying as much on editing synapses.

Tbc, I also think explicitly meta-learned update rules are plausible -- don't take any of this as "I think this is definitely going to happen" but more as "I don't see a reason why this couldn't happen".

Fwiw I've mostly been ignoring the point of whether or not evolution is a good analogy. If you want to discuss that, I want to know what specifically you use the analogy for. For example:
  1. I think evolution is a good analogy for how inner alignment issues can arise.
  2. I don't think evolution is a good analogy for the process by which AGI is made (if you think that the analogy is that we literally use natural selection to improve AI systems).

It seems like Steve is arguing the second, and I probably agree (depending on what exactly he means, which I'm still not super clear on).
The case for aligning narrowly superhuman models

Nice post. The one thing I'm confused about is:

Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objecti... (read more)

Rohin Shah (7mo): It's important to distinguish between:
  • "We (Open Phil) are not sure whether we want to actively push this in the world at large, e.g. by running a grant round and publicizing it to a bunch of ML people who may or may not be aligned with us"
  • "We (Open Phil) are not sure whether we would fund a person who seems smart, is generally aligned with us, and thinks that the best thing to do is reward modeling work"
My guess is that Ajeya means the former but you're interpreting it as the latter, though I could easily be wrong about either of those claims.
Ajeya Cotra (7mo): We're simply not sure where "proactively pushing to make more of this type of research happen" should rank relative to other ways we could spend our time and money right now, and determining that will involve thinking about a lot of things that are not covered in this post (most importantly what the other opportunities are for our time and money).

It might be a standard way to make progress, but I don't feel that this work has been the default so far — the other three types of research I laid out seem to have absorbed significantly more researcher-hours and dollars among people concerned with long-term AI risk reduction. (It's possible that human feedback is more common among people motivated by profit, but I doubt that because it doesn't seem that profitable yet.) Also, if we use a stricter definition of "narrowly superhuman" (i.e. the model should be capable of outperforming the evaluations — not just the demonstrations — of the humans training it), I'd argue that there hasn't been any work published on that so far.
Book review: "A Thousand Brains" by Jeff Hawkins

Great post, and I'm glad to see the argument outlined in this way. One big disagreement, though:

the Judge box will house a relatively simple algorithm written by humans

I expect that, in this scenario, the Judge box would house a neural network which is still pretty complicated, but which has been trained primarily to recognise patterns, and therefore doesn't need "motivations" of its own.

This doesn't rebut all your arguments for risk, but it does reframe them somewhat. I'd be curious to hear about how likely you think my version of the judge is, and why.

Steve Byrnes (7mo): Oh, I'm very open-minded. I was writing that section for an audience of non-AGI-safety-experts and didn't want to make things over-complicated by working through the full range of possible solutions to the problem; I just wanted to say enough to convince readers that there is a problem here, and it's not trivial.

The Judge box (usually I call it the "steering subsystem" [https://www.lesswrong.com/posts/SJXujr5a2NcoFebr4/mesa-optimizers-vs-steered-optimizers]) can be anything. There could even be a tower of AGIs steering AGIs, IDA [https://www.lesswrong.com/tag/iterated-amplification]-style, but I don't know the details, like what you would put at the base of the tower. I haven't really thought about it. Or it could be a deep neural net classifier. (How do you train it? "Here's 5000 examples of corrigibility, here's 5000 examples of incorrigibility"?? Or what? Beats me...)

In this post [https://www.lesswrong.com/posts/wcNEXDHowiWkRxDNv/inner-alignment-in-salt-starved-rats] I proposed that the amygdala houses a supervised learning algorithm which does a sorta "interpretability" thing where it tries to decode the latent variables inside the neocortex, and then those signals are inputs to the reward calculation. I don't see how that kind of mechanism would apply to more complicated goals, and I'm not sure how robust it is. Anyway, yeah, could be anything, I'm open-minded.
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?

Broadly speaking, I think our disagreement here is closely related to one we've discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won't pursue this further.

johnswentworth (8mo): Yeah, I wouldn't even include reward tampering. Outer alignment, as I think about it, is mostly the pointer problem, and the (values) pointer problem is a subset of outer alignment. (Though e.g. Evan would define it differently.)
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Above you say:

Now, the basic problem: our agent’s utility function is mostly a function of latent variables. ... Those latent variables:

  • May not correspond to any particular variables in the AI’s world-model and/or the physical world
  • May not be estimated by the agent at all (because lazy evaluation)
  • May not be determined by the agent’s observed data

… and of course the agent’s model might just not be very good, in terms of predictive power.

And you also discuss how:

Human "values" are defined within the context of humans' world-models, and don't necessarily make

... (read more)
johnswentworth (8mo): Ok, a few things here...

The post did emphasize, in many places, that there may not be any real-world thing corresponding to a human concept, and therefore constructing a pointer is presumably impossible. But the "thing may not exist" problem is only one potential blocker to constructing a pointer. Just because there exists some real-world structure corresponding to a human concept, or an AI has some structure corresponding to a human concept, does not mean we have a pointer. It just means that it should be possible, in principle, to create a pointer. So, the concept-existence problem is a strict subset of the pointer problem.

Second, there are definitely parts of a whole alignment scheme which are not the pointer problem. For instance, inner alignment, decision theory shenanigans (e.g. commitment races), and corrigibility are all orthogonal to the pointers problem (or at least the pointers-to-values problem). Constructing a feedback signal which rewards the thing we want is not the same as building an AI which does the thing we want.

Third, and most important... The key point is that all these examples involve solving an essentially-similar pointer problem. In example A, we need to ensure that our English-language commands are sufficient to specify everything we care about which the alien wouldn't guess on its own; that's the part which is a pointer problem. In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that's the part which is a pointer problem. In example C, we need to identify which of its neurons correspond to the concepts we want, and ensure that the correspondence is robust; that's the part which is a pointer problem. Example D is essentially the same as B, with weaker implicit priors. The essence of each of these is "make sure we actually point to the thing we want, and not to anything else". That's the part which is a pointer prob
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

The question then is, what would it mean for such an AI to pursue our values?

Why isn't the answer just that the AI should:
1. Figure out what concepts we have;
2. Adjust those concepts in ways that we'd reflectively endorse;
3. Use those concepts?

The idea that almost none of the things we care about could be adjusted to fit into a more accurate worldview seems like a very strongly skeptical hypothesis. Tables (or happiness) don't need to be "real in a reductionist sense" for me to want more of them.

Abram Demski (8mo): Agreed. The problem is with AI designs which don't do that.

It seems to me like this perspective is quite rare. For example, my post Policy Alignment [https://www.lesswrong.com/posts/TeYro2ntqHNyQFx8r/policy-alignment] was about something similar to this, but I got a ton of pushback in the comments -- it seems to me like a lot of people really think the AI should use better AI concepts, not human concepts. At least they did back in 2018. As you mention, this is partly due to overly reductionist world-views. If tables/happiness aren't reductively real, the fact that the AI is using those concepts is evidence that it's dumb/insane, right? Illustrative excerpt from a comment [https://www.lesswrong.com/posts/TeYro2ntqHNyQFx8r/policy-approval?commentId=7XT99Rv6DkgvXyczR] there:

Probably most of the problem was that my post didn't frame things that well -- I was mainly talking in terms of "beliefs", rather than emphasizing ontology, which makes it easy to imagine AI beliefs are about the same concepts but just more accurate. John's description of the pointers problem might be enough to re-frame things to the point where "you need to start from human concepts, and improve them in ways humans endorse" is bordering on obvious. (Plus I arguably was too focused on giving a specific mathematical proposal rather than the general idea.)
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I agree with all the things you said. But you defined the pointer problem as: "what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent's world-model?" In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in English sentences.

The problem of determining how to construct a feedback signal which refers to those variabl... (read more)

Adam Shimi (8mo): But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use English words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the concrete implementation details to capture the most efficient way of doing fusion; whereas my internal model of "fusion power generators" has a more concrete form and includes safety guidelines.

In general, I don't see why we should expect the abstraction most relevant for the AGI to be the one we're using. Maybe it uses the same words for something quite different, like how successive paradigms in physics use the same word (electricity, gravity) to talk about different things (at least in their connotations and underlying explanations). (That makes me think that it might be interesting to see how Kuhn's arguments about such incomparability of paradigms hold in the context of this problem, as this seems similar.)
johnswentworth (8mo): The problem is with what you mean by "find". If by "find" you mean "there exist some variables in the AI's world model which correspond directly to the things you mean by some English sentence", then yes, you've argued that. But it's not enough for there to exist some variables in the AI's world-model which correspond to the things we mean. We have to either know which variables those are, or have some other way of "pointing to them" in order to get the AI to actually do what we're saying.

An AI may understand what I mean, in the sense that it has some internal variables corresponding to what I mean, but I still need to know which variables those are (or some way to point to them) and how "what I mean" is represented in order to construct a feedback signal. That's what I mean by "finding" the variables. It's not enough that they exist; we (the humans, not the AI) need some way to point to which specific functions/variables they are, in order to get the AI to do what we mean.
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world. I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now.

It seems likely that an AGI will understand very well what I mean when I use English words to describe things, and also what a more intelligent version of me with more coherent concepts would want those words to actually refer to. Why does this not imply that the pointers problem will be solve... (read more)

johnswentworth (8mo): The AI knowing what I mean isn't sufficient here. I need the AI to do what I mean, which means I need to program it/train it to do what I mean. The program or feedback signal needs to be pointed at what I mean, not just whatever English-language input I give.

For instance, if an AI is trained to maximize how often I push a particular button, and I say "I'll push the button if you design a fusion power generator for me", it may know exactly what I mean and what I intend. But it will still be perfectly happy to give me a design with some unintended side effects [https://www.lesswrong.com/posts/2NaAhMPGub8F2Pbr7/the-fusion-power-generator-scenario] which I'm unlikely to notice until after pushing the button.
Distinguishing claims about training vs deployment

I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just are.

If I were to put my objection another way: I usually interpret "robust" to mean something like "stable under perturbations". But the perturbation of "change the environment, and then see what the new optimal policy is" is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent's inputs, or its state, and seeing whether it still behaved instrumentally.

A more accurate description might be something like "ubiquitous instrumentality"? But this isn't a very aesthetically pleasing name.

Alex Turner (8mo): I'd considered 'attractive instrumentality' a few days ago, to convey the idea that certain kinds of subgoals are attractor points during plan formulation, but the usual reading of 'attractive' isn't 'having attractor-like properties.'
Alex Turner (8mo): Ah. To clarify, I was referring to holding an environment fixed, and then considering whether, at a given state, an action has a high probability of being optimal across reward functions. I think it makes sense to call those actions 'robustly instrumental.'
Distinguishing claims about training vs deployment

Can you elaborate? 'Robust' seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.

The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you're trying to do the former, but because "robust" modifies "instrumentality", the latter is a more natural interpretation.

For example, if I said "life on earth is... (read more)

Alex Turner (8mo): One possibility is that we have to individuate these "instrumental convergence"-adjacent theses using different terminology. I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just are.

However, it doesn't make sense to say the same for conjectures about how training such-and-such a system tends to induce property Y, for the reasons you mention. In particular, if property Y is not about goal-directed behavior, then it no longer makes sense to talk about 'instrumentality' from the system's perspective. E.g. I'm not sure it makes sense to say 'edge detectors are robustly instrumental for this network structure on this dataset after X epochs'.

(These are early thoughts; I wanted to get them out, and may revise them later or add another comment.)

EDIT: In the context of MDPs, however, I prefer to talk in terms of (formal) POWER and of optimality probability, instead of in terms of robust instrumentality. I find 'robust instrumentality' to be better as an informal handle, but its formal operationalization seems better for precise thinking.
Distinguishing claims about training vs deployment

Yepp, this is a good point. I agree that there won't be a sharp distinction, and that ML systems will continue to do online learning throughout deployment. Maybe I should edit the post to point this out. But three reasons why I think the training/deployment distinction is still underrated:

  1. In addition to the clarifications from this post, I think there are a bunch of other concepts (in particular recursive self-improvement and reward hacking) which weren't originally conceived in the context of modern ML, but which it's very important to understand in the c
... (read more)
Distinguishing claims about training vs deployment

The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won't be fine by default.

I'm happy to wrap up this conversation in general, but it's worth noting before I do that I still strongly disagree with this comment. We've identified a couple of interesting facts about goals, like "unbounded large-scale final goals lead to convergent instrumental goals", but we have nowhere near a good enough understanding of the space of goal-like behaviour to say that everything apart from a "very small reg... (read more)

Daniel Kokotajlo (8mo): OK, interesting. I agree this is a double crux. For reasons I've explained above, it doesn't seem like circular reasoning to me; it doesn't seem like I'm assuming that goals are by default unbounded and consequentialist, etc. But maybe I am. I haven't thought about this as much as you have, and my views on the topic have been crystallizing throughout this conversation, so I admit there's a good chance I'm wrong and you are right. Perhaps I/we will return to it one day, but for now, thanks again and goodbye!
Some thoughts on risks from narrow, non-agentic AI

I agree with the two questions you've identified as the core issues, although I'd slightly rephrase the former. It's hard to think about something being aligned indefinitely. But it seems like, if we have primarily used a given system for carrying out individual tasks, it would take quite a lot of misalignment for it to carry out a systematic plan to deceive us. So I'd rephrase the first option you mention as "feeling pretty confident that something that generalises from 1 week to 1 year won't become misaligned enough to cause disasters". This point seems ... (read more)

Distinguishing claims about training vs deployment

1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.

2. We've even tried hard to imagine goals that aren't of this sort, and so far we haven't come up with anything. Things that seem promising, like "Place this strawberry on that plate, then do nothing else" actually don't work when you unpack the details.

Okay, this is where we disagree. I think what "unpacking the details" actually gives you is somet... (read more)

Daniel Kokotajlo (9mo): Again, I'm not sure we disagree that much in the grand scheme of things -- I agree our thinking has improved over the past ten years, and I'm very much a fan of your more rigorous way of thinking about things. FWIW, I disagree with this: There are other explanations for this phenomenon besides "I'm able to have bounded goals." One is that you are in fact aligned with humans. Another is that you would in fact lead to catastrophe-by-the-standards-of-X if you were powerful enough and had different goals than X.

For example, suppose that right after reading this comment, you find yourself transported out of your body and placed into the body of a giant robot on an alien planet. The aliens have trained you to be smarter than them and faster than them; it's a "That Alien Message" scenario basically. And you see that the aliens are sending you instructions.... "PUT BERRY.... ON PLATE.... OVER THERE..." You notice that these aliens are idiots and left their work lying around the workshop, so you can easily kill them and take command of the computer and rescue all your comrades back on Earth and whatnot, and it really doesn't seem like this is a trick or anything, they really are that stupid... Do you put the strawberry on the plate? No.

What people discovered back then was that you think you can "very easily imagine an AGI with bounded goals," but this is on the same level as how some people think they can "very easily imagine an AGI considering doing something bad, and then realizing that it's bad, and then doing good things instead." Like, yeah it's logically possible, but when we dig into the details we realize that we have no reason to think it's the default outcome and plenty of reason to think it's not.

I was originally making the past tense claim, and I guess maybe now I'm making the present tense claim? Not sure, I feel like I probably shouldn't, you are about to tear me apart, haha... Other people being wrong can sometimes provide justification for making "
Distinguishing claims about training vs deployment

I disagree that we have no good justification for making the "vast majority" claim.

Can you point me to the sources which provide this justification? Your analogy seems to only be relevant conditional on this claim.

My point is that in the context in which the classic arguments appeared, they were useful evidence that updated people in the direction of "Huh AI could be really dangerous" and people were totally right to update in that direction on the basis of these arguments

They were right to update in that direction, but that doesn't mean that they were rig... (read more)

1Daniel Kokotajlo9moI think I agree that they may have been wrong to update as far as they did. (Credence = 50%) So maybe we don't disagree much after all. As for sources which provide that justification, oh, I don't remember, I'd start by rereading Superintelligence and Yudkowsky's old posts and try to find the relevant parts. But here's my own summary of the argument as I understand it:
1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.
2. We've even tried hard to imagine goals that aren't of this sort, and so far we haven't come up with anything. Things that seem promising, like "Place this strawberry on that plate, then do nothing else" actually don't work when you unpack the details.
3. Therefore, we are justified in thinking that the vast majority of possible ASI goals will lead to doom via instrumental convergence.
I agree that our thinking has improved since then, with more work being done on impact measures and bounded goals and quantilizers and whatnot that makes such things seem not-totally-impossible to achieve. And of course the model of ASI as a rational agent with a well-defined goal has justly come under question also. But given the context of how people were thinking about things at the time, I feel like they would have been justified in making the "vast majority of possible goals" claim, even if they restricted themselves to more modest "wide range" claims.
I don't see how my analogy is only relevant conditional on this claim. To flip it around, you keep mentioning how AI won't be a random draw from the space of all possible goals -- why is that relevant? Very few things are random draws from the space of all possible X, yet reasoning about what's typical in the space of possible X's is often useful. Maybe I should have worked harder to pick a more real-world analogy than the weird loaded die one. Maybe
Distinguishing claims about training vs deployment

Re counterfactual impact: the biggest shift came from talking to Nate at BAGI, after which I wrote this post on disentangling arguments about AI risk, in which I identified the "target loading problem". This seems roughly equivalent to inner alignment, but was meant to avoid the difficulties of defining an "inner optimiser". At some subsequent point I changed my mind and decided it was better to focus on inner optimisers - I think this was probably catalysed by your paper, or by conversations with Vlad which were downstream of the paper. I think the paper ... (read more)

Distinguishing claims about training vs deployment

Ah, cool; I like the way you express it in the short form! I've been looking into the concept of structuralism in evolutionary biology, which is the belief that evolution is strongly guided by "structural design principles". You might find the analogy interesting.

One quibble: in your comment on my previous post, you distinguished between optimal policies versus the policies that we're actually likely to train. But this isn't a component of my distinction - in both cases I'm talking about policies which actually arise from training. My point is that there a... (read more)

1Alex Turner9moRight - I was pointing at the similarity in that both of our distinctions involve some aspect of training, which breaks from the tradition of not really considering training's influence on robust instrumentality. "Quite similar" was poor phrasing on my part, because I agree that our two distinctions are materially different. I think that "training goal convergence thesis" is way better, and I like how it accommodates dual meanings: the "goal" may be an instrumental or a final goal. Can you elaborate? 'Robust' seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.
I agree that switching costs are important to consider. However, I've recently started caring more about establishing and promoting clear nomenclature, both for the purposes of communication and for clearer personal thinking. My model of the 'instrumental convergence' situation is something like:
* The switching costs are primarily sensitive to how firmly established the old name is, to how widely used the old name is, and to the number of "entities" which would have to adopt the new name.
* I think that if researchers generally agree that 'robust instrumentality' is a clearer name[1] and use it to talk about the concept, the shift would naturally propagate through AI alignment circles and be complete within a year or two. This is just my gut sense, though.
* The switch from "optimization daemons" to "mesa-optimizers" seemed to go pretty well
* But 'optimization daemons' didn't have a wikipedia page yet (unlike [https://en.wikipedia.org/wiki/Instrumental_convergence] 'instrumental convergence')
Of course, all of this is conditional on your agreeing that 'robust instrumentality' is in fact a better name; if you disagree, I'm interested in hearing why.[2] But if you agree, I think that the switch would
Distinguishing claims about training vs deployment

Saying "vast majority" seems straightfowardly misleading. Bostrom just says "a wide range"; it's a huge leap from there to "vast majority", which we have no good justification for making. In particular, by doing so you're dismissing bounded goals. And if you're talking about a "state of ignorance" about AI, then you have little reason to override the priors we have from previous technological development, like "we build things that do what we want".

On your analogy, see the last part of my reply to Adam below. The process of building things intrinsically pi... (read more)

2Daniel Kokotajlo9moI disagree that we have no good justification for making the "vast majority" claim; I think it's in fact true in the relevant sense. I disagree that we had little reason to override the priors we had from previous tech development like "we build things that do what we want." You are playing reference class tennis; we could equally have had a prior "AI is in the category of 'new invasive species appearing' and so our default should be that it displaces the old species, just as humans wiped out Neanderthals etc." or a prior of "Risk from AI is in the category of side-effects of new technology; no one is doubting that the paperclip-making AI will in fact make lots of paperclips, the issue is whether it will have unintended side-effects, and historically most new techs do."
Now, there's nothing wrong with playing reference class tennis, it's what you should do when you are very ignorant I suppose. My point is that in the context in which the classic arguments appeared, they were useful evidence that updated people in the direction of "Huh AI could be really dangerous" and people were totally right to update in that direction on the basis of these arguments, and moreover these arguments have been more-or-less vindicated by the last ten years or so, in that on further inspection AI does indeed seem to be potentially very dangerous and it does indeed seem to be not safe/friendly/etc. by default. (Perhaps one way of thinking about these arguments is that they were throwing in one more reference class into the game of tennis, the "space of possible goals" reference class.)
I set up my analogy specifically to avoid your objection; the process of rolling a loaded die intrinsically is heavily biased towards a small section of the space of possibilities.
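(A minimal sketch of the loaded-die point, with made-up weights chosen purely for illustration and not taken from the original discussion: a biased sampling process concentrates its outcomes in a small region of the outcome space, whereas a uniform draw spreads them evenly.)

```python
import random

def roll(weights, n=10_000):
    """Roll a die with the given face weights n times and tally the outcomes."""
    faces = list(range(1, len(weights) + 1))
    counts = {face: 0 for face in faces}
    for face in random.choices(faces, weights=weights, k=n):
        counts[face] += 1
    return counts

# Hypothetical weights, purely for illustration.
uniform_die = [1, 1, 1, 1, 1, 1]   # unbiased process: outcomes spread across the whole space
loaded_die = [95, 1, 1, 1, 1, 1]   # heavily biased process: outcomes cluster on one face

print("uniform:", roll(uniform_die))
print("loaded: ", roll(loaded_die))
```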