All of John_Maxwell's Comments + Replies

I'll respond to the "Predict hypothetical sensors" section in this comment.

First, I want to mention that predicting hypothetical sensors seems likely to fail in fairly obvious ways, e.g. you request a prediction about a sensor that's physically nonexistent and the system responds with a bunch of static or something. Note the contrast with the "human simulator" failure mode, which is much less obvious.

But I also think we can train the system to predict hypothetical sensors in a way that's really useful. As in my previous comment, I'll work from the assump... (read more)

2 · Paul Christiano · 1y
If we train on data about what hypothetical sensors should show (e.g. by experiments where we estimate what they would show using other means, or by actually building weird sensors), we could just end up getting predictions of whatever process we used to generate that data. In general the overall situation with these sensors seems quite similar to the original outer-level problem, i.e. training the system to answer "what would an ideal sensor show?" seems to run into the same issues as answering "what's actually going on?" E.g. your supersensor idea #3 seems to be similar to the "human operates SmartVault and knows if tampering occurred" proposal we discussed here [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.gi8iu4m98ok1].

I do think that excising knowledge is a substantive change; I feel like it's effectively banking on "if the model is ignorant enough about what humans are capable of, it needs to err on the side of assuming they know everything." But for intelligent models, it seems hard in general to excise knowledge of whole kinds of sensors (how do you know a lot about human civilization without knowing that it's possible to build a microphone?) without interfering with performance. And there are enough signatures that the excised knowledge is still not in-distribution with hypotheticals we make up (e.g. the possibility of microphones is consistent with everything else I know about human civilization and physics, the possibility of invisible and untouchable cameras isn't), and conservative bounds on what humans can know will still hit the one but not the other.

Thanks for the reply! I'll respond to the "Hold out sensors" section in this comment.

One assumption which seems fairly safe in my mind is that as the operators, we have control over the data our AI gets. (Another way of thinking about it is if we don't have control over the data our AI gets, the game has already been lost.)

Given that assumption, this problem seems potentially solvable.

Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get

... (read more)

I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.

Some other thoughts:

  • I felt like the report was unusually well-motivated when I put my "mainstream ML" glasses on, relative to a lot of alignment work.

  • ARC's overall approach is probably my favorite out of alignment research groups I'm aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.

  • Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via

... (read more)
2 · Paul Christiano · 1y
Thanks for the kind words (and proposal)! I broadly agree with "train a bunch of models and panic if any of them say something is wrong." The main catch is that this only works if none of the models are optimized to say something scary, or to say something different for the sake of being different. We discuss this a bit in this appendix [https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.e6ihlg3adrp].

We're imagining the case where the predictor internally performs inference in a learned model, i.e. we're not explicitly learning a Bayesian network but merely considering possibilities for what an opaque neural net is actually doing (or approximating) on the inside. I don't think this is a particularly realistic possibility, but if ELK fails in this kind of simple case it seems likely to fail in messier realistic cases.

(We're actually planning to do a narrower contest focused on ELK proposals.)

(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.

Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.

For what it's worth, I often find Eliezer's arguments unpersuasive because they seem shallow. For example:

The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.

This seems like a fuzzy "outside view" sort of argument. (Compare with: "A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways." On the other hand, a causal model of a gun lets you explain which specif... (read more)

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

Might depend whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that sto... (read more)

I'm glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I'm inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr's posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking i... (read more)

We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.
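The environment being described is simple enough to sketch directly. Here's a minimal version (the gym-style `reset`/`step` interface and all names are my own assumptions, not from the original post):

```python
class ButtonEnv:
    """Toy environment: pressing the button yields +1 reward in the
    current episode but -10 reward at the start of the next episode."""

    def __init__(self):
        self.pending_penalty = 0.0  # carried across the episode boundary

    def reset(self) -> float:
        """Start a new episode; deliver any penalty owed from the last one."""
        penalty, self.pending_penalty = self.pending_penalty, 0.0
        return penalty

    def step(self, press_button: bool) -> float:
        if press_button:
            self.pending_penalty = -10.0  # arrives next episode
            return 1.0
        return 0.0


env = ButtonEnv()
env.reset()
r_now = env.step(True)    # +1 this episode
r_next = env.reset()      # -10 shows up at the start of the next episode
```

A purely episode-myopic agent presses the button every time (it only sees the +1); an agent that refrains is revealing non-myopic preferences, which is the point of the test.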

Are you sure that "episode" is the word you're looking for here?

https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL

I'm especially confused becaus... (read more)

5 · Evan Hubinger · 2y
Yes; episode is correct there—the whole point of that example is that, by breaking the episodic independence assumption, otherwise hidden non-myopia can be revealed. See the discussion of the prisoner's dilemma unit test in Krueger et al.'s “Hidden Incentives for Auto-Induced Distributional Shift [https://arxiv.org/abs/2009.09153]” for more detail on how breaking this sort of episodic independence plays out in practice.

...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.

ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.

I'm not convinced your chess example, where the practical solution resembles the hypercomputer one, is representativ... (read more)

Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say.

Makes sense.

...there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.

I think it might be useful to distinguish between being aware of onesel... (read more)

I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also polarizes society as a whole and gets them to spend more time on Youtube on a macro level ... (read more)

1 · Adam Shimi · 2y
I agree, but I think you can have problems (and even Predict-O-Matic-like problems) without reaching that different sort of reasoning. Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say.

One could also argue that because YouTube videos contain so much info about the real world, a powerful enough algorithm using them can probably develop a pretty good model of the world. And there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.

Agreed, this is more the kind of problem that emerges from RL-like training. The page [https://wiki.tournesol.app/index.php/YouTube#Algorithms] on the Tournesol wiki about this subject points to this recent paper [https://www.ijcai.org/Proceedings/2019/0360.pdf] that proposes a recommendation algorithm tried in practice on YouTube. AFAIK we don't have access to the actual algorithm used by YouTube, so it's hard to say whether it's using RL; but the paper above looks like evidence that it eventually will be.

Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world.

I see this argument pop up every so often. I don't find it persuasive because it presents a false choice in my view.

Our choice is not between having humans run the world and having a benevolent god run the world. Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).

If you think the alignment problem is hard, you probably ... (read more)

2 · Alex Flint · 2y
Right, I agree that having a benevolent god run the world is not within our choice set. Well just to re-state the suggestion in my original post: is this dichotomy between humans running the world or something else running the world really so inescapable? The child in the sand pit does not really run the world, and in an important way the parent also does not run the world -- certainly not from the perspective of the child's whole-life trajectory.

I was using it to refer to "any inner optimizer". I think that's the standard usage but I'm not completely sure.

With regard to the editing text discussion, I was thinking of a really simple approach where we resample words in the text at random. Perhaps that wouldn't work great, but I do think editing has potential because it allows for more sophisticated thinking.
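That resampling loop could look something like the following sketch (the `propose` and `score` functions are stand-in assumptions; a real system would use a language model's conditional distribution and some quality measure):

```python
import random

def edit_by_resampling(tokens, propose, score, n_steps=100, seed=0):
    """Iteratively resample a randomly chosen position in the text,
    keeping the edit only when it improves the overall score."""
    rng = random.Random(seed)
    tokens = list(tokens)
    best = score(tokens)
    for _ in range(n_steps):
        i = rng.randrange(len(tokens))
        candidate = tokens.copy()
        candidate[i] = propose(candidate, i)  # e.g. sample from an LM
        new = score(candidate)
        if new > best:
            tokens, best = candidate, new
    return tokens


# Toy usage: a scorer that just counts occurrences of "good".
result = edit_by_resampling(
    ["bad"] * 5,
    propose=lambda toks, i: "good",
    score=lambda toks: toks.count("good"),
    n_steps=50,
)
```

Because later edits are conditioned on earlier accepted edits anywhere in the text, this kind of loop lets information flow in both directions, which is the "more sophisticated thinking" advantage over pure left-to-right generation.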

Let's say we want our language model to design us an aircraft. Perhaps it starts by describing the engine, and then it describes the wings. Standard autoregressive text generation (assuming no lookahead) will allow the engine design to influence the wing design (assuming the engine design is inside the... (read more)

1 · Alex Gray · 2y
Clarifying Q: Does mesa-optimization refer to any inner optimizer, or one that is in particular not aligned with the outer context?

That's possible, but I'm guessing that it's not hard for a superintelligent AI to suddenly swallow an entire system using something like gray goo.

In this reaction to Critch's podcast, I wrote about some reasons to think that a singleton would be preferable to a multipolar scenario. Here's another rather exotic argument.

[The dark forest theory] is explained very well near the end of the science fiction novel, The Dark Forest by Liu Cixin.

...

When two [interstellar] civilizations meet, they will want to know if the other is going to be friendly or hostile. One side might act friendly, but the other side won't know if they are just faking it to put them at ease while armies are built in secret.

... (read more)
2 · Diffractor · 2y
Potential counterargument: Second-strike capabilities are still relevant in the interstellar setting. You could build a bunch of hidden ships in the Oort cloud to ram the foe and do equal devastation if the other party does it first, deterring a first strike even with tensions and an absence of communication. Further, while the "ram with high-relativistic objects" idea works pretty well for preemptively ending a civilization confined to a handful of planets, AIs would be able to colonize a bunch of little asteroids and KBOs and comets in the Oort cloud, and the higher level of dispersal would lead to preemptive total elimination being less viable.

A friend and I went on a long drive recently and listened to this podcast with Andrew Critch on ARCHES. On the way back from our drive we spent some time brainstorming solutions to the problems he outlines. Here are some notes on the podcast + some notes on our brainstorming.

In a possibly inaccurate nutshell, Critch argues that what we think of as the "alignment problem" is most likely going to get solved because there are strong economic incentives to solve it. However, Critch is skeptical of forming a singleton--he says people tend to resist that kind... (read more)

Your philosophical point is interesting; I have a post in the queue about that. However I don't think it really proves what you want it to.

Having John_Maxwell in the byline makes it far more likely that I'm the author of the post.

If humans can make useful judgements re: whether this is something I wrote, vs something nostalgebraist wrote to make a point about bylines, I don't see why a language model can't do the same, in principle.

GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahea

... (read more)

So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.
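The quoted claim can be made concrete with a back-of-the-envelope calculation (the 90% per-step accuracy and the independence of errors are my own simplifying assumptions):

```python
# If each one-step prediction is right with probability p and errors
# compound independently, a k-step rollout is right with probability
# roughly p**k.
p = 0.9
for k in [1, 5, 10, 20]:
    print(k, round(p ** k, 3))
# 1 0.9
# 5 0.59
# 10 0.349
# 20 0.122
```

So a predictor that is strong at range 1 (90%) is already below 35% by range 10, and approaching chance levels at a small multiple of that.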

A system which develops small-L lookahead (for L > 1) may find large-L lookahead to be nearby in programspace. If so, incentivizing the development of small-L lookahead makes it more likely that the system will try large-L lookahead and find it to be useful as well (in predicting chess moves for instance).

My intuition is that small-L lookahead could be close to large-L... (read more)

1 · nostalgebraist · 2y
No, it's a more philosophical point. Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."

For example, the mere appearance of something like "Title: Why GPT wants to mesa-optimize & how we might change this / Author: John_Maxwell" does not guarantee that the text following it bears that title, or was written by that author. (As I am illustrating right now.)

Of course, one can design datasets where information like this is provided more authoritatively -- say, always at the start of each text, curated for quality, etc. (GPT isn't like that, but Grover and CTRL kind of are, in different ways.) But even that can only go so far. If the author is "Julius Caesar," does that mean the historical figure, some internet poster with that handle, or any number of other possibilities? A passage of fiction written in a character's voice -- is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character? (Note that the character is a much better answer to the question "who does this sound like?") And doesn't the date matter too, so we know whether this post in the venue "Less Wrong" was on 2010's LW or 2020's?

Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words. You can try to hack in some sidechannels to provide context, but there's no way they will capture everything needed to locate the text fully in its social, physical, and temporal position within the broader world. And just as a definitional matter, these sidechannels are modifications to "language modeling," which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).

Yeah, not for transformers I think. capybaralet's point about conservation of expected evidence applies here --
  1. Stopping mesa-optimizing completely seems mad hard.

As I mentioned in the post, I don't think this is a binary, and stopping mesa-optimization "incompletely" seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn't seem mad hard to me.

  2. Managing "incentives" is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.

I'm less optimistic about this approach.

  1. There is a stochastic aspect to training ML models, so it's not enough to say "the incentives favor Mesa-Optimi

... (read more)
1 · David Scott Krueger · 2y
By managing incentives I expect we can, in practice, do things like "[tell it to] restrict its lookahead to particular domains"... or remove any incentive for control of the environment. I think we're talking past each other a bit here.

Now it's true that efficiently estimating that conditional using a single forward pass of a transformer might involve approximations to beam search sometimes.

Yeah, that's the possibility the post explores.

At a high level, I don't think we really need to be concerned with this form of "internal lookahead" unless/until it starts to incorporate mechanisms outside of the intended software environment (e.g. the hardware, humans, the external (non-virtual) world).

Is there an easy way to detect if it's started doing that / tell it to restrict its lookahead... (read more)

3 · David Scott Krueger · 2y
My intuitions on this matter are:

1) Stopping mesa-optimizing completely seems mad hard.
2) Managing "incentives" is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.
3) On the other hand, it probably won't scale forever.

To elaborate on the incentive management thing... if we figure that stuff out and do it right and it has the promise that I think it does... then it won't restrict lookahead to particular domains, but it will remove incentives for instrumental goal seeking. If we're still in a situation where the AI doesn't understand its physical environment and isn't incentivized to learn to control it, then we can do simple things like use a fixed dataset (as opposed to data we're collecting online) in order to make it harder for the AI to learn anything significant about its physical environment.

Learning about the physical environment and using it to improve performance is not necessarily bad/scary absent incentives for control. However, I worry that having a good world model makes an AI much more liable to infer that it should try to control and not just predict the world.

BTW with regard to "studying mesa-optimization in the context of such systems", I just published this post: Why GPT wants to mesa-optimize & how we might change this.

I'm still thinking about the point you made in the other subthread about MAML. It seems very plausible to me that GPT is doing MAML type stuff. I'm still thinking about if/how that could result in dangerous mesa-optimization.

Well I suppose mesa-optimization isn't really a binary is it? Like, maybe there's a trivial sense in which self-attention "mesa-optimizes" over its input when figuring out what to pay attention to.

But ultimately, what matters isn't the definition of the term "mesa-optimization", it's the risk of spontaneous internal planning/optimization that generalizes in unexpected ways or operates in unexpected domains. At least in my mind. So the question is whether this considering multiple possibilities about text stuff could also improve its ability to consider ... (read more)

3 · Steve Byrnes · 2y
I think the Transformer is successful in part because it tends to solve problems by considering multiple possibilities, processing them in parallel, and picking the one that looks best. (Selection-type optimization [https://www.lesswrong.com/posts/ZDZmopKquzHYPRNxq/selection-vs-control].) If you train it on text prediction, that's part of how it will do text prediction. If you train it on a different domain, that's part of how it will solve problems in that domain too.

I don't think GPT builds a "mesa-optimization infrastructure" and then applies that infrastructure to language modeling. I don't think it needs to. I think the Transformer architecture is already raring to go forth and mesa-optimize, as soon as you give it any optimization pressure to do so.

So anyway your question is: can it display foresight / planning in a different domain without being trained in that domain? I would say, "yeah probably, because practically every domain is instrumentally useful for text prediction". So somewhere in GPT-3's billions of parameters I think there's code to consider multiple possibilities, process them in parallel, and pick the best answer, in response to the question of What will happen next when you put a sock in a blender? or What is the best way to fix an oil leak?—not just those literal words as a question, but the concepts behind them, however they're invoked.

(Having said that, I don't think GPT-3 specifically will do side-channel attacks, but for other unrelated reasons that are off-topic here. Namely, I don't think it is capable of making the series of new insights required to develop an understanding of itself and its situation and then take appropriate actions. That's based on my speculations here [https://www.lesswrong.com/posts/SkcM4hwgH3AP6iqjs/can-you-get-agi-from-a-transformer].)

This post distinguishes between mesa-optimization and learned heuristics. What you're describing sounds like learned heuristics. ("Learning which words are easy to rhyme" was an example I gave in the post.) Learned heuristics aren't nearly as worrisome as mesa-optimization because they're harder to modify and misuse to do planning in unexpected domains. When I say "lookahead" in the post I'm pretty much always referring to the mesa-optimization sort.

3 · Steve Byrnes · 2y
Suppose I said (and I actually believe something like this is true): "GPT often considers multiple possibilities in parallel for where the text is heading—including both where it's heading in the short-term (is this sentence going to end with a prepositional phrase or is it going to turn into a question?) and where it's heading in the long-term (will the story have a happy ending or a sad ending?)—and it calculates which of those possibilities are most likely in light of the text so far. It chooses the most likely next word in light of this larger context it figured out about where the text is heading." If that's correct, would you call GPT a mesa-optimizer?

The outer optimizer is the more obvious thing: it's straightforward to say there's a big difference in dealing with a superhuman Oracle AI with only the goal of answering each question accurately, versus one whose goals are only slightly different from that in some way.

GPT generates text by repeatedly picking whichever word seems most probable given all the words that came before. So if its notion of "most probable" is almost, but not quite, answering every question accurately, I would expect a system which usually answers questions accurately but sometimes answers them inaccurately. That doesn't sound very scary?
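The generation loop being described is just greedy argmax decoding. A minimal sketch (the `next_word_probs` callable is a stand-in assumption for the model; real GPT decoding operates on tokens and often samples rather than taking the argmax):

```python
def greedy_generate(next_word_probs, prompt, max_words=20, stop="<eos>"):
    """Repeatedly append whichever word the model assigns the highest
    probability, given all the words generated so far."""
    words = list(prompt)
    for _ in range(max_words):
        probs = next_word_probs(words)      # dict: word -> probability
        word = max(probs, key=probs.get)    # pick the argmax
        if word == stop:
            break
        words.append(word)
    return words


# Toy stand-in model: after "hello", strongly prefer "world", then stop.
toy_model = lambda ws: ({"world": 0.9, "<eos>": 0.1}
                        if ws[-1] == "hello" else {"<eos>": 1.0})
greedy_generate(toy_model, ["hello"])  # -> ["hello", "world"]
```

The point in the surrounding discussion maps onto this loop directly: if `next_word_probs` is slightly miscalibrated, the output degrades into occasional wrong words rather than anything systematically adversarial.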

esp. since GPT-3's 0-shot learning looks like mesa-optimization

Could you provide more details on this?

Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they'd like to see from GPT-3 in response to those inputs ("few-shot learning", right? I don't know what 0-shot learning you're referring to.) Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?

If that's what you're saying... That seems unlikely to me. GPT-3 is essentially a stack of 96 tr... (read more)

2 · David Scott Krueger · 2y
No, that's zero-shot. Few-shot is when you train on those examples instead of just stuffing them into the context. It looks like mesa-optimization because it seems to be doing something like learning about new tasks or new prompts that are very different from anything it's seen before, without any training, just based on the context (0-shot). By "training a model", I assume you mean "a ML model" (as opposed to, e.g., a world model). Yes, I am claiming something like that, but learning vs. inference is a blurry line. I'm not saying it's doing SGD; I don't know what it's doing in order to solve these new tasks. But TBC, 96 steps of gradient descent could be a lot. MAML does meta-learning with 1.

Let's say I'm trying to describe a hockey game. Modularizing the preferences from other aspects of the team algorithm makes it much easier to describe what happens at the start of the second period, when the two teams switch sides.

The fact that humans find an abstraction useful is evidence that an AI will as well. The notion that agents have preferences helps us predict how people will change their plans for achieving their goals when they receive new information. Same for an AI.

1 · Stuart Armstrong · 2y
Humans have a theory of mind, which makes certain types of modularizations easier. That doesn't mean that the same modularization is simple for an agent that doesn't share that theory of mind. Then again, it might be. This is worth digging into empirically. See my post on the optimistic and pessimistic scenarios [https://www.lesswrong.com/posts/6XLyM22PBd9qDtin8/learning-human-preferences-optimistic-and-pessimistic]; in the optimistic scenario, preferences, human theory of mind, and all the other elements are easy to deduce (there's an informal equivalence result; if one of those is easy to deduce, all the others are). So we need to figure out if we're in the optimistic or the pessimistic scenario.

Note that this decomposition is simpler than a "reasonable" version of figure 4, since the boundaries between the three modules don't need to be specified.

Consider two versions of the same program. One makes use of a bunch of copy/pasted code. The other makes use of a nice set of re-usable abstractions. The second program will be shorter/simpler.

Boundaries between modules don't cost very much, and modularization is super helpful for simplifying things.
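A toy illustration of the point (this example is mine, not from the original comment): the same program written twice, once with copy/pasted logic and once with a reusable abstraction whose "boundary" costs only a name and a call.

```python
# Version 1: the same parsing logic is copy/pasted in two places.
def parse_users(lines):
    rows = []
    for line in lines:
        rows.append(line.strip().split(","))
    return rows

def parse_orders(lines):
    rows = []
    for line in lines:
        rows.append(line.strip().split(","))
    return rows


# Version 2: one reusable helper. Specifying the module boundary
# (a name and a signature) is cheap, and the duplication disappears,
# so the overall description is shorter and simpler.
def parse_csv(lines):
    return [line.strip().split(",") for line in lines]

def parse_users_v2(lines):
    return parse_csv(lines)

def parse_orders_v2(lines):
    return parse_csv(lines)
```

The description-length comparison is the relevant one for the simplicity argument: version 2 pays a small fixed cost per boundary and saves the full cost of every duplicated block.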

1 · Stuart Armstrong · 2y
The best modularization for simplification will not likely correspond to the best modularization for distinguishing preferences from other parts of the agent's algorithm (that's the "Occam's razor" result).

Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).

Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation and add it as an answer in that thread. Given that we may be in an AI overhang, I'd like the answ... (read more)

This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends".

To clarify here: do you think the problems brought up in these answers are the main problems of alignment? This claim seems a bit odd to me because I don't think those problems are highlighted in any of the major AI alignment research agenda papers. (Alternatively, if you feel like there are important omissions from those answers, I strongly encourage you to write your own answer... (read more)

2 · johnswentworth · 2y
Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation [https://www.lesswrong.com/posts/42YykiTqtGMyJAjDM/alignment-as-translation] is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here [https://www.lesswrong.com/posts/Et2pWrj4nWfdNAawh/what-specific-dangers-arise-when-asking-gpt-n-to-write-an?commentId=9Gq55Mx6TFYYTT6Fa] are similar to that).

Using GPT-like systems to simulate alignment researchers' writing is a probably-safer use-case, but it still runs into the core catch-22. Either:

  • It writes something we'd currently write, which means no major progress (since we don't currently have solutions to the major problems and therefore can't write down such solutions), or

  • It writes something we currently wouldn't write, in which case it's out-of-distribution and we have to worry about how it's extrapolating us

I generally expect the former to mostly occur by default; the latter would require some clever prompts. I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we're more useful to simulate.

This sounds like a great tool to have. It's exactly the sort of thing which is probably marginally useful. It's unlikely to help much on the big core problems; it wouldn't be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact.

I do think a lot of the things you're suggesting would be valuable and worth doing, on the margin. They're probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they're still useful. The "safety problems too complex for ourselves" are things like the fusion

If we had that - not necessarily a full model of human values, just a formalization which we were confident could represent them - then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.

Do you have in mind a specific aspect of human values that couldn't be represented using, say, the reward function of a reinforcement learning agent?

On the tools side, I assume the tools will be reasoning about systems/problems which humans can't understand - that's the main value prop in the first place. Trying to

... (read more)
1 · johnswentworth · 2y
It's not the function-representation that's the problem, it's the type-signature of the function. I don't know what such a function would take in or what it would return. Even RL requires that we specify the input-output channels up-front.

This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends". More generally: I'm certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that's very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI.

What exactly is the claim here? I don't think solving FAI involves reasoning about things beyond humans. I think the AIs themselves will need to reason about things beyond humans, and in particular will need to reason about complex safety problems on a day-to-day basis, but I don't think that designing a friendly AI is too complex for humans. Much of the point of AI is that we can design systems which can reason about things too complex for ourselves. Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.

What notion of "corrigible" are you using here? It sounds like it's not MIRI's "the AI won't disable its own off-switch" notion.
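The type-signature point can be made concrete. Even in a standard RL formulation, the reward function's type is fixed before training starts: it maps from pre-specified state and action spaces to a real number. A minimal sketch (all names here are illustrative, not from the original discussion):

```python
from typing import Callable, Tuple

# In standard RL, the reward function's type signature is committed to up-front:
# (state, action, next_state) -> float. The state and action spaces must be
# specified before any learning happens.
State = Tuple[float, ...]  # e.g. a fixed-size sensor reading
Action = int               # e.g. an index into a fixed set of motor commands
RewardFn = Callable[[State, Action, State], float]

def example_reward(s: State, a: Action, s_next: State) -> float:
    # A toy reward: prefer transitions that increase the first coordinate.
    return float(s_next[0] - s[0])

# The open problem is upstream of anything like this: for "human values" we
# don't know what State, Action, or even the return type ought to be.
print(example_reward((0.0,), 1, (1.5,)))
```

The sketch illustrates the claim: writing down *some* reward function is easy once the input-output channels are fixed, but fixing those channels is exactly the part we don't know how to do for human values.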

This comment definitely wins the award for best comment on the post so far.

Thanks!

I don't consider myself an expert on the unsupervised learning literature by the way, I expect there is more cool stuff to be found.

I don't expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.

Can you be more specific about the theoretical bottlenecks that seem most important?

I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away).

I agree that Tool AI is not inherently safe. The key question is which problem is easier: the alignment problem, or the safe-use-of... (read more)

1johnswentworth2y
Type signature of human values is the big one. I think it's pretty clear at this point that utility functions aren't the right thing, that we value things "out in the world" as opposed to just our own "inputs" or internal state, that values are not reducible to decisions or behavior, etc. We don't have a framework for what-sort-of-thing human values are. If we had that - not necessarily a full model of human values, just a formalization which we were confident could represent them - then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.

A good argument, but I see the difficulties of safe tool AI and the difficulties of alignment as mostly coming from the same subproblem. To the extent that that's true, alignment work and tool safety work need to be basically the same thing.

On the tools side, I assume the tools will be reasoning about systems/problems which humans can't understand - that's the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the "tools" have their own models of human values, and use those models to check the safety of their outputs... which brings us right back to alignment.

Simple mechanisms like always displaying an estimated probability that I'll regret asking a question would probably help, but I'm mainly worried about the unknown unknowns, not the known unknowns. That's part of what I mean when I talk about marginal improvements vs closing the bulk of the gap - the unknown unknowns are the bulk of the gap. (I could see tools helping in a do-the-same-things-but-faster sort of way, and human-mimicking approaches in particular are potentially helpful there. On the other han

My take is that corrigibility is sufficient to get you an AI that understands what it means to "keep improving their understanding of Alice's values and to serve those values".  I don't think the AI needs to play the "genius philosopher" role, just the "loyal and trustworthy servant" role.  A superintelligent AI which plays that role should be able to facilitate a "long reflection" where flesh and blood humans solve philosophical problems.

(I also separately think unsupervised learning systems could in principle make philosophical breakthroughs. Maybe one already has.)

Thanks a lot for writing this. I've been thinking about FAI plans along these lines for a while now, here are some thoughts on specific points you made.

First, I take issue with the "Alignment By Default" title. There are two separate questions here. Question #1 is whether we'd have a good outcome if everyone concerned with AI safety got hit by a bus. Question #2 is whether there's a way to create Friendly AI using unsupervised learning. I'm rather optimistic that the answer to Question #2 is yes. I find the unsupervised learning family of approaches ... (read more)

1johnswentworth2y
Thanks for the comments, these are excellent!

Valid complaint on the title, I basically agree. I only give the path outlined in the OP ~10% of working without any further intervention by AI safety people, and I definitely agree that there are relatively-tractable-seeming ways to push that number up on the margin. (Though those would be marginal improvements only; I don't expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.)

I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario [https://www.lesswrong.com/posts/2NaAhMPGub8F2Pbr7/the-fusion-power-generator-scenario] is a prototypical example of my concerns here (also see this comment [https://www.lesswrong.com/posts/2NaAhMPGub8F2Pbr7/the-fusion-power-generator-scenario?commentId=W6uhX8gbf2BczSmfj] on it, which explains what I see as the key take-away). The idea of simulating a human doing moral philosophy is a bit different than what I usually imagine, though; it's basically like taking an alignment researcher and running them on faster hardware. That doesn't directly solve any of the underlying conceptual problems - it just punts them to the simulated researchers - but it is presumably a strict improvement over a limited number of researchers operating slowly in meatspace. Alignment research ems!

I don't think this helps much. Two examples of "specifics of the data collection process" to illustrate:

  • Suppose our data consists of human philosophers' writing on morality. Then the "specifics of the data collection process" includes the humans' writing skills and signalling incentives, and everything else besides the underlying human values.
  • Suppose our data consists of humans' choices in various situations. Then the "specifics of the data collection process" includes the humans' mistaken reasoning, habits, divergence of decision-making from values, and everything else besides the underlying hu

Some notes on the loss function in unsupervised learning:

Since an unsupervised learner is generally just optimized for predictive power

I think it's worthwhile to distinguish the loss function that's being optimized during unsupervised learning, vs what the practitioner is optimizing for. Yes, the loss function being optimized in an unsupervised learning system is frequently minimization of reconstruction error or similar. But when I search for "unsupervised learning review" on Google Scholar, I find this highly cited paper by Bengio et al. The abstr... (read more)

3johnswentworth2y
This comment definitely wins the award for best comment on the post so far. Great ideas, highly relevant links. I especially like the deliberate noise idea. That plays really nicely with natural abstractions as information-relevant-far-away: we can intentionally insert noise along particular dimensions, and see how that messes with prediction far away (either via causal propagation or via loss of information directly). As long as most of the noise inserted is not along the dimensions relevant to the high-level abstraction, denoising should be possible. So it's very plausible that denoising autoencoders are fairly-directly incentivized to learn natural abstractions. That'll definitely be an interesting path to pursue further. Assuming that the denoising autoencoder objective more-or-less-directly incentivizes natural abstractions, further refinements on that setup could very plausibly turn into a useful "ease of interpretability" objective.
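The "intentionally insert noise along particular dimensions" idea can be sketched in a toy linear setting. Below, a 1-d "abstract" variable is redundantly embedded across several observed dimensions (information relevant "far away"), while other dimensions carry purely idiosyncratic noise; a closed-form least-squares map stands in for a trained denoising autoencoder. This is a toy illustration of the incentive structure, not the actual proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 1-d "abstract" variable z is redundantly embedded in the
# first 8 observed dimensions; the last 2 dimensions are idiosyncratic noise.
n = 5000
z = rng.normal(size=(n, 1))
x = np.hstack([z.repeat(8, axis=1), rng.normal(size=(n, 2))])

# Denoising objective: corrupt x, then predict the clean x with a linear map
# (closed-form least squares stands in for training an autoencoder).
x_noisy = x + 0.5 * rng.normal(size=x.shape)
w, *_ = np.linalg.lstsq(x_noisy, x, rcond=None)
x_hat = x_noisy @ w

# Redundant (abstraction-carrying) dimensions can be denoised by pooling
# their copies; idiosyncratic dimensions cannot be denoised nearly as well.
err_redundant = np.mean((x_hat[:, :8] - x[:, :8]) ** 2)
err_idio = np.mean((x_hat[:, 8:] - x[:, 8:]) ** 2)
print(err_redundant < 0.05, err_idio > 0.1)
```

The redundantly-embedded variable is exactly the kind of thing denoising rewards recovering, which is the sense in which denoising autoencoders are plausibly incentivized to learn natural abstractions.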

See also Robustness to Scale. You wrote that "we expect that the failure modes which still appear under such assumptions are the hard failure modes" (emphasis mine). But there are some failure modes which don't appear with existing algorithms, yet are hypothesized to appear in the limit of more data and compute, such as the "malign universal prior" problem. It's unclear how much to worry about these problems, because as you say, we don't actually expect to use e.g. Solomonoff induction. I suspect a key issue is whether the problem is an inevitable result o... (read more)

But there are some failure modes which don't appear with existing algorithms, yet are hypothesized to appear in the limit of more data and compute...

This is a great point to bring up. One thing the OP probably doesn't emphasize enough is: just because one particular infinite-data/compute algorithm runs into a problem does not mean that problem is hard.

Zooming out for a moment, the strategy the OP is using is problem relaxation: we remove a constraint from the problem (in this case data/compute constraints), solve that relaxed problem, then use the relaxed... (read more)

A general method for identifying dangers: For every topic which gets discussed on AF, figure out what could go wrong if GPT-N decided to write a post on that topic.

  • GPT-N writes a post about fun theory. It illustrates principles of fun theory by describing an insanely fun game you can play with an ordinary 52-card deck. FAI work gets pushed aside as everyone becomes hooked on this new game. (Procrastination is an existential threat!)

  • GPT-N writes a post about human safety problems. To motivate its discussion, it offers some extraordinarily compelli

... (read more)

One class of problem comes about if GPT-N starts thinking about "what would a UFAI do in situation X":

  • Inspired by AI box experiments, GPT-N writes a post about the danger posed by ultra persuasive AI-generated arguments for bad conclusions, and provides a concrete example of such an argument.
  • GPT-N writes a post where it gives a detailed explanation of how a UFAI could take over the world.  Terrorists read the post and notice that UFAI isn't a hard requirement for the plan to work.
  • GPT-N begins writing a post about mesa-optimizers and starts simulating a mesa-optimizer midway through.

There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper.

 

This is simultaneously

  • a major retreat from the "default outcome is doom" thesis which is frequently trotted out on this site (the statement is consistent with an AGI design that is 99.9% likely to be safe, which is very much incompatible with "default outcome is doom")
  • unrelated to our upload discussion (an upload is not an AGI, but you said even a great upload wasn't good enough for you)

You've picked a position vaguely in between th... (read more)

3johnswentworth3y
There isn't a standard reference because the argument takes one sentence, and I've been repeating it over and over again: what would Bayesian updates on low-level physics do? That's the unique solution with best-possible predictive power, so we know that anything which scales up to best-possible predictive power in the limit will eventually behave that way. (BTW I think that link is dead)

The "what would Bayesian updates on a low-level model do?" question is exactly the argument that the bridge design cannot be extended indefinitely, which is why I keep bringing it up over and over again.

This does point to one possibly-useful-to-notice ambiguous point: the difference between "this method would produce an aligned AI" vs "this method would continue to produce aligned AI over time, as things scale up". I am definitely thinking mainly about long-term alignment here; I don't really care about alignment on low-power AI like GPT-3 except insofar as it's a toy problem for alignment of more powerful AIs (or insofar as it's profitable, but that's a different matter). I've been less careful than I should be about distinguishing these two in this thread. All these things which we're saying "might work" are things which might work in the short term on some low-power AI, but will definitely not work in the long term on high-power AI. That's probably part of why it seems like I keep switching positions - I haven't been properly distinguishing when we're talking short-term vs long-term.

A second comment on this: if we want to make a piece of code faster, the first step is to profile the code to figure out which step is the slow one. If we want to make a beam stronger, the first step is to figure out where it fails. If we want to extend a bridge design, the first step is to figure out which piece fails under load if we just elongate everything. Likewise, if we want to scale up an AI alignment method, the first step is to figure out exactly how it fails under load as the AI's

if we're not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.

Optimize for having a diverse range of models that all seem to fit the data.
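One cheap way to operationalize "a diverse range of models that all seem to fit the data" is an ensemble whose disagreement flags out-of-distribution questions. A toy sketch (bootstrap-resampled polynomial fits; purely illustrative, not a claim about what the original commenter had in mind):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = sin(2*pi*x) + noise, observed only on [0, 1].
x = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=200)

# "Diverse models that all fit the data": degree-7 polynomial fits
# trained on different bootstrap resamples of the same dataset.
models = []
for _ in range(20):
    idx = rng.integers(0, 200, size=200)
    models.append(np.polyfit(x[idx], y[idx], deg=7))

preds_in = np.array([np.polyval(c, 0.5) for c in models])   # on-distribution
preds_out = np.array([np.polyval(c, 2.0) for c in models])  # off-distribution

# The ensemble agrees where the data constrains it and disagrees where it
# doesn't; disagreement is a cheap "this is out-of-distribution" signal.
print(preds_in.std() < preds_out.std())
```

Ensemble disagreement gives a warning signal for known unknowns, though (as the reply suggests) it's less clear how it helps with the harder problems under discussion.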

1johnswentworth3y
How would that fix any of the problems we've been talking about?

It doesn't matter how high-fidelity the upload is or how benevolent the human is, I'm not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that.

Here are some of the people who have the power to set off nukes right now:

"Don't let the perfect be the enemy of the good" is advice for writing emails and cleaning the house, not nuclear security.

Tell that to the Norwegian commandos who successfu... (read more)

2johnswentworth3y
There is a single coherent position here in which it is very hard to build an AGI which reliably is not a paperclipper. Yes, there are straightforward ways one might be able to create a helpful non-paperclipper AGI. But that "might" is carrying a lot of weight. All those straightforward ways have failure modes which will definitely occur in at least some range of parameters, and we don't know exactly what those parameter ranges are.

It's sort of like saying:

"It's very hard to design a long bridge which won't fall down!"

"Well actually here are some straightforward ways one might be able to create a long non-falling-down bridge..." <shows picture of a wooden truss>

What I'm saying is, that truss design is 100% going to fail once it gets big enough, and we don't currently know how big that is. When I say "it's hard to design a long bridge which won't fall down", I do not mean a bridge which might not fall down if we're lucky and just happen to be within the safe parameter range.

These are sufficient conditions for a careful strategy to make sense, not necessary conditions. Here's another set of sufficient conditions, which I find more realistic: the gains to be had in reducing AI risk are binary. Either we find the "right" way of doing things, in which case risk drops to near-zero, or we don't, in which case it's a gamble and we don't have much ability to adjust the chances/payoff. There are no significant marginal gains to be had.

You say: "we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute".  I think this depends on specific details of how the system is engineered.

"Physical process which generates the label koala" is not the same as "koala", and the system can get higher predictive power by modelling the former rather than the latter.

Suppose we use classification accuracy as our loss function.  If all the koalas are correctly classified by both models, then the two models have equal loss function scores.... (read more)
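The "two models with equal loss" point is easy to demonstrate concretely: if two features correlate perfectly on the training distribution - say, actual koala-ness versus an artifact of the labeling process - then a model using either feature achieves identical training loss, and the two only come apart off-distribution. A toy sketch (names and setup are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Training data: feature 0 is "actual koala-ness"; feature 1 is an artifact
# of the labeling process that happens to correlate perfectly in training.
n = 1000
koala = rng.integers(0, 2, size=n)
x_train = np.stack([koala, koala], axis=1)  # features agree in-distribution
y_train = koala

model_a = lambda xs: xs[:, 0]  # classifies via the concept itself
model_b = lambda xs: xs[:, 1]  # classifies via the labeling-process artifact

# Identical classification accuracy on the training distribution:
acc_a = np.mean(model_a(x_train) == y_train)
acc_b = np.mean(model_b(x_train) == y_train)

# Off-distribution, the features decouple and the two models disagree:
x_test = np.array([[1, 0], [0, 1]])
print(acc_a == acc_b, model_a(x_test).tolist(), model_b(x_test).tolist())
```

The loss function alone cannot distinguish "koala" from "physical process which generates the label koala"; only off-distribution behavior reveals which one the model learned.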

2johnswentworth3y
It's certainly conceivable to engineer systems some other way, and indeed I hope we do. Problem is:

  • if we just optimize for predictive power, then abstract notions will definitely be thrown away once the system can discover and perform Bayesian updates on low-level physics. (In principle we could engineer a system which never discovers that, but then it will still optimize predictive power by coming as close as possible.)
  • if we're not just optimizing for predictive power, then we need some other design criteria, some other criteria for whether/how well the system is working.

In one sense, the goal of all this abstract theorizing is to identify what that other criteria needs to be in order to reliably end up using the "right" abstractions in the way we want. We could probably make up some ad-hoc criteria which works at least sometimes, but then as architectures and hardware advance over time we have no idea when that criteria will fail.

(Probably tangential) No, this reveals that my verbal definition of a sandwich was not a particularly accurate description of my underlying notion of sandwich - which is indeed the case for most definitions most of the time [https://www.lesswrong.com/s/SGB7Y5WERh4skwtnb]. It certainly does not prove the existence of multiple ways of knowing what a sandwich is. Also, even if there's some sort of ensembling, the concept "sandwich" still needs to specify one particular ensemble.

We've shifted to arguing over a largely orthogonal topic. The OP is mostly about the interface by which GPT can be aligned to things. We've shifted to talking about what alignment means in general, and what's hard about aligning systems to the kinds of things we want. An analogy: the OP was mostly about programming in a particular language, while our current discussion is about what kinds of algorithms we want to write. Prompts are a tool/interface via which one can align a certain kind of system (i.e. GPT-3) with certain kinds o

Well, the moral judgements of a high-fidelity upload of a benevolent human are also a proxy for human values--an inferior proxy, actually.  Seems to me you're letting the perfect be the enemy of the good.

It doesn't matter how high-fidelity the upload is or how benevolent the human is, I'm not happy giving them the power to launch nukes without at least two keys, and a bunch of other safeguards on top of that. "Don't let the perfect be the enemy of the good" is advice for writing emails and cleaning the house, not nuclear security.

The capabilities of powerful AGI will be a lot more dangerous than nukes, and merit a lot more perfectionism.

Humans themselves are not aligned enough that I would be happy giving them the sort of power that AGI will eventually hav... (read more)

This is essentially the "tasty ice cream flavors" problem, am I right?  Trying to check if we're on the same page.

If so: John Wentworth said

"Tasty ice cream flavors" is also a natural category if we know who the speaker is

So how about instead of talking about "human values", we talk about what a particular moral philosopher endorses saying or doing, or even better, what a committee of famous moral philosophers would endorse saying/doing.

2johnswentworth3y
No, this is not the "tasty ice cream flavors" problem. The problem there is that the concept is inherently relative to a person. That problem could apply to "human values", but that's a separate issue from what dxu is talking about. The problem is that "what a committee of famous moral philosophers would endorse saying/doing", or human written text containing the phrase "human values", is a proxy for human values, not a direct pointer to the actual concept. And if a system is trained to predict what the committee says, or what the text says, then it will learn the proxy, but that does not imply that it directly uses the concept.
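The proxy-versus-pointer distinction is the standard Goodhart setup, which can be made concrete in a few lines: optimizing a proxy (target value plus a systematic bias from the data-generating process) lands somewhere other than optimizing the target itself. A toy sketch with made-up numbers:

```python
import numpy as np

# True value peaks at x = 1; the proxy (what the "committee" endorses)
# adds a systematic bias term, so it peaks somewhere else.
x = np.linspace(-3, 5, 10001)
true_value = -(x - 1) ** 2
proxy = true_value + 0.8 * x  # e.g. labeler bias toward a salient feature

x_star_true = x[np.argmax(true_value)]
x_star_proxy = x[np.argmax(proxy)]

# Optimizing the proxy lands at a different point than optimizing the target:
# a minimal illustration of "proxy for human values != human values".
print(round(float(x_star_true), 2), round(float(x_star_proxy), 2))
```

A system trained to predict what the committee says will, by construction, learn the proxy curve, and sufficient optimization pressure on that curve diverges from the target.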

Likewise, GPT-style models should have no trouble learning some model with human values embedded in it. But that embedding will not necessarily be simple; there won't just be a neuron that lights up in response to humans having their values met. The model will have a notion of human values embedded in it, but it won't actually use "human values" as an abstract object in its internal calculations; it will work with some lower-level "components" which themselves implement/embed human values.

If it's read moral philosophy, it should have some notion of what... (read more)

2johnswentworth3y
I generally agree with this. The things I'm saying about human values also apply to koala classification. As with koalas, I do think there's probably a wide range of parameters which would end up using the "right" level of abstraction for human values to be "natural". On the other hand, for both koalas and humans, we can be fairly certain that a system will stop directly using those concepts once it has sufficient available compute - again, because Bayesian updates on low-level physics are just better in terms of predictive power. Right now, we have no idea when that line will be crossed - just an extreme upper bound. We have no idea how wide/narrow the window of training parameters is in which either "koalas" or "human values" is a natural level of abstraction.

Ability to differentiate marsupials does not imply that the system is directly using the concept of koala. Yet again, consider how Bayesian updates on low-level physics would respond to the marsupial-differentiation task: it would model the entire physical process which generated the labels on the photos/videos. "Physical process which generates the label koala" is not the same as "koala", and the system can get higher predictive power by modelling the former rather than the latter. When we move to human values, that distinction becomes a lot more important: "physical process which generates the label 'human values satisfied'" is not the same as "human values satisfied". Confusing those two is how we get Goodhart problems.

We don't need to go all the way to low-level physics models in order for all of that to apply. In order for a system to directly use the concept "koala", rather than "physical process which generates the label koala", it has to be constrained on compute in a way which makes the latter too expensive - despite the latter having higher predictive power on the training data. Adding in transfer learning on some lower-level components does not change any of that; it should still be possible
3David Xu3y
GPT-3 and systems like it are trained to mimic human discourse. Even if (in the limit of arbitrary computational power) it manages to encode an implicit representation of human values somewhere in its internal state, in actual practice there is nothing tying that representation to the phrase "human values", since moral philosophy is written by (confused) humans, and in human-written text the phrase "human values" is not used in the consistent, coherent manner that would be required to infer its use as a label for a fixed concept.

And that's exactly the kind of complexity which is hard for something based on predictive power. Lower abstraction levels should generally perform better in terms of raw prediction, but the thing we want to point to lives at a high abstraction level.

 

You told me that "it's not actually hard for an unsupervised learner to end up with some notion of human values embedded in its world-model".  Now you're saying that things based on "predictive power" have trouble learning things at high abstraction levels.  Doesn't this suggest that your origin... (read more)

2johnswentworth3y
The original example is a perfect example of what this looks like: an unsupervised learner, given crap-tons of data and compute, should have no difficulty learning a low-level physics model of humans. That model will have great predictive power, which is why the model will learn it. Human values will be embedded in that model in exactly the same way that they're embedded in physical humans.

Likewise, GPT-style models should have no trouble learning some model with human values embedded in it. But that embedding will not necessarily be simple; there won't just be a neuron that lights up in response to humans having their values met. The model will have a notion of human values embedded in it, but it won't actually use "human values" as an abstract object in its internal calculations; it will work with some lower-level "components" which themselves implement/embed human values.

I am definitely not talking about bias-variance tradeoff. I am talking about compute-accuracy tradeoff. Again, think about the example of Bayesian updates on a low-level physical model: there is no bias-variance tradeoff there. It's the ideal model, full stop. The reason we can't use it is because we don't have that much compute. In order to get computationally tractable models, we need to operate at higher levels of abstraction than "simulate all these quantum fields".

Aligning superhuman AI, not just creating it. If you're unpersuaded, you should go leave feedback on Alignment as Translation [https://www.lesswrong.com/posts/42YykiTqtGMyJAjDM/alignment-as-translation], which directly talks about alignment as an interface problem.

Summary: the sense in which human values are "complex" is not about predictive power. A low-level physical model of some humans has everything there is to know about human values embedded within it; it has all the predictive power which can be had by a good model of human values. The hard part is pointing to the thing we consider "human values" embedded within that model. In large part, that's hard because it's not just a matter of predictive power.

  1. This still sounds like a shift in arguments to me. From what I remember, the MIRI-sphere take on upload

... (read more)
2johnswentworth3y
This is close, though not exactly what I want to claim. It's not that "dogs" are a "more natural category" in the sense that there are fewer similar categories which are hard to tell apart. Rather, it's that "dogs" are a less abstract natural category. Like, "human" is a natural category in roughly the same way as "dog", but in order to get to "human values" we need a few more layers of abstraction on top of that, and some of those layers have different type signatures than the layers below - e.g. we need not just a notion of "human", but a notion of humans "wanting things", which requires an entirely different kind of model from recognizing dogs. And that's exactly the kind of complexity which is hard for something based on predictive power. Lower abstraction levels should generally perform better in terms of raw prediction, but the thing we want to point to lives at a high abstraction level.

We are able to get systems to learn some abstractions just by limiting compute - today's deep nets have nowhere near the compute to learn to do Bayesian updates on low-level physics, so it needs to learn some abstraction. But the exact abstraction level learned is always going to be a tradeoff between available compute and predictive power. I do think there's probably a wide range of parameters which would end up using the "right" level of abstraction for human values to be "natural", but we don't have a good way to recognize when that has/hasn't happened, and relying on it happening would be a crapshoot.

(Also, sandwiches are definitely a natural category. Just because a cluster has edge cases does not make it any less of a cluster, even if a bunch of trolls argue about the edge cases. "Tasty ice cream flavors" is also a natural category if we know who the speaker is, which is exactly how humans understand the phrase in practice.)

Part of what I mean here by "already have lots of support" is that there's already a path to improvement on these sorts of problems, not necess