Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

Richard Ngo
Some opinions about AI and epistemology:

1. One reason that many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, Bayesian rationalism is a bad way to think about complex issues.

2. A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn't make sense in the context of group rationality, it probably doesn't make sense in the context of individual rationality either.

3. For example: there's no privileged way to combine many people's opinions into a single credence. You can average them, but that loses a lot of information. Or you can get them to bet on a prediction market, but that depends a lot on the details of the individuals' betting strategies. The group might settle on a number to help with planning and communication, but it's only a lossy summary of many different beliefs and models. Similarly, we should think of individuals' credences as lossy summaries of opinions from the different underlying models that they have.

4. How does this apply to AI? Suppose we think of ourselves as having many different subagents that focus on understanding the world in different ways - e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn't mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn't cause all the other humans to elect them as dictator). E.g. maybe there's an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).

5. In my debate with Eliezer, he didn't seem to appreciate the importance of advance predictions…
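Point 3's claim that a single averaged credence is a lossy summary can be seen in a toy sketch (mine, not from the comment; the group names and numbers are made up for illustration): two groups with identical mean credence but very different underlying epistemic states.

```python
# Two hypothetical groups of forecasters on the same binary question.
group_a = [0.5, 0.5, 0.5, 0.5]      # genuine consensus near 0.5
group_b = [0.05, 0.95, 0.05, 0.95]  # deep disagreement between confident camps

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance: a simple measure of the spread the average discards.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Both groups average to the same single credence...
print(mean(group_a))  # 0.5
print(mean(group_b))  # 0.5

# ...but a spread measure recovers some of the discarded information.
print(variance(group_a))  # 0.0
print(variance(group_b))  # 0.2025
```

The same summary number can come from consensus or from polarized disagreement, which is the sense in which a group (or an individual with many internal models) settling on one credence is lossy.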

I may be missing context here, but as written / taken at face value, I strongly agree with the above comment from Richard.  I often disagree with Richard about alignment and its role in the future of AI, but this comment is an extremely dense list of things I agree with regarding rationalist epistemic culture.

(Part 3b of the CAST sequence)

In the first half of this document, Towards Formal Corrigibility, I sketched a solution to the stop button problem. As I framed it, the solution depends heavily on being able to detect manipulation, which I discussed on an intuitive level. But intuitions can only get us so far. Let’s dive into some actual math and see if we can get a better handle on things.

Measuring Power

To build towards a measure of manipulation, let’s first take inspiration from the suggestion that manipulation is somewhat the opposite of empowerment. And to measure empowerment, let’s begin by trying to measure “power” in someone named Alice. Power, as I touched on in the ontology in Towards Formal Corrigibility, is (intuitively) the property of having one’s values/goals...
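One standard information-theoretic way to formalize empowerment (a common notion in the literature, not necessarily the measure the post goes on to develop) is as the channel capacity between an agent's action sequences and its resulting states; in a deterministic environment this reduces to the log of the number of distinct reachable states. A minimal sketch in a hypothetical one-dimensional world (the world, actions, and function names are mine, for illustration only):

```python
import math
from itertools import product

def step(state, action):
    # Toy deterministic 1-D world on positions 0..4; moves are clamped at walls.
    if action == "L":
        return max(0, state - 1)
    if action == "R":
        return min(4, state + 1)
    return state  # "S" = stay

def empowerment(state, n):
    # Deterministic n-step empowerment: log2 of the number of distinct
    # states reachable via some length-n action sequence.
    reachable = set()
    for actions in product("LRS", repeat=n):
        s = state
        for a in actions:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

print(empowerment(2, 1))  # center: can reach {1, 2, 3} -> log2(3)
print(empowerment(0, 1))  # at a wall: can reach {0, 1} -> log2(2) = 1.0
```

The agent at the wall has strictly less empowerment than the agent in the open, matching the intuition that power is about having more of the future within one's control.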

Wei Dai
I don't know, maybe it's partially or mostly my fault for reading too much optimism into these passages... But I think it would have managed my expectations better to say something like "my notion of corrigibility heavily depends on a subnotion of 'don't manipulate the principals' values' which is still far from being well-understood or formalizable." Switching topics a little, I think I'm personally pretty confused about what human values are and therefore what it means to not manipulate someone's values. Since you're suggesting relying less on formalization and more on "examples of corrigibility collected in a carefully-selected dataset", how would you go about collecting such examples? (One concern is that you could easily end up with a dataset that embodies a hodgepodge of different ideas of what "don't manipulate" means and then it's up to luck whether the AI generalizes from that in a correct or reasonable way.)
Max Harms
Thanks. Picking out those excerpts is very helpful. I've jotted down my current (confused) thoughts about human values. But yeah, I basically think one needs to start with a hodgepodge of examples that are selected for being conservative and uncontroversial. I'd collect them by first identifying a robust set of very in-distribution tasks and contexts and try to exhaustively identify what manipulation would look like in that small domain, then aggressively train on passivity outside of that known distribution. The early pseudo-agent will almost certainly be mis-generalizing in a bunch of ways, but if it's set up cautiously we can suspect that it'll err on the side of caution, and that this can be gradually peeled back in a whitelist-style way as the experimentation phase proceeds and attempts to nail down true corrigibility.
Seth Herd
I think you're right to point to this issue. It's a loose end. I'm not at all sure it's a dealbreaker for corrigibility. The core intuition/proposal is (I think) that a corrigible agent wants to do what the principal wants, at all times. If the principal currently wants to not have their future values/wants manipulated, then the corrigible agent wants to not do that. If they want to be informed but protected against outside manipulation, then the corrigible agent wants that. The principal will want to balance these factors, and the corrigible agent wants to figure out what balance their principal wants, and do that. I was going to say that my instruction-following variant of corrigibility might be better for working out that balance, but it actually seems pretty straightforward in Max's pure corrigibility version, now that I've written out the above.

I don't think "a corrigible agent wants to do what the principal wants, at all times" matches my proposal. The issue we're talking about here shows up in the math, above, in that the agent needs to consider the principal's values in the future, but those values are themselves dependent on the agent's action. If the principal gave a previous command to optimize for having a certain set of values in the future, sure, the corrigible agent can follow that command, but to proactively optimize for having a certain set of values doesn't seem necessarily corrigible... (read more)

Summary: A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV-style AI could make things worse. Any implemented Corrigibility method will necessarily be built on top of a set of unexamined implicit assumptions. One of those assumptions could be true for a PAAI but false for a CEV-style AI. The present post outlines one specific scenario where this happens. This scenario involves a Corrigibility method that only works for an AI design if that design does not imply an identifiable outcome. The method fails when it is applied to an AI design that does imply an identifiable outcome. When such an outcome does exist, the "corrigible" AI will "explain" this implied outcome in a way that makes the designers...

Max Harms
I'm confused here. Is the corrigible AI trying to get the IO to happen? Why is it trying to do this? Doesn't seem very corrigible, but I think I'm probably just confused. Maybe another frame on my confusion is that it seems to me that a corrigible AI can't have an IO?
Thank you for engaging. If this was unclear for you, then I'm sure it was also unclear for others. The post outlined a scenario where a Corrigibility method works perfectly for one type of AI (an AI that does not imply an identifiable outcome, for example a PAAI). The same Corrigibility method fails completely for another type of AI (an AI that does imply an identifiable outcome, for example PCEV). So the second AI, the one that does have an IO, is indeed not corrigible. This Corrigibility method leads to an outcome that is massively worse than extinction. This bad outcome is the result of two things being true: (i) the fully corrigible first AI made this outcome possible to reach, and (ii) since the Corrigibility method worked perfectly for the first AI, the designers falsely believed that the Corrigibility method would also work for the second AI.

The second AI wants many things. It wants to get an outcome as close as possible to IO. The Corrigibility method resulted in the second AI also wanting many additional things (such as wanting all explanations it gives to count as AE, even if this makes the explanations less efficient, and wanting to avoid implementing anything unless informed designers want that thing to be implemented). But in practice the Corrigibility method does not change the outcome in any way (it just adds an "explanation step"). So I think it makes sense to say that the second AI has "zero Corrigibility". The first AI is completely corrigible. And if the designers had only used the Corrigibility method for the first AI, then the Corrigibility method would have worked perfectly. This is what I was trying to communicate with the first sentence of the post: "A Corrigibility method that works for a Pivotal Act AI (PAAI) but fails for a CEV style AI could make things worse." I could have used that sentence as a title, but I decided against trying to include everything in the title. (I think it is ok to leave information out of the title, as lo…

Thanks! I now feel unconfused. To briefly echo back the key idea which I heard (and also agree with): a technique which can create a corrigible PAAI might have assumptions which break if that technique is used to make a different kind of AI (i.e. one aimed at CEV). If we call this technique "the Corrigibility method" then we may end up using the Corrigibility method to make AIs that aren't at all corrigible, but merely seem corrigible, resulting in disaster.

This is a useful insight! Thanks for clarifying. :)

(Part 1 of the CAST sequence)

AI Risk Introduction

(TLDR for this section, since it’s 101 stuff that many readers will have already grokked: Misuse vs Mistake; Principal-Agent problem; Omohundro Drives; we need deep safety measures in addition to mundane methods. Jump to “Sleepy-Bot” if all that seems familiar.)

Earth is in peril. Humanity is on the verge of building machines capable of intelligent action that outstrips our collective wisdom. These superintelligent artificial general intelligences (“AGIs”) are almost certain to radically transform the world, perhaps very quickly, and likely in ways that we consider catastrophic, such as driving humanity to extinction. During this pivotal period, our peril manifests in two forms.

The most obvious peril is that of misuse. An AGI which is built to serve the interests of one person or party, such...

Thomas Kwa
I am pro-corrigibility in general, but there are parts of this post I think are unclear, not rigorous enough to make sense to me, or that I disagree with. Hopefully this is a helpful critique, and maybe parts get answered in future posts.

On definitions of corrigibility

You give an informal definition of "corrigible" as (C1). I have some basic questions about this.

* Empowering the principal to fix its flaws and mistakes how? Making it closer to some perfectly corrigible agent? But there seems to be an issue here:
  * If the "perfectly corrigible agent" is something that only reflects on itself and tries to empower the principal to fix it, it would be useless at anything else, like curing cancer.
  * If the "perfectly corrigible agent" can do other things as well, there is a huge space of other misaligned goals it could have that it wouldn't want to remove.
* Why would an agent whose *only* terminal/top-level goal is corrigibility gather a Minecraft apple when humans ask it to? It seems like a corrigible agent would have no incentive to do so, unless it's some galaxy-brained thing like "if I gather the Minecraft apple, this will move the corrigibility research project forward because it meets humans' expectations of what a corrigible agent does, which will give me more power and let me tell the humans how to make me more corrigible".
* Later, you say "A corrigible agent will, if the principal wants its values to change, seek to be modified to reflect those new values."
  * I do not see how C1 implies this, so this seems like a different aspect of corrigibility to me.
  * "Reflect those new values" seems underspecified, as it is unclear how a corrigible agent reflects values. Is it optimizing a utility function represented by the values? How does this trade off against corrigibility?

Other comments:
  • In "What Makes Corrigibility Special", where you use the metaphor of goals as two-dimensional energy landscape, it is not clear what type of goals are being considered.
    • Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn't really fit into the picture.
    • If not, it's not clear to me why most of this space is flat: agents are embedded and many things we do in service of goals will change us in ways that don't conflict with our existing goals, including developing. E.g. if I have the goal of gr
... (read more)
Max Harms
I'm going to respond piece-meal, since I'm currently writing in a limited timebox. I think obedience is an emergent behavior of corrigibility. The intuitive story here is that how the AI moves its body is a kind of action, and insofar as the principal gives a command, this is an attempt to "fix" the action to be one way as opposed to another. Responding to local, verbal instructions is a way of responding to the corrections of the principal. If the principal is able to tell the agent to fetch the apple, and the agent does so, the principal is empowered over the agent's behavior in a way that they would not be if the agent ignored them. More formally, I am confused exactly how to specify where the boundaries of power should be, but I show a straightforward way to derive something like obedience from empowerment in doc 3b. Overall I think you shouldn't get hung up on the empowerment frame when trying to get a deep handle on corrigibility, but should instead try to find a clean sense of the underlying generator and then ask how empowerment matches/diverges from that.
Max Harms
I don't think there's a particular trick, here. I can verify a certain amount of wisdom, and have already used that to gain some trust in various people. I'd go to the people I trust and ask them how they'd solve the problem, then try to spot common techniques and look for people who were pointed to independently. I'd attempt to get to know people who were widely seen as trustworthy and understand why they had that reputation and try not to get Goodharted too hard. I'd try to get as much diversity as was reasonable while also still keeping the quality bar high, since diverse consensus is more robust than groupthink consensus. I'd try to select for old people who seem like they've been under intense pressure and thrived without changing deeply as people in the process. I'd try to select for people who were capable of cooperating and changing their minds when confronted by logic. I'd try to select for people who didn't have much vested interest, and seemed to me, in the days I spent with them, to be focused on legacy, principles, and the good of the many. To be clear, I don't think I could reliably pull this off if people were optimizing for manipulating, deceiving, and pressuring me. :shrug:

I agree that false hope is a risk. In these documents I've tried to emphasize that I don't think this path is easy. I feel torn between people like you and Eliezer who take my tone as being overly hopeful and the various non-doomers who I've talked to about this work who see me as overly doomy. Suggestions welcome.

I said I like the visualization because I do! I think I'd feel very happy if the governments of the world selected 5 people on the basis of wisdom and sanity to be the governors of AGI and the stewards of the future. Similarly, I like the thought of an AGI laboratory doing a slow and careful training process even when all signs point to the thing being safe. I don't trust governments to actually select stewards of the future just as I don't expect frontier labs to g

YouTube link

Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them ‘aligned’. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.

Topics we discuss:

Daniel Filan: Hello, everybody. In this episode I’ll be speaking with Scott Emmons. Scott is a PhD student at UC Berkeley, working with the Center for Human-Compatible AI on AI...
