Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

Olive Branch

I think the contest idea is great and aimed at two absolute core alignment problems. I'd be surprised if much comes out of it, as these are really hard problems and I'm not sure contests are a good way to solve really hard problems. But it's worth trying!

Now, a bit of a rant:

Submissions will be judged on a rolling basis by Richard Ngo, Lauro Langosco, Nate Soares, and John Wentworth.

I think this panel looks very weird to ML people. Very quickly skimming the Scholar profiles, it looks like the sum of first-author papers in top ML conferences published by these four people is one (Goal Misgeneralisation by Lauro et al.). The person with the most legible ML credentials is Lauro, who's an early-year PhD student with 10 citations.

Look, I know Richard and he's brilliant. I love many of his papers. I bet that these people are great researchers and can judge this contest well. But if I put myself into the shoes of an ML researcher who's not part of the alignment community, this panel sends a message: "wow, the alignment community has hundreds of thousands of dollars, but can't even find a single senior ML researcher crazy enough to entertain their ideas".

There are plenty of people who understand the alignment problem very well and who also have more ML credentials. I can suggest some, if you want.

(Probably disregard this comment if ML researchers are not the target audience for the contests.)

[-]Sam Bowman3y513

+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular 'core' AI safety research community mean that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.

(To be fair, I think the Inverse Scaling Prize, which I'm helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, bigger/more diverse panel, and use of additional reviewers outside the panel mitigates them at least partially.)

[-]Orpheus163y68

Hastily written; may edit later

Thanks for mentioning this, Jan! We'd be happy to hear suggestions for additional judges. Feel free to email us at akash@alignmentawards.com and olivia@alignmentawards.com.

Some additional thoughts:

We chose judges primarily based on their expertise and (our perception of) their ability to evaluate submissions about goal misgeneralization and corrigibility. Lauro, Richard, Nate, and John ade some of few researchers who have thought substantially about these problems. In particular, Lauro first-authored the first paper about goal misgeneralization and Nate first-authored a foundational paper about corrigibility.
We think the judges do have some reasonable credentials (e.g., Richard works at OpenAI, Lauro is a PhD student at the University of Cambridge, Nate Soares is the Executive Director of a research organization & he has an h-index of 12, as well as 500+ citations). I think the contest meets the bar of "having reasonably well-credentialed judges" but doesn't meet the bar of "having extremely well-credentialed judges (e.g., well-established professors with thousands of citations). I think that's fine.
We got feedback from several ML people before launching. We didn't get feedback that this looks "extremely weird" (though I'll note that research competitions in general are pretty unusual).
I think it's plausible that some people will find this extremely weird (especially people who judge things primarily based on the cumulative prestige of the associated parties & don't think that OpenAI/500 citations/Cambridge are enough), but I don't expect this to be a common reaction.

Some clarifications + quick thoughts on Sam’s points:

The contest isn’t aimed primarily/exclusively at established ML researchers (though we are excited to receive submissions from any ML researchers who wish to participate).
We didn’t optimize our contest to attract established researchers. Our contests are optimized to take questions that we think are at the core of alignment research and present them in a (somewhat less vague) format that gets more people to think about them.
We’re excited that other groups are running contests that are designed to attract established researchers & present different research questions.
All else equal, we think that precise/quantifiable grading criteria & a diverse panel of reviewers are preferable. However, in our view, many of the core problems in alignment (including goal misgeneralization and corrigibility) have not been sufficiently well-operationalized to have precise/quantifiable grading criteria at this stage.

[-]JanB3y*410

This response does not convince me.

Concretely, I think that if I'd show the prize to people in my lab and they actually looked at the judges (and I had some way of eliciting honest responses from them), I'd think that >60% would have some reactions according to what Sam and I described (i.e. seeing this prize as evidence that AI alignment concerns are mostly endorsed by (sometimes rich) people who have no clue about ML; or that the alignment community is dismissive of academia/peer-reviewed publishing/mainstream ML/default ways of doing science; or ... ).

Your point 3.) about the feedback from ML researchers could convince me that I'm wrong, depending on whom exactly you got feedback from and how that looked like.

By the way, I'm highlighting this point in particular not because it's highly critical (I haven't thought much about how critical it is), but because it seems relatively easy to fix.

[-]Koen.Holtman3y1-1

As one of the few AI safety researchers who has done a lot of work on corrigibility, I have mixed feelings about this.

First, great to see an effort that tries to draw more people to working on the corrigibility, because almost nobody is working on it. There are definitely parts of the solution space that could be explored much further.

What I also like is that you invite essays about the problem of making progress, instead of the problem of making more people aware that there is a problem.

However, the underlying idea that meaningful progress is possible by inviting people to work on a 500 word essay, which will then first be judged by 'approximately 10 Judges who are undergraduate and graduate students', seems to be a bit strange. I can fully understand Sam Bowman's comment that this might all look very weird to ML people. What you have here is an essay contest. Calling it a research contest may offend some people who are actual card-carrying researchers.

Also, the more experienced judges you have represent somewhat of an insular sub-community of AI safety researchers. Specifically, I associate both Nate and John with the viewpoint that alignment can only be solved by nothing less than an entire scientific revolution. This is by now a minority opinion inside the AI safety community, and it makes me wonder what will happen to submissions that make less radical proposals which do not buy into this viewpoint.

OK, I can actually help you with the problem of an unbalanced judging panel: I volunteer to join it. If you are interested, please let me know.

Corrigibility is both

a technical problem: inventing methods to make AI more corrigible
a policy problem: forcing people deploying AI to use those methods, even if this will hurt their bottom line, even if these people are careless fools, and even if they have weird ideologies.

Of these two problems, I consider the technical problem to be mostly solved by now, even for AGI.
The big open problem in corrigibility is the policy one. So I'd like to see contest essays that engage with the policy problem.

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs, rather than speculation or gut feelings. Of course, in the AI safety activism blogosphere, almost nobody wants to read or talk about these methods in the papers with the proofs, instead everybody bikesheds the proposals which have been stated in natural language and which have been backed up only by speculation and gut feelings. This is just how a blogosphere works, but it does unfortunately add more fuel to the meme that the technical side of corrigibility is mostly unsolved and that nobody has any clue.

[-]Vivek Hebbar3y30

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here? No need to write anything, just links.

[-]Koen.Holtman3y*41

OK, Below I will provide links to few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.

This list or links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019 trying to find all mathematical papers of interest, but have not done so since.

I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).

Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property is indeed correctly there, or at least there during some part of the agent lifetime, or there under some boundary assumptions.

Not all of the papers below have actual mathematical proofs in them, some of them show correctness by construction. Correctness by construction is superior to having to have proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs.

Here is the list, with the bold headings describing different approaches to corrigibility.

Indifference to being switched off, or to reward function updates

Motivated Value Selection for Artificial Agents introduces Armstrong's indifference methods for creating corrigibility. It has some proofs, but does not completely work out the math of the solution to a this-is-how-to-implement-it level.

Corrigibility tried to work out the how-to-implement-it details of the paper above but famously failed to do so, and has proofs showing that it failed to do so. This paper somehow launched the myth that corrigibility is super-hard.

AGI Agent Safety by Iteratively Improving the Utility Function does work out all the how-to-implement-it details of Armstrong's indifference methods, with proofs. It also goes into the epistemology of the connection between correctness proofs in models and safety claims for real-world implementations.

Counterfactual Planning in AGI Systems introduces a different and more easy to interpret way for constructing a a corrigible agent, and agent that happens to be equivalent to agents that can be constructed with Armstrong's indifference methods. This paper has proof-by-construction type of math.

Corrigibility with Utility Preservation has a bunch of proofs about agents capable of more self-modification than those in Counterfactual Planning. As the author, I do not recommend you read this paper first, or maybe even at all. Read Counterfactual Planning first.

Safely Interruptible Agents has yet another take on, or re-interpretation of, Armstrong's indifference methods. Its title and presentation somewhat de-emphasize the fact that it is about corrigibility, by never even discussing the construction of the interruption mechanism. The paper is also less clearly about AGI-level corrigibility.

How RL Agents Behave When Their Actions Are Modified is another contribution in this space. Again this is less clearly about AGI.

Agents that stop to ask a supervisor when unsure

A completely different approach to corrigibility, based on a somewhat different definition of what it means to be corrigible, is to construct an agent that automatically stops and asks a supervisor for instructions when it encounters a situation or decision it is unsure about. Such a design would be corrigible by construction, for certain values of corrigibility. The last two papers above can be interpreted as disclosing ML designs that also applicable in the context of this stop when unsure idea.

Asymptotically unambitious artificial general intelligence is a paper that derives some probabilistic bounds on what can go wrong regardless, bounds on the case where the stop-and-ask-the-supervisor mechanism does not trigger. This paper is more clearly about the AGI case, presenting a very general definition of ML.

Anything about model-based reinforcement learning

I have yet to write a paper that emphasizes this point, but most model-based reinforcement learning algorithms produce a corrigible agent, in the sense that they approximate the ITC counterfactual planner from the counterfactual planning paper above.

Now, consider a definition of corrigibility where incompetent agents (or less inner-aligned agents, to use a term often used here) are less corrigible because they may end up damaging themselves, their stop buttons. or their operator by being incompetent. In this case, every convergence-to-optimal-policy proof for a model-based RL algorithm can be read as a proof that its agent will be increasingly corrigible under learning.

CIRL

Cooperative Inverse Reinforcement Learning and The Off-Switch Game present yet another corrigibility method with enough math to see how you might implement it. This is the method that Stuart Russell reviews in Human Compatible. CIRL has a drawback, in that the agent becomes less corrigible as it learns more, so CIRL is not generally considered to be a full AGI-level corrigibility solution, not even by the original authors of the papers. The CIRL drawback can be fixed in various ways, for example by not letting the agent learn too much. But curiously, there is very little followup work from the authors of the above papers, or from anybody else I know of, that explores this kind of thing.

Commanding the agent to be corrigible

If you have an infinitely competent superintelligence that you can give verbal commands to that it will absolutely obey, then giving it the command to turn itself into a corrigible agent will trivially produce a corrigible agent by construction.

Giving the same command to a not infinitely competent and obedient agent may give you a huge number of problems instead of course. This has sparked endless non-mathematical speculation, but in I cannot think of a mathematical paper about this that I would recommend.

AIs that are corrigible because they are not agents

Plenty of work on this. One notable analysis of extending this idea to AGI-level prediction, and considering how it might produce non-corrigibility anyway, is the work on counterfactual oracles. If you want to see a mathematically unambiguous presentation of this, with some further references, look for the section on counterfactual oracles in the Counterfactual Planning paper above.

Myopia

Myopia can also be considered to be feature that creates or improves or corrigibility. Many real-world non-AGI agents and predictive systems are myopic by construction: either myopic in time, in space, or in other ways. Again, if you want to see this type of myopia by construction in a mathematically well-defined way when applied to AGI-level ML, you can look at the Counterfactual Planning paper.

[-]Towards_Keeperhood3y10

Hi Koen, thank you very much for writing this list!

I must say I'm skeptical that the technical problem of corrigibility as I see it is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah that's not at all a clear definition yet, I'm still deconfusing myself about that, and I'll likely publish a post clarifying the problem how I see it within the next month.)

So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I'd expect that the other solutions here may likewise only give you corrigible agents that cannot do new very impressive things (or if they can they might still kill us all).

But I may be wrong. I probably only have time to read one paper. So: What would you say is the strongest result we have here? If I looked at on paper/post and explained why this isn't a solution to corrigibility as I see it, for what paper would it be most interesting for you to see what I write? (I guess I'll do it sometime this week of you write me back, but no promises.)

Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (Best take an example task that is preferably a bit too hard for humans to do. That makes it harder to reason about it, but I think this is where the difficulty is.)

[-]Koen.Holtman3y10

Hi Simon! You are welcome! By the way, I very much want to encourage you to be skeptical and make up your own mind.

I am guessing that by mentioning consequentialist, you are referring to this part of Yudkowsky's list of doom:

Corrigibility is anti-natural to consequentialist reasoning

I am not sure how exactly Yudkowsky is defining the terms corrigibility or consequentalist here, but I might actually be agreeing with him on the above statement, depending on definitions.

I suggest you read my paper Counterfactual Planning in AGI Systems, because it is the most accessible and general one, and because it presents AGI designs which can be interpreted as non-consequentualist.

I could see consequentialist AGI being stably corrigible if it is placed in a stable game-theoretical environment where deference to humans literally always pays as a strategy. However, many application areas for AI or potential future AGI do not offer such a stable game-theoretical environment, so I feel that this technique has very limited applicability.

If we use the 2015 MIRI paper definition of corrigibility, the alignment tax (the extra engineering and validation effort needed) for implementing corrigibility in current-generation AI systems is low to non-existent. The TL;DR here is: avoid using a bunch of RL methods that you do not want to use anyway when you want any robustness or verifiability. As for future AGI, the size of the engineering tax is open to speculation. My best guess is that future AGI will be built, if ever, by leveraging ML methods that still resemble world model creation by function approximation, as opposed to say brain uploading. Because of this, and some other reasons, I estimate a low safety engineering tax to achieve basic corrigibility.

Other parts of AGI alignment may be very expensive. e.g. the part of actually monitoring an AGI to make sure its creativity is benefiting humanity, instead of merely finding and exploiting loopholes in its reward function that will hurt somebody somewhere. To the extent that alignment cannot be cheap, more regulation will be needed to make sure that operating a massively unaligned AI will always be more expensive for a company to do than operating a mostly aligned AI. So we are looking at regulatory instruments like taxation, fines, laws that threaten jail time, and potentially measures inside the semiconductor supply chain, all depending on what type of AGI will become technically feasible, if ever.

[-]Towards_Keeperhood3y00

Thank you! I'll likely read your paper and get back to you. (Hopefully within a week.)

From reading you comment my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. Like, for e.g. achieving something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that's how Yudkowsky means it, but not sure if that's what most people mean when they use the term.) (Though not that this implies that you need so much consequentialism that we won't be able to shut down the AGI. But as I see it a theoretical solution to corrigibility needs to deal with consequentialism. I haven't looked into your paper yet, so it's well possible that my comment here might appear misguided.) E.g. if we just built a gigantic transformer and train it on all human knowledge (and say we have a higher sample efficiency or so), it is possible that it can do almost everything humans can do. But it won't be able to just one-shot solve quantum gravity or so when we give it the prompt "solve quantum gravity". There is no runtime updating/optimization going on, i.e. the transformer is non-consequentialist. All optimization happened through the training data or gradient descent. Either the human training data was already sufficient to encode a solution to quantum gravity in the patterns of the transformer, or it wasn't. It is theoretically possible that the transformer learns a bit deeper underlying patterns than humans have (though I do not expect that from sth like the transformer architecture), and is so able to generalize a bit further than humans. But it seems extremely unlikely that it learned so deep understanding to already have the solution to quantum gravity encoded, although it was never explicitly trained to learn that and just read physics papers. The transformer might be able to solve quantum gravity if it can recursively query itself to engineer better prompts, or if it can give itself feedback which is then somehow converted into gradient descent updates and then try multiple times. But in those cases there is consequentialist reasoning again. The key point: Consequentialism becomes necessary when you go beyond human level.

Out of interest, how much do you agree with what I just wrote?

[-]Koen.Holtman3y10

I think I agree to most of it: I agree that some form of optimization or policy search is needed to get many things you want to use AI for. But I guess you have to read the paper to find out the exact subtle way in which the AGIs inside can be called non-consequentialist. To quote Wikipedia:

In ethical philosophy, consequentialism is a class of normative, teleological ethical theories that holds that the consequences of one's conduct are the ultimate basis for judgment about the rightness or wrongness of that conduct.

I do not talk about this in the paper, but in terms of ethical philosophy, the key bit about counterfactual planning is that it asks: judge one's conduct by its consequences in what world exactly? Mind you, the problem considered is that we have to define the most appropriate ethical value system for a robot butler, not what is most appropriate for a human.

[-]Vivek Hebbar3y*20

ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation

Update: I started reading your paper "Corrigibility with Utility Preservation".^[1] My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,^[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems. Nonetheless, it's a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.^[3] Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists). In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).

So looking at your claim that "the technical problem [is] mostly solved", this may or may not be true for the narrow sense (like "corrigibility as a theoretical outer-objective problem in formally-specified environments"), but seems false and misleading for the broader practical sense ("knowing how to make an AGI corrigible in real life").^[4]

Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt^[5]:

"In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent's decision procedure] to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model."

^{^}
Btw, your writing is admirably concrete and clear.
Errata: Subscripts seem to broken on page 9, which significantly hurts readability of the equations. Also there is a double-typo "I this paper, we the running example of a toy universe" on page 4.
^{^}
Assuming the idea is correct
^{^}
Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?
^{^}
I'm not necessarily accusing you of any error (if the contest is fixated on the utility function version), but it was misleading to me as someone who read your comment but not the contest details.
^{^}
Portions in [brackets] are insertions/replacements by me

[-]Koen.Holtman3y*10

Corrigibility with Utility Preservation is not the paper I would recommend you read first, see my comments included in the list I just posted.

To comment on your quick thoughts:

My later papers spell out the ML analog of the solution in `Corrigibility with' more clearly.
On your question of Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?: Given how re-tellings in the blogosphere work to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results of MIRI exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one approach did not work, but does not prove anything for other approaches. What they have is that 2022 Yudkowsky is on record expressing strongly held beliefs that corrigibility is very very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, by formalising it as producing a full 100% safety, no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind or reasoning is at the core of Yudkowsky's arguments for pessimism.
On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can then lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. Partly this is because I am optimistic about future competent regulation of AGI-level AI by humans successfully banning certain dangerous AGI architectures outright, much more optimistic than Yudkowsky is.
I do not think I fully support my 2019 statement anymore that 'Part of this conclusion [of Soares et al. failing to solve corrigibility] is due to the use of a Platonic agent model'. Nowadays, I would say that Soares et al did not succeed in its aim because it used a conditional probability to calculate what should have been calculated by a Pearl counterfactual. The Platonic model did not figure strongly into it.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

16

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

16

What are the contests?

What types of submissions are you interested in?

Why are you running these contests?

Who can participate?

What if I’ve never done AI alignment research before?

How can I help?

Outlook