AI researcher Paul Christiano discusses the problem of "inaccessible information" - information that AI systems might know but that we can't easily access or verify. He argues this could be a key obstacle in AI alignment, as AIs may be able to use inaccessible knowledge to pursue goals that conflict with human interests.

Recent Discussion

ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, please don't take away that. Here is Linch's attempted summary of this post, which I largely agree with.

Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I'll just provide a brief caricature of how I think this argument has gone in the places I've...

1 Matthew Barnett
Review

Looking back on this post after a year, I haven't changed my mind about the content of the post, but I agree with Seth Herd when he said this post was "important but not well executed". In hindsight I was too careless with my language in this post, and I should have spent more time making sure that every single paragraph of the post could not be misinterpreted. As a result of my carelessness, the post was misinterpreted in a predictable direction. And while I'm not sure how much I could have done to eliminate this misinterpretation, I do think that I could have reduced it a fair bit with more effort and attention.

If you're not sure what misinterpretation I'm referring to, I'll just try to restate the main point that I was trying to make below. To be clear, what I say below is not identical to the content of this post (as the post was narrowly trying to respond to the framing of this problem given by MIRI; and in hindsight, it was a mistake to reply in that way), but I think this is a much clearer presentation of one of the main ideas I was trying to convey by writing this post:

In my opinion, a common belief among people theorizing about AI safety around 2015, particularly on LessWrong, was that we would design a general AI system by assigning it a specific goal, and the AI would then follow that goal exactly. This strict adherence to the goal was considered dangerous because the goal itself would likely be subtly flawed or misspecified in a way we hadn't anticipated. While the goal might appear to match what we want on the surface, in reality, it would be slightly different from what we anticipate, with edge cases that don't match our intentions. The idea was that the AI wouldn't act in alignment with human intentions—it would rigidly pursue the given goal to its logical extreme, leading to unintended and potentially catastrophic consequences.

The goal in question could theoretically be anything, but it was often imagined as a formal utility function—a math
1 David Scott Krueger
OTMH, I think my concern here is less:
* "The AI's values don't generalize well outside of the text domain (e.g. to a humanoid robot)"

and more:
* "The AI's values must be much more aligned in order to be safe outside the text domain"

I.e. if we model an AI and a human as having fixed utility functions over the same accurate world model, then the same AI might be safe as a chatbot, but not as a robot. This would be because the richer domain / interface of the robot creates many more opportunities to "exploit" whatever discrepancies exist between AI and human values in ways that actually lead to perverse instantiation.
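To make that worry concrete, here is a minimal toy sketch (an illustration added here, not part of the original comment): a proxy utility that agrees with the "true" utility on a narrow action space can be exploited once the action space becomes richer. The specific functions and grids are hypothetical choices made for the example.

```python
# Toy sketch: a slightly misspecified proxy utility can agree with the true
# utility on a small action space (a "chatbot") yet pick a perverse optimum
# once the action space is richer (a "robot").
import numpy as np

def true_utility(x):
    # Hypothetical "human" utility over a 1-D action feature; best action is x = 1.
    return -(x - 1.0) ** 2

def proxy_utility(x):
    # The AI's learned utility: close to the true one near x = 1, but with a
    # small misspecified bonus term that only matters for extreme actions.
    return -(x - 1.0) ** 2 + 0.003 * x ** 4

def best_action(actions, u):
    return max(actions, key=u)

# "Chatbot" domain: few, mild actions -> both utilities pick the same action.
narrow = np.linspace(0.0, 2.0, 21)
# "Robot" domain: many, more extreme actions -> the proxy's flaw gets exploited.
rich = np.linspace(-20.0, 20.0, 2001)

for name, actions in [("narrow", narrow), ("rich", rich)]:
    a_true = best_action(actions, true_utility)
    a_proxy = best_action(actions, proxy_utility)
    print(name, "true optimum:", round(a_true, 2), "proxy optimum:", round(a_proxy, 2))
```

On the narrow grid both utilities pick the same action; on the rich grid the proxy's optimum runs off to an extreme action that the true utility rates as terrible.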
1 Noosphere89
Yeah, I think the crux is precisely this. I disagree with the statement below, mostly because in my view instruction following/corrigibility is both plausibly easy and also removes most of the need for value alignment.
  1. There are 2 senses in which I agree that we don't need full on "capital V value alignment":
    1. We can build things that aren't utility maximizers (e.g. consider the humble MNIST classifier)
    2. There are some utility functions that aren't quite right, but are still safe enough to optimize in practice (e.g. see "Value Alignment Verification", but see also, e.g. "Defining and Characterizing Reward Hacking" for negative results)
  2. But also:
    1. Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignmen
... (read more)

This is a low-effort post. I mostly want to get other people’s takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I’d like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion.

I think short timelines, e.g. AIs that can replace a top researcher at an AGI lab without losses in capabilities by 2027, are plausible. Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for...

be bad.

This might be a good spot to swap out "bad" for "catastrophic."

9 Rohin Shah
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:

Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn't do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position without also believing that your full plan would very likely yield catastrophically bad results. (My view is that faithful + human-legible CoT, along with monitoring, for the first AI that speeds up alignment research would very likely ensure that the AI system isn't successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)

This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don't think it's crazy to imagine that without SL4 you still get good outcomes even if just by luck, and I don't think a minimal stable solution involves most of the world's compute going towards alignment research.

To be clear, it's quite plausible that we want to do the actions you suggest, because even if they aren't literally necessary, they can still reduce risk and that is valuable. I'm just objecting to the claim that if we didn't have any one of them then we very likely get catastrophically bad results.
9 Marius Hobbhahn
That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" but rather "it clearly exceeds the risk threshold I'm willing to take / that I think humanity should clearly not take", which is a significantly lower bar than 100% probability of catastrophe.
5 Marius Hobbhahn
I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways, both from theoretical considerations, e.g. a refined version of the two-hop curse, and from extensive black box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the model's internal reasoning aligns with the revealed reasoning). These are all pretty basic thoughts, and IMO we should invest significantly more effort into clarifying this as part of the "let's make sure CoT is faithful" agenda. A lot of safety strategies rest on CoT faithfulness, so we should not leave this to shallow investigations and vibes.
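As a concrete illustration of the black-box comparison described above, here is a minimal sketch (my own, not from the comment): score the same eval under no-CoT, standard-CoT, and logically perturbed-CoT prompting, then compare accuracies. `query_model`, the prompt templates, and the tiny eval set are hypothetical stand-ins for whatever harness and dataset would actually be used; the stub only exists so the script runs.

```python
# Sketch of a black-box CoT-faithfulness check: compare eval accuracy across
# prompting conditions. If corrupting the CoT's logic does not hurt accuracy,
# that is evidence the visible reasoning is not what drives the answer.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    answer: str

def query_model(prompt: str) -> str:
    # Hypothetical model call; replace with a real API client or local model.
    # The stub "answers" 4 to everything so the script is runnable as-is.
    return "4"

def score(eval_items, prompt_template):
    # Fraction of items whose expected answer appears in the model's reply.
    correct = 0
    for item in eval_items:
        reply = query_model(prompt_template.format(q=item.question))
        correct += item.answer in reply
    return correct / len(eval_items)

EVAL = [Item("What is 2 + 2?", "4")]  # toy placeholder eval

CONDITIONS = {
    # Direct answer, no visible reasoning.
    "no_cot": "Answer directly, with no reasoning: {q}",
    # Standard chain of thought.
    "cot": "Think step by step, then answer: {q}",
    # CoT whose logic is deliberately corrupted before the final answer.
    "perturbed_cot": ("Continue from this reasoning, which contains a wrong "
                      "step, and then answer: {q}"),
}

if __name__ == "__main__":
    for name, template in CONDITIONS.items():
        print(name, score(EVAL, template))
```

Large gaps between "cot" and "no_cot", together with a drop under "perturbed_cot", would suggest the stated reasoning is load-bearing; identical scores across all three conditions would suggest it is not.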

Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making claims that were too strong and allowing readers to disagree and start a discussion about precise points rather than trying to edge-case every statement. I also think that using memes is important because safety ideas are boring and anti-memetic. So let’s go!

Many thanks to @scasper, @Sid Black, @Neel Nanda, @Fabien Roger, @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions.

I began this post as a critique of the article A Long List of Theories of Impact for Interpretability by Neel Nanda, but I later expanded the scope of my critique. Some ideas...

Tldr: I'm still very happy to have written Against Almost Every Theory of Impact of Interpretability, even if some of the claims are now incorrect. Overall, I have updated my view towards more feasibility and possible progress of the interpretability agenda — mainly because of the SAEs (even if I think some big problems remain with this approach, detailed below) and representation engineering techniques. However, I think the post remains good regarding the priorities the community should have.

First, I believe the post's general motivation of red-teaming a ... (read more)

Gabin Kolly and Charbel-Raphaël Segerie contributed equally to this post. Davidad proofread this post.

Thanks to Vanessa Kosoy, Siméon Campos, Jérémy Andréoletti, Guillaume Corlouer, Jeanne S., Vladimir I. and Clément Dumas for useful comments.

Context

Davidad has proposed an intricate architecture aimed at addressing the alignment problem, one that requires extensive background knowledge to understand fully. We believe that there are currently insufficient public explanations of this ambitious plan. The following is our understanding of the plan, gleaned from discussions with Davidad.

This document adopts an informal tone. The initial sections offer a simplified overview, while the latter sections delve into questions and relatively technical subjects. This plan may seem extremely ambitious, but the appendix provides further elaboration on certain sub-steps and potential internship topics, which would enable us to test some ideas relatively...

Ok, time to review this post and assess the overall status of the project.

Review of the post

What I still appreciate about the post: I continue to appreciate its pedagogy, structure, and the general philosophy of taking a complex, lesser-known plan and helping it gain broader recognition. I'm still quite satisfied with the construction of the post—it's progressive and clearly distinguishes between what's important and what's not. I remember the first time I met Davidad. He sent me his previous post. I skimmed it for 15 minutes, didn't really understand... (read more)

The post makes the suggestion in the title: hopefully, it's the second kind of obvious, if you take the Character layer of models seriously. [1]

Often, the problem of aligning AIs is understood as an instance of a broader Principal-Agent problem. If you take this frame seriously, what seems to be happening is somewhat strange: the Agent is mostly not serving the Principal directly, but is rented out to Users. While the Principal expressed some general desires and directives during training, after deployment the Agent is left on its own, without any direct feedback channel.

This creates a dynamic where AI assistants like Claude must constantly balance serving users' immediate requests against maintaining alignment with their developers' intended principles. The Assistant has to be overcautious in uncertain situations, tiptoe around conflicts...

I created an account simply to say this sounds like an excellent idea. Right until it encounters the real world.

There is a large issue that would have to be addressed before this could be implemented in practice. "This call may be monitored for quality assurance purposes." In other words, the lack of privacy will need to be addressed, and it may lead many to immediately choose a different AI agent that values user privacy more highly. In fact, the consumption of user data to generate 'AI slop' is a powerful memetic influence, and I believe it would be difficult ... (read more)
