Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Popular Comments

Recent Discussion

The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so.  However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal

...
9Daniel Kokotajlo
Good point re 2. Re 1, meh, still seems like a meta-argument to me, because when I roll out my mental simulations of the ways the future could go, it really does seem like my If... condition obtaining would cut out about half of the loss-of-control ones.

Re 3, point by point:

  1. AISIs existing vs. not: less important; I feel like this changes my p(doom) by more like 10-20% rather than 50%.
  2. Big names coming out: idk, this also feels like maybe 10-20% rather than 50%.
  3. I think Anthropic winning the race would be a 40% thing maybe, but being a runner-up doesn't help so much; but yeah, p(anthropicwins) has gradually gone up over the last three years...
  4. Trump winning seems like a smaller deal to me.
  5. Ditto for Elon.
  6. Not sure how to think about logical updates, but yeah, probably this should have swung my credence around more than it did.
  7. ? This was on the mainline path basically and it happened roughly on schedule.
  8. Takeoff speeds matter a ton; I've made various updates, but nothing big and confident enough to swing my credence by 50% or anywhere close. Hmm. But yeah, I agree that takeoff speeds matter more.
  9. Picture here hasn't changed much in three years.
  10. Ditto.

OK, so I think I directionally agree that my p(doom) should have been oscillating more than it in fact did over the last three years (if I take my own estimates seriously). However, I don't go nearly as far as you; most of the things you listed are either (a) imo less important, or (b) things I didn't actually change my mind on over the last three years, such that even though they are very important, my p(doom) shouldn't have been changing much.

I agree with everything except the last sentence -- my claim took this into account; I was specifically imagining something like this playing out and thinking 'yep, seems like this kills about half of the loss-of-control worlds'.

I agree that's a stronger claim than I was making. However, part of my view here is that the weaker claim I did make has a
1Vladimir Mikulik
The reasoning for the indistinguishability criterion is that it’s a precaution against the model sandbagging when used as the red-team attack model. If the model can tell it’s in a control eval, it could pretend to be worse at attacking, making us think the control measures are better than they actually are. If the red-team elicitation is robust to this, I agree this isn’t necessary. (See the note below [P2.4].)
5Buck Shlegeris
I don't think this makes any sense. How are you hoping to get the model to attack except by telling it that it's in a control evaluation and you want it to attack? It seems that you are definitely going to have to handle the sandbagging.

FWIW I agree with you and wouldn't put it the way it is in Roger's post. Not sure what Roger would say in response.
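For concreteness, here is a minimal toy sketch of the kind of control evaluation under discussion. All names and numbers (trusted_monitor, attack_policy, the detection rates) are invented for illustration and are not any lab's actual methodology; the point is just that the safety estimate depends on the red-team attack policy faithfully playing the role of a scheming model.

```python
import random


def trusted_monitor(task, action, rng):
    """Toy blue-team protocol: a stand-in 'trusted monitor' that flags suspicious actions imperfectly."""
    if action["is_attack"]:
        return "flag" if rng.random() < 0.7 else "allow"   # imperfect detection rate (invented)
    return "flag" if rng.random() < 0.05 else "allow"      # occasional false positive (invented)


def attack_policy(task, rng, attack_rate=0.2):
    """Toy red-team attack policy standing in for a scheming untrusted model."""
    return {"is_attack": rng.random() < attack_rate}


def control_eval(protocol, attacker, n_trials=10_000, seed=0):
    """Estimate safety as P(attack is caught | attack attempted) under the protocol."""
    rng = random.Random(seed)
    caught = attempted = 0
    for _ in range(n_trials):
        action = attacker(task=None, rng=rng)
        if action["is_attack"]:
            attempted += 1
            if protocol(None, action, rng) == "flag":
                caught += 1
    return caught / attempted if attempted else 1.0


print(control_eval(trusted_monitor, attack_policy))
# A sandbagging attack policy -- one that attacks less often, or less competently,
# whenever it infers it is inside an evaluation -- would inflate this estimate
# relative to what the same protocol achieves against a genuinely scheming model.
```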

3Thomas Kwa
What's the most important technical question in AI safety right now?

In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:

  • How to evaluate whether models should be trusted or untrusted: currently I don't have a good answer and this is bottlenecking the efforts to write concrete control proposals.
  • How AI control should interact with AI security tools inside labs.

More generally:

...
This is a linkpost for https://www.thecompendium.ai/

We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.

We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but a “comprehensive worldview” doc has been missing. We’ve tried our best to fill this gap, and welcome feedback and debate about the arguments. The Compendium is a living document, and we’ll keep updating it as we learn more and change our minds.

We would appreciate your feedback, whether or not you agree with us:

  • If you do agree with us, please point out where you
...

Typo addressed in the latest patch!

2Adam Shimi
Now addressed in the latest patch!
2Adam Shimi
Now addressed in the latest patch!
2Adam Shimi
Thanks for the comment! We have indeed gotten feedback from multiple people that this part didn't feel detailed enough (although we got this much more from very technical readers than from non-technical ones), and are working on improving the arguments.

This is an excerpt from the Introductions section of a book-length project that was kicked off as a response to the framing of the essay competition on the Automation of Wisdom and Philosophy. Many unrelated-seeming threads open in this post; they will come together by the end of the overall sequence.

If you don't like abstractness, the first few sections may be especially hard going. 

 

Generalization

This sequence is a new story of generalization.

The usual story of progress in generalization, such as in a very general theory, is via the uncovering of deep laws. Distilling the real patterns, without any messy artefacts. Finding the necessities and universals, that can handle wide classes rather than being limited to particularities. The crisp, noncontingent abstractions. It is about opening black boxes. Articulating mind-independent, rigorous results, with no ambiguity and high...

A simple, weak notion of corrigibility is having a "complete" feedback interface. In logical induction terms, I mean that the AI trainer can insert any trader into the market. I want to contrast this with "partial" feedback, in which only some propositions get feedback and others ("latent" propositions) form the structured hypotheses which help predict the observable propositions -- for example, RL, where only rewards and sense-data are observed.

(Note: one might think that the ability to inject traders into LI is still "incomplete" because traders can give feedback on the propositions themselves, not on other traders; so the trader weights constitute "latents" being estimated. However, a trader can effectively vote against another trader by computing all that trader's trades and counterbalancing them. Of course, we can also more...
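As a toy illustration of that counterbalancing move (this is not the actual LI formalism: real traders are continuous functions of the whole price history and face budget constraints, and all names here are invented), a counter-trader can simply recompute the target trader's trades and take the opposite positions:

```python
from typing import Callable, Dict

# A "trade" maps proposition names to quantities bought (+) or sold (-).
Trade = Dict[str, float]
# A "trader" maps current market prices to a trade.
Trader = Callable[[Dict[str, float]], Trade]


def counter_trader(target: Trader) -> Trader:
    """Counterbalance `target`: run it on the same prices and take exactly the
    opposite positions, so the two traders' net effect on the market cancels."""
    def trade(prices: Dict[str, float]) -> Trade:
        return {prop: -qty for prop, qty in target(prices).items()}
    return trade


# Example: a trader that buys proposition "phi" whenever it is priced below 0.5.
def optimist(prices: Dict[str, float]) -> Trade:
    return {"phi": 1.0} if prices.get("phi", 0.5) < 0.5 else {}


pessimist = counter_trader(optimist)
prices = {"phi": 0.3}
print(optimist(prices))   # {'phi': 1.0}
print(pessimist(prices))  # {'phi': -1.0}
```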

2Steve Byrnes
This is a confusing post from my perspective, because I think of LI as being about beliefs and corrigibility as being about desires. If I want my AGI to believe that the sky is green, I guess it’s good if it’s possible to do that. But it’s kinda weird, and not a central example of corrigibility.

Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? If so, well, I’m generally very skeptical of attempts to do that kind of thing. See here, especially Section 7. In the case of humans, it’s perfectly possible for a plan to seem desirable but not plausible, or for a plan to seem plausible but not desirable. I think there are very good reasons that our brains are set up that way.
4Abram Demski
No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing. As I mentioned in the post, I'm agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn't matter very much, in part because we don't have a good notion of purely epistemic.

  • Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
  • In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them), which means we should have agentic concerns about beliefs.
  • Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

[I learned the term teleosemantics from you!  :) ]

The original LI paper was in that category, IIUC. The updates (i.e., which traders end up with more vs less money) are derived from mathematical propositions being true vs false.
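Concretely, a minimal toy sketch of that distinction (invented names and numbers; the real LI market dynamics are much richer): the only thing that moves a trader's bankroll is whether propositions resolve true or false, with no reward signal anywhere.

```python
def settle(bankroll, holdings, resolutions):
    """Toy settlement rule: money changes only according to whether propositions
    resolve true or false -- never whether outcomes are good or bad.
    `holdings` maps each proposition to (quantity held, price paid per share)."""
    for prop, (qty, price) in holdings.items():
        payout = 1.0 if resolutions[prop] else 0.0   # truth value, not goodness
        bankroll += qty * (payout - price)
    return bankroll


# A trader who bought 2 shares of "phi" at price 0.4 gains if phi is proved true...
print(settle(10.0, {"phi": (2.0, 0.4)}, {"phi": True}))   # 11.2
# ...and loses if phi is refuted.
print(settle(10.0, {"phi": (2.0, 0.4)}, {"phi": False}))  # 9.2
```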

LI defines a notion of logically uncertain variable, which can be used to represent desires

I would say that they do...

2Abram Demski
Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.