...The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so. However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes AI systems into AI Safety Levels (ASLs) according to their level of risk, and each level carries associated commitments aimed at mitigating those risks. Some of these commitments take the form of affirmative safety cases: structured arguments that a system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case that rules out certain threats which arise once AIs have sophisticated strategic abilities. The goal
FWIW I agree with you and wouldn't put it the way it is in Roger's post. Not sure what Roger would say in response.
In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:
More generally:
We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.
We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but a comprehensive "worldview" document that collects them has not. We’ve tried our best to fill this gap, and we welcome feedback and debate about the arguments. The Compendium is a living document, and we’ll keep updating it as we learn more and change our minds.
We would appreciate your feedback, whether or not you agree with us:
This is an excerpt from the Introduction section of a book-length project that was kicked off as a response to the framing of the essay competition on the Automation of Wisdom and Philosophy. Many unrelated-seeming threads open in this post; they will come together by the end of the overall sequence.
If you don't like abstractness, the first few sections may be especially hard going.
This sequence is a new story of generalization.
The usual story of progress in generalization, such as in a very general theory, is via the uncovering of deep laws: distilling the real patterns, without any messy artefacts; finding the necessities and universals that can handle wide classes rather than being limited to particularities; the crisp, noncontingent abstractions. It is about opening black boxes, articulating mind-independent, rigorous results with no ambiguity and high...
A simple, weak notion of corrigibility is having a "complete" feedback interface. In logical induction terms, I mean that the AI trainer can insert any trader into the market. I want to contrast this with "partial" feedback, in which only some propositions get feedback, and others ("latent" propositions) form the structured hypotheses which help predict the observable propositions -- for example, RL, where only rewards and sense-data are observed.
(Note: one might think that the ability to inject traders into LI is still "incomplete" because traders can give feedback on the propositions themselves, not on other traders; so the trader weights constitute "latents" being estimated. However, a trader can effectively vote against another trader by computing all that trader's trades and counterbalancing them. Of course, we can also more...
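The counterbalancing move above can be illustrated with a toy sketch. This is not the actual logical-induction machinery: `Trader`, `counter`, and `optimist` are hypothetical stand-ins, treating a trader as nothing more than a map from market prices to trades.

```python
from typing import Callable, Dict

# Toy model: a trader maps market prices to trades.
Prices = Dict[str, float]   # proposition -> current market price
Trades = Dict[str, float]   # proposition -> quantity bought (negative = sold)
Trader = Callable[[Prices], Trades]

def counter(trader: Trader) -> Trader:
    """Return a trader that computes `trader`'s trades and exactly cancels them."""
    def counter_trader(prices: Prices) -> Trades:
        return {prop: -qty for prop, qty in trader(prices).items()}
    return counter_trader

# Example: a trader that buys "P" whenever it is priced below 0.5.
def optimist(prices: Prices) -> Trades:
    return {"P": 1.0} if prices["P"] < 0.5 else {"P": 0.0}

# counter(optimist) sells exactly what optimist buys, so their combined
# position on every proposition is zero -- an effective "vote against".
```

The point of the sketch is just that vetoing another trader requires no direct feedback channel onto traders: simulating its trades and taking the opposite position suffices.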
How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?
[I learned the term teleosemantics from you! :) ]
The original LI paper was in that category, IIUC. The updates (which traders end up with more vs less money) are derived from mathematical propositions being true vs false.
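A minimal sketch of that "true vs false" update, under hypothetical bookkeeping (the real LI accounting is more involved): a trader's bankroll changes only according to how propositions resolve, with no "good vs bad" signal anywhere.

```python
# Toy epistemic update: a share of a proposition bought at price p pays out
# 1 if the proposition resolves true and 0 if it resolves false. The trader's
# bankroll is settled purely on truth values -- never on a reward signal.
def settle(bankroll: float, trades: dict, prices: dict, outcomes: dict) -> float:
    for prop, qty in trades.items():
        payoff = 1.0 if outcomes[prop] else 0.0
        bankroll += qty * (payoff - prices[prop])
    return bankroll
```

For instance, a trader who bought 2 shares of "P" at price 0.4 gains 2 × (1 − 0.4) = 1.2 if "P" resolves true, and loses 2 × 0.4 = 0.8 if it resolves false.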
LI defines a notion of logically uncertain variable, which can be used to represent desires.
I would say that they do...