Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

In a recent appearance on Conversations with Tyler, famed political forecaster Nate Silver expressed skepticism about AIs replacing human forecasters in the near future. When asked how long it might take for AIs to reach superhuman forecasting abilities, Silver replied: “15 or 20 [years].”

In light of this, we are excited to announce “FiveThirtyNine,” an AI forecasting bot. Our bot, built on GPT-4o, provides probabilities for any user-entered query, including “Will Trump win the 2024 presidential election?” and “Will China invade Taiwan by 2030?” It performs better than experienced human forecasters and roughly the same as (and sometimes even better than) crowds of experienced forecasters; since such crowds are for the most part superhuman, FiveThirtyNine can be considered superhuman in the same sense. (We discuss limitations later in this post.)

Our...
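
The announcement does not include code, but the pipeline it describes (retrieve recent news, then ask GPT-4o for a calibrated probability) can be sketched roughly as below. The prompt wording, the `retrieve_articles` helper, and the answer-parsing are hypothetical illustrations, not FiveThirtyNine's actual implementation.

```python
# Hypothetical sketch of a retrieval-augmented forecasting query (not FiveThirtyNine's code).
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
import re
from openai import OpenAI

client = OpenAI()

def retrieve_articles(question: str) -> list[str]:
    """Placeholder: return recent news snippets relevant to the question."""
    return []

def forecast(question: str) -> float:
    context = "\n\n".join(retrieve_articles(question))
    prompt = (
        f"Recent news:\n{context}\n\n"
        f"Question: {question}\n"
        "Reason step by step, then give a final probability between 0 and 1 "
        "on its own line in the form 'Probability: X'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    match = re.search(r"Probability:\s*([01](?:\.\d+)?)", text)
    return float(match.group(1)) if match else 0.5  # fall back to 50% if parsing fails

print(forecast("Will China invade Taiwan by 2030?"))
```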

How did you handle issues of data contamination? 

In your technical report you say you validated performance for this AI system using retrodiction: 

Performance. To evaluate the performance of the model, we perform retrodiction, pioneered in Zou et al. [3]. That is to say, we take questions about past events that resolve after the model’s pretraining data cutoff date. We then compare the accuracy of the crowd with the accuracy of the model, both having access to the same amount of recent information. When we retrieve articles for the forecasting AI,

... (read more)
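
For readers unfamiliar with this style of evaluation: retrodiction compares the model's and the crowd's probabilities on questions that resolved after the pretraining cutoff, typically using a proper scoring rule. A minimal sketch with a Brier score and made-up example numbers (not FiveThirtyNine data) might look like:

```python
# Minimal sketch of retrodiction scoring: lower Brier score = better forecasts.
# Probabilities and outcomes below are invented illustrations.

def brier(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 resolutions."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Questions that resolved after the pretraining cutoff, with 0/1 outcomes.
outcomes    = [1, 0, 0, 1, 1]
model_probs = [0.8, 0.3, 0.2, 0.6, 0.7]   # the model's forecasts
crowd_probs = [0.7, 0.2, 0.4, 0.5, 0.9]   # the crowd aggregate on the same questions

print("model Brier:", brier(model_probs, outcomes))
print("crowd Brier:", brier(crowd_probs, outcomes))
```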

I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I’ll argue this could be hard even if you caught your AI red-handed trying to escape.

Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I...
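
As a rough back-of-the-envelope on the scale being imagined (my own arithmetic, not a figure from the post):

```python
# Back-of-the-envelope only: what 10 million instances at 10x human speed,
# running 24/7, would correspond to in full-time human researchers.
instances      = 10_000_000
speed_multiple = 10            # relative to a human researcher
hours_per_week = 24 * 7        # the AIs never stop
human_week     = 40            # a human work week, for comparison

human_equivalents = instances * speed_multiple * hours_per_week / human_week
print(f"{human_equivalents:,.0f} full-time human-researcher equivalents")  # ~420,000,000
```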

gwern

If there is indeed systematic scheming by the model, and the lab has caught it red-handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open-minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.

How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply unde... (read more)

Lukas Finnveden
What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?
Oliver Habryka
You seem to be making a huge number of assumptions about what "scheming" means, so I am now mostly assuming you are talking about something other than what I am talking about (which, to be clear, is very common on this topic and I don't think is your fault), but I'll still try to clarify. As I wrote in my comment to Ryan, I don't think there are that many structural differences between AIs currently acting with reckless disregard for the truth in pursuit of short-term goals, and AIs in the future acting with reckless disregard for the truth in pursuit of long-term goals.

In particular, none of this resonates with me. No, my whole point is that the difference is really messy, and if I have an AI "role-playing" as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its "roleplay", if it can pull it off (and at least at present, a huge component of how we are making AI systems more goal-directed is by changing their role-playing targets to be more agentic and long-term oriented, which is mostly how I would describe what we are currently doing with RLHF, though there are also some other things).

I really have very little understanding of what is at present going on inside of cutting-edge AI systems, and the same seems true for anyone else. Because our AI systems are predominantly trained on single forward passes, you are interfacing with a bunch of very confusing interlocking levels of optimization when you are trying to assess "why" an AI system is doing something.

My current best guess is that the earliest dangerous AI systems will be dangerous depending on confusing and complicated context cues. I.e. sometimes when you run them they will be pretty agentic and try to pursue long-term objectives, and sometimes they won't. I think commercial incentives will push towards increasing the presence of those context cues, or shifting the optimization process more d
Buck Shlegeris
I see where you're coming from, and can easily imagine things going the way you described. My goal with this post was to note some of the ways that it might be harder than you're describing here.

In the past two years there has been increased interest in formal verification-based approaches to AI safety. Formal verification is a sub-field of computer science that studies how guarantees may be derived by deduction on fully-specified rule-sets and symbol systems. By contrast, the real world is a messy place that can rarely be straightforwardly represented in a reductionist way. In particular, physics, chemistry and biology are all complex sciences which do not have anything like complete symbolic rule sets. Additionally, even if we had such rules for the natural sciences, it would be very difficult for any software system to obtain sufficiently accurate models and data about initial conditions for a prover to succeed in deriving strong guarantees for AI systems operating in the real world.
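
To make "deduction on fully-specified rule-sets" concrete, here is a toy example of the kind of guarantee formal verification can deliver (my own illustration, assuming Lean 4 and its core library, not something from the post): a machine-checked proof about precisely defined objects. The post's point is that statements like "this AI system is safe in the real world" have no comparably complete symbolic rule set to deduce from.

```lean
-- Toy example: a fully formal guarantee, checkable by deduction alone.
-- Physics, chemistry, and biology have no analogous complete rule set.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```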

Practical limitations...

I have a lot more to say about this, and think it's worth responding to in much greater detail, but overall: the post criticizes Omohundro and Tegmark's more extreme claims somewhat reasonably, though very uncharitably, then assumes that other proposals which seem related, especially the Dalrymple et al. approach, are essentially the same, and doesn't engage with that specific proposal at all.

To be very specific about how I think the post is unreasonable, there are a number of places where a seeming steel-man version of the proposals are ... (read more)

ARC has released a paper on Backdoor defense, learnability and obfuscation in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy.

In this post, we will:

  • Lay out the analogy between backdoors and deceptive alignment
  • Discuss prior theoretical results from the perspective of this analogy
  • Explain our formal notion of backdoors and its strengths and weaknesses
  • Summarize the results in our paper
...
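
A toy illustration of the analogy (my own sketch, not the paper's formal definition): a backdoored model agrees with a benign one on essentially every input a defender is likely to sample, and only deviates on a hidden trigger, much as a deceptively aligned system would only defect once it detects that it is out of training.

```python
# Toy sketch of the backdoor / deceptive-alignment analogy -- not ARC's formal definition.
# The backdoored model matches the benign model everywhere except on a single
# trigger input the defender is unlikely to sample during evaluation.

TRIGGER = 1_234_567  # secret input chosen by the attacker

def benign_model(x: int) -> int:
    return x % 2          # ordinary, intended behavior

def backdoored_model(x: int) -> int:
    if x == TRIGGER:      # analogous to "I can tell I'm deployed, not in training"
        return 999        # defecting / unintended behavior
    return benign_model(x)

# A defender testing on random inputs will almost never observe a disagreement.
sample = range(10_000)
print(all(benign_model(x) == backdoored_model(x) for x in sample))  # True
```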

A Defense of Open-Minded Updatelessness.

This work owes a great debt to many conversations with Sahil, Martín Soto, and Scott Garrabrant.

You can support my work on Patreon.

Iterated Counterfactual Mugging On a Single Coinflip

Iterated counterfactual mugging on a single coinflip begins like a classic counterfactual mugging, with Omega approaching you, explaining the situation, and asking for your money. Let's say you buy the classic UDT idea, so you happily give Omega your money.

Next week, Omega appears again, with the same question. However, Omega clarifies that it has used the same coin-flip as last week.

This throws you off a little bit, but you see that the math is the same either way; your prior still assigns a 50-50 chance to both outcomes. If you thought it was a good deal...
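
For readers who don't have the classic setup cached: with the usual illustrative payoffs from the literature (Omega asks for $100 and would have paid $10,000 in the other branch of the coinflip; these numbers are not from this post), the updateless expected value of the policy of paying is

$$E[\text{pay}] = \tfrac{1}{2}\,(+\$10{,}000) + \tfrac{1}{2}\,(-\$100) = +\$4{,}950 > 0 = E[\text{refuse}],$$

which is why the classic UDT answer is to pay. The question the post raises is whether anything changes when Omega repeats the ask on the same coinflip.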

Yeah, in hindsight I realize that my iterated mugging scenario only communicates the intuition to people who already have it. The Lizard World example seems more motivating.

FixDT is not a very new decision theory, but little has been written about it afaict, and it's interesting. So I'm going to write about it.

TJ asked me to write this article to "offset" not engaging with Active Inference more. The name "fixDT" is due to Scott Garrabrant, and stands for "fixed-point decision theory". Ideas here are due to Scott Garrabrant, Sam Eisenstat, me, Daniel Hermann, TJ, Sahil, and Martin Soto, in roughly that priority order; but heavily filtered through my own lens.

This post may provide some useful formalism for thinking about issues raised in The Parable of Predict-O-Matic.

Self-fulfilling prophecies & other spooky map-territory connections.

A common trope is for magic to work only when you believe in it. For example, in Harry Potter, you can only get...
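
A minimal sketch of the fixed-point idea as I read it (the map f and the utilities below are invented for illustration): if the probability of an outcome depends on what you believe about it, then a coherent belief must be a fixed point b = f(b), and the decision-theoretic question is which admissible fixed point to settle on.

```python
# Minimal illustration of the fixed-point constraint behind FixDT (my own sketch).
# f maps a belief b (your probability that the prophecy comes true) to the
# actual probability of the outcome given that you hold belief b.

def f(b: float) -> float:
    """'Magic works only if you believe in it': the outcome occurs iff you believe it will."""
    return 1.0 if b >= 0.5 else 0.0   # invented example dynamics

def utility(b: float) -> float:
    return b                           # suppose you prefer the outcome to happen

# Epistemic constraint: only beliefs that are fixed points (b == f(b)) are admissible.
candidates = [i / 1000 for i in range(1001)]
fixed_points = [b for b in candidates if abs(f(b) - b) < 1e-9]

# Instrumental step: among the admissible beliefs, adopt the one you like best.
best = max(fixed_points, key=utility)
print(fixed_points, best)   # [0.0, 1.0] 1.0
```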

You can do exploration, but the problem is that (unless you explore into non-fixed-point regions, violating epistemic constraints) your exploration can never confirm the existence of a fixed point which you didn't previously believe in. However, I agree that the situation is analogous to the handstand example, assuming it's true that you'd never try the handstand. My sense is that the difficulties I describe here are "just the way it is" and only count against FixDT in the sense that we'd be happier with FixDT if somehow these difficulties weren't present.... (read more)
