Fabien Roger - AI Alignment Forum

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Hard DBIC: you have no access to any classification data in
Relaxed DBIC: you have access to classification inputs $x$ from $D ∖ D_{a}$ , but not to any labels.
SHIFT as a technique for (hard) DBIC

You use pile data points to build the SAE and its interpretations, right? And I guess the pile does contain a bunch of examples where the biased and unbiased classifiers would not output identical outputs - if that's correct, I expect SAE interpretation works mostly because of these inputs (since SAE nodes are labeled using correlational data only). Is that right? If so, it seems to me that because of the SAE and SAE interpretation steps, SHIFT is a technique that is closer in spirit to relaxed DBIC (or something in between if you use a third dataset that does not literally use $D_{a}$ but something that teaches you something more than just $D$ - in the context of the paper, it seems that the broader dataset is very close to covering $D_{a}$ ).

Fabien's Shortform

Fabien Roger11d34

By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia).

The TL;DR is that Knightian uncertainty is not a useful concept to make decisions, while the use subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events".

For a more in-depth defense of this position in the context of long-term predictions, where it's harder to know if calibration training obviously works, see the latest scott alexander post.

Fabien's Shortform

Fabien Roger14d217

I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices), and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book:

There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimations like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (relying only on statistics, with no causal model), or like a consultant (usually with shitty qualitative approaches).
The state of risk estimation for insurances is actually pretty good: it's quantitative, and there are strong professional norms around different kinds of malpractice. When actuaries tank a company because they ignored tail outcomes, they are at risk of losing their license.
The state of risk estimation in consulting and management is quite bad: most risk management is done with qualitative methods which have no positive evidence of working better than just relying on intuition alone, and qualitative approaches (like risk matrices) have weird artifacts:
- Fuzzy labels (e.g. "likely", "important", ...) create illusions of clear communication. Just defining the fuzzy categories doesn't fully alleviate that (when you ask people to say what probabilities each box corresponds to, they often fail to look at the definition of categories).
- Inconsistent qualitative methods make cross-team communication much harder.
- Coarse categories mean that you introduce weird threshold effects that sometimes encourage ignoring tail effects and make the analysis of past decisions less reliable.
- When choosing between categories, people are susceptible to irrelevant alternatives (e.g. if you split the "5/5 importance (loss > $1M)" category into "5/5 ($1-10M), 5/6 ($10-100M), 5/7 (>$100M)", people answer a fixed "1/5 (<10k)" category less often).
- Following a qualitative method can increase confidence and satisfaction, even in cases where it doesn't increase accuracy (there is an "analysis placebo effect").
- Qualitative methods don't prompt their users to either seek empirical evidence to inform their choices.
- Qualitative methods don't prompt their users to measure their risk estimation track record.
Using quantitative risk estimation is tractable and not that weird. There is a decent track record of people trying to estimate very-hard-to-estimate things, and a vocal enough opposition to qualitative methods that they are slowly getting pulled back from risk estimation standards. This makes me much less sympathetic to the absence of quantitative risk estimation at AI labs.

A big part of the book is an introduction to rationalist-type risk estimation (estimating various probabilities and impact, aggregating them with Monte-Carlo, rejecting Knightian uncertainty, doing calibration training and predictions markets, starting from a reference class and updating with Bayes). He also introduces some rationalist ideas in parallel while arguing for his thesis (e.g. isolated demands for rigor). It's the best legible and "serious" introduction to classic rationalist ideas I know of.

The book also contains advice if you are trying to push for quantitative risk estimates in your team / company, and a very pleasant and accurate dunk on Nassim Taleb (and in particular his claims about models being bad, without a good justification for why reasoning without models is better).

Overall, I think the case against qualitative methods and for quantitative ones is somewhat strong, but it's far from being a slam dunk because there is no evidence of some methods being worse than others in terms of actual business outputs. The author also fails to acknowledge and provide conclusive evidence against the possibility that people may have good qualitative intuitions about risk even if they fail to translate these intuitions into numbers that make any sense (your intuition sometimes does the right estimation and math even when you suck at doing the estimation and math explicitly).

Davidad's Bold Plan for Alignment: An In-Depth Explanation

Fabien Roger15d137

I don't think I understand what is meant by "a formal world model".

For example, in the narrow context of "I want to have a screen on which I can see what python program is currently running on my machine", I guess the formal world model should be able to detect if the model submits an action that exploits a zero-day that tampers with my ability to see what programs are running. Does that mean that the formal world model has to know all possible zero-days? Does that mean that the software and the hardware have to be formally verified? Are formally verified computers roughly as cheap as regular computers? If not, that would be a clear counter-argument to "Davidad agrees that this project would be one of humanity's most significant science projects, but he believes it would still be less costly than the Large Hadron Collider."

Or is the claim that it's feasible to build a conservative world model that tells you "maybe a zero-day" very quickly once you start doing things not explicitly within a dumb world model?

I feel like this formally-verifiable computers claim is either a good counterexample to the main claims, or an example that would help me understand what the heck these people are talking about.

Fabien's Shortform

Fabien Roger21d10

The full passage in this tweet thread (search for "3,000").

Fabien's Shortform

Fabien Roger21d10

I remembered mostly this story:

[...] The NSA invited James Gosler to spend some time at their headquarters in Fort Meade, Maryland in 1987, to teach their analysts [...] about software vulnerabilities. None of the NSA team was able to detect Gosler’s malware, even though it was inserted into an application featuring only 3,000 lines of code. [...]

[Taken from this summary of this passage of the book. The book was light on technical detail, I don't remember having listened to more details than that.]

I didn't realize this was so early in the story of the NSA, maybe this anecdote teaches us nothing about the current state of the attack/defense balance.

Fabien's Shortform

Fabien Roger22d252

I listened to the book This Is How They Tell Me the World Ends by Nicole Perlroth, a book about cybersecurity and the zero-day market. It describes in detail the early days of bug discovery, the social dynamics and moral dilemma of bug hunts.

(It was recommended to me by some EA-adjacent guy very worried about cyber, but the title is mostly bait: the tone of the book is alarmist, but there is very little content about potential catastrophes.)

My main takeaways:

Vulnerabilities used to be dirt-cheap (~$100) but are still relatively cheap (~$1M even for big zero-days);
If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-lines programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense favored;
Bug bounties are a relatively recent innovation, and it felt very unnatural to tech giants to reward people trying to break their software;
A big lever companies have on the US government is the threat that overseas competitors will be favored if the US gov meddles too much with their activities;
The main effect of a market being underground is not making transactions harder (people find ways to exchange money for vulnerabilities by building trust), but making it much harder to figure out what the market price is and reducing the effectiveness of the overall market;
Being the target of an autocratic government is an awful experience, and you have to be extremely careful if you put anything they dislike on a computer. And because of the zero-day market, you can't assume your government will suck at hacking you just because it's a small country;
It's not that hard to reduce the exposure of critical infrastructure to cyber-attacks by just making companies air gap their systems more - Japan and Finland have relatively successful programs, and Ukraine is good at defending against that in part because they have been trying hard for a while - but it's a cost companies and governments are rarely willing to pay in the US;
Electronic voting machines are extremely stupid, and the federal gov can't dictate how the (red) states should secure their voting equipment;
Hackers want lots of different things - money, fame, working for the good guys, hurting the bad guys, having their effort be acknowledged, spite, ... and sometimes look irrational (e.g. they sometimes get frog-boiled).
The US government has a good amount of people who are freaked out about cybersecurity and have good warning shots to support their position. The main difficulty in pushing for more cybersecurity is that voters don't care about it.
- Maybe the takeaway is that it's hard to build support behind the prevention of risks that 1. are technical/abstract and 2. fall on the private sector and not individuals 3. have a heavy right tail. Given these challenges, organizations that find prevention inconvenient often succeed in lobbying themselves out of costly legislation.

Overall, I don't recommend this book. It's very light on details compared to The Hacker and the State despite being longer. It targets an audience which is non-technical and very scope insensitive, is very light on actual numbers, technical details, real-politic considerations, estimates, and forecasts. It is wrapped in an alarmist journalistic tone I really disliked, covers stories that do not matter for the big picture, and is focused on finding who is in the right and who is to blame. I gained almost no evidence either way about how bad it would be if the US and Russia entered a no-holds-barred cyberwar.

New report: Safety Cases for AI

Fabien Roger1mo20

My bad for testbeds, I didn't have in mind that you were speaking about this kind of testbeds as opposed to the general E[U|not scheming] analogies (and I forgot you had put them at medium strength, which is sensible for these kinds of testbeds). Same for "the unwarranted focus on claim 3" - it's mostly because I misunderstood what the countermeasures were trying to address.

I think I don't have a good understanding of the macrosystem risks you are talking about. I'll look at that more later.

I think I was a bit unfair about the practicality of techniques that were medium-strength - it's true that you can get some evidence for safety (maybe 0.3 bits to 1 bit) by using the techniques in a version that is practical.

On practicality and strength, I think there is a rough communication issue here: externalized reasoning is practical, but it's currently not strong - and it could eventually become strong, but it's not practical (yet). The same goes for monitoring. But when you write the summary, we see "high practicality and high max strength", which feels to me like it implies it's easy to get medium-scalable safety cases that get you acceptable levels of risks by using only one or two good layers of security - which I think is quite wild even if acceptable=[p(doom)<1%]. But I guess you didn't mean that, and it's just a weird quirk of the summarization?

Fabien's Shortform

Fabien Roger1mo234

I just finished listening to The Hacker and the State by Ben Buchanan, a book about cyberattacks, and the surrounding geopolitics. It's a great book to start learning about the big state-related cyberattacks of the last two decades. Some big attacks /leaks he describes in details:

Wire-tapping/passive listening efforts from the NSA, the "Five Eyes", and other countries
The multi-layer backdoors the NSA implanted and used to get around encryption, and that other attackers eventually also used (the insecure "secure random number" trick + some stuff on top of that)
The shadow brokers (that's a *huge* leak that went completely under my radar at the time)
Russia's attacks on Ukraine's infrastructure
Attacks on the private sector for political reasons
Stuxnet
The North Korea attack on Sony when they released a documentary criticizing their leader, and misc North Korean cybercrime (e.g. Wannacry, some bank robberies, ...)
The leak of Hillary's emails and Russian interference in US politics
(and more)

Main takeaways (I'm not sure how much I buy these, I just read one book):

Don't mess with states too much, and don't think anything is secret - even if you're the NSA
The US has a "nobody but us" strategy, which states that it's fine for the US to use vulnerabilities as long as they are the only one powerful enough to find and use them. This looks somewhat nuts and naive in hindsight. There doesn't seem to be strong incentives to protect the private sector.
There are a ton of different attack vectors and vulnerabilities, more big attacks than I thought, and a lot more is publicly known than I would have expected. The author just goes into great details about ~10 big secret operations, often speaking as if he was an omniscient narrator.
Even the biggest attacks didn't inflict that much (direct) damage (never >10B in damage?) Unclear if it's because states are holding back, if it's because they suck, or if it's because it's hard. It seems that even when attacks aim to do what some people fear the most (e.g. attack infrastructure, ...) the effect is super underwhelming.
- The bottleneck in cyberattacks is remarkably often the will/the execution, much more than actually finding vulnerabilities/entry points to the victim's network.
- The author describes a space where most of the attacks are led by clowns that don't seem to have clear plans, and he often seems genuinely confused why they didn't act with more agency to get what they wanted (does not apply to the NSA, but does apply to a bunch of Russia/Iran/Korea-related attacks)
Cyberattacks are not amazing tools to inflict damage or to threaten enemies if you are a state. The damage is limited, and it really sucks that (usually) once you show your capability, it reduces your capability (unlike conventional weapons). And states don't like to respond to such small threats. The main effect you can have is scaring off private actors from investing in a country / building ties with a country and its companies, and leaking secrets of political importance.
Don't leak secrets when the US presidential election is happening if they are unrelated to the election, or nobody will care.

(The author seems to be a big skeptic of "big cyberattacks" / cyberwar, and describes cyber as something that always happens in the background and slowly shapes the big decisions. He doesn't go into the estimated trillion dollar in damages of everyday cybercrime, nor the potential tail risks of cyber.)

New report: Safety Cases for AI

Fabien Roger1mo126

I really liked this report! I think it fills a natural and important hole in the public writing about AI control and makes interesting claims/recommendations.

A few innovations I liked:

Risk cases - I think that's plausibly an amazing policy?
Calling it "black swans" when an AI decides it's time to take over (even if the distribution has not meaningfully shifted) ("Black swans are rare inputs that cause AI systems to catastrophically misbehave. Black swans may be rare enough to where they are not sampled during evaluation but are common enough to occur in deployment").
Calling the crazy bootstrapping hopes "deference". I like the vibes of this word. Deference is still somewhat counterintuitive to me. My first reaction when reading “AI advisors assert that other AI systems do not pose a catastrophic risk, and these AI advisors are at least as credible as human decision-makers.” was to think "human-level judgement is a super low bar when you're speaking of the safety of super powerful systems", but the default mistrust of super powerful AIs would naturally be taken into account by AI advisors if they are at least as credible as human decision-makers.
Linking GSN diagrams and other things that are common-practice in other industries with AI safety.
The report is beautiful, I like the little inline figures, and Figure 14 is a nice illustration of catch-then-shutdown approaches.
You mention unlearning, a subfield that lacks love and attention. I think its characterization as medium-practical and medium-strength is fair (especially given how generous he is with other approaches), but it's not very scalable imo (it breaks when AIs can derive things from scratch quite well).

Some elements I disagree with:

I basically disagree with most hopes described in the trustworthiness section:
- The level of practicality you assign to some approaches is just insanely high. Neither modeling generalization nor externalized reasoning, and certainly not testbeds seem to "strongly practical" (defined as "Applies to arguments that do not require fundamental research advances, but might require"). The only one that is justifiably "strongly practical" to me is "absence of precursor abilities".
- I disagree with putting testbeds and amplification in trustworthiness (at least not as strong or even medium practicality), I think they usually don't help to know if AIs can be trusted (and in particular they don't prevent any forms of alignment faking).
- "Claim 4: AI systems are not faking alignment" is just extremely hard to assess. I think that the focus of trustworthiness assessment on "Claim 3: All enumerated causes of correlated infractions are unlikely aside from alignment faking." is somewhat unwarranted.
- I think the description of macrosystem risks was not very convincing, aside from risks that make systems less trustworthy - so I think macrosystem should maybe have been folded into trustworthiness.
The strength of almost all "high strength" measures is overrated, almost nothing is "strong" (where "strong" is defined as "Applies to arguments that could bring catastrophic risk below acceptable levels without needing other arguments."). Control is not strong, monitoring is not strong, externalized reasoning is not strong (for control and trustworthiness), ... Even with a notion of acceptable that is quite generous (e.g. 2% p(doom)) I think nothing is strong, but given the stakes I think we shouldn't use "acceptable" for anything above 1% (and ideally reserve that word for 0.1% or below).

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wiki Contributions

Comments

SHIFT as a technique for (hard) DBIC