DanielFilan — AI Alignment Forum

Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems)

Ironically, given that it's currently June 11th (two days after my last tweet was posted), my final tweet provides two examples of the planning fallacy.

"Hopefully" is not a prediction!

The 2023 LessWrong Review: The Basic Ask

DanielFilan10mo20

Wait I'm a moron and the thing I checked was actually whether it was an exponential function, sorry.

The 2023 LessWrong Review: The Basic Ask

DanielFilan10mo0-3

Votes cost quadratic points – a vote strength of "1" costs 1 point. A vote of strength 4 costs 10 points. A vote of strength 9 costs 45.

FYI this is not a quadratic function.

[This comment is no longer endorsed by its author]
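For reference, the three costs quoted above are consistent with the triangular-number formula cost(n) = n(n+1)/2, which is quadratic in n. The formula is an assumption inferred from the three quoted data points, not something stated in the quoted text, and `vote_cost` is a hypothetical name. A minimal sketch of the check:

```python
def vote_cost(strength: int) -> int:
    """Hypothetical cost formula: the n-th triangular number, n(n+1)/2."""
    return strength * (strength + 1) // 2


# The (strength, cost) pairs quoted in the comment above.
quoted = {1: 1, 4: 10, 9: 45}

# True if the assumed formula reproduces every quoted cost.
matches = all(vote_cost(n) == cost for n, cost in quoted.items())
print(matches)
```

Since n(n+1)/2 = n²/2 + n/2, a cost rule of this shape would be a quadratic function of vote strength.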

DanielFilan's Shortform Feed

DanielFilan11mo30

When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:

putting a bunch of agents in a simulated world and seeing how they interact
weak-to-strong / easy-to-hard generalization

basically stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities

DanielFilan's Shortform Feed

DanielFilan11mo172

A theory of how alignment research should work

(cross-posted from danielfilan.com)

Epistemic status:

I listened to the Dwarkesh episode with Gwern and started attempting to think about life, the universe, and everything
less than an hour of thought has gone into this post
that said, it comes from a background of me thinking for a while about how the field of AI alignment should relate to agent foundations research

Maybe obvious to everyone but me, or totally wrong (this doesn't really grapple with the challenges of working in a domain where an intelligent being might be working against you), but:

we currently don't know how to make super-smart computers that do our will
- this is not just a problem of having a design that is not feasible to implement: we do not even have a sense of what the design would be
- I'm trying to somewhat abstract over intent alignment vs control approaches, but am mostly thinking about intent alignment
- I have not thought very much about societal/systemic risks, and this post doesn't really address them.
ideally we would figure out how to do this
the closest traction that we have: deep learning seems to work well in practice, altho our theoretical knowledge of why it works so well or how capabilities are implemented is lagging
how should we proceed? Well:
- thinking about theory alone has not been practical
- probably we need to look at things that exhibit alignment-related phenomena and understand them, and that will help us develop the requisite theory
  - said things are probably neural networks
- there are two ways we can look at neural networks: their behaviour, and their implementation.
- looking at behaviour is conceptually straightforward, and valuable, and being done
- looking at their implementation is less obvious
- what we need is tooling that lets us see relevant things about how neural networks are working
- such tools (e.g. SAEs) are not impossible to create, but it is not obvious that their outputs tell us quantities that are actually of interest
- in order to discipline the creation of such tools, we should demand that they help us understand models in ways that matter
  - see Stephen Casper's engineer's interpretability sequence, Jason Gross on compact proofs
- once we get such tools, we should be trying to use them to understand alignment-relevant phenomena, to build up our theory of what we want out of alignment and how it might be implemented
  - this is also a thing that looking at the external behaviour of models in alignment-relevant contexts should be doing
so should we be just doing totally empirical things? No.
- firstly, we need to be disciplined along the way by making sure that we are looking at settings that are in fact relevant to the alignment problem, when we do our behavioural analysis and benchmark our interpretability tools. This involves having a model of what situations are in fact alignment-relevant, what problems we will face as models get smarter, etc
- secondly, once we have the building blocks for theory, ideally we will put them together and make some actual theorems like "in such-and-such situations models will never become deceptive" (where 'deceptive' has been satisfactorily operationalized in a way that suffices to derive good outcomes from no deception and relatively benign humans)
I'm imagining the above as being analogous to an imagined history of statistical mechanics (people who know this history or who have read "Inventing Temperature" should let me know if I'm totally wrong about it):
- first we have steam engines etc
- then we figure out that 'temperature' and 'entropy' are relevant things to track for making the engines run
- then we relate temperature, entropy, and pressure
- then we get a good theory of thermodynamics
- then we develop statistical mechanics
exceptions to "theory without empiricism doesn't work":
- thinking about deceptive mesa-optimization
- RLHF failures
- CIRL analysis
lesson of above: theory does seem to help us analyze some issues and raise possibilities

New report: Safety Cases for AI

DanielFilan1y20

Since people have reported not being able to see the tweet thread, I will reproduce it in this comment (with pictures replaced by my descriptions of them):

If developers had to prove to regulators that powerful AI systems are safe to deploy, what are the best arguments they could use?

Our new report tackles the (very big!) question of how to make a ‘safety case’ for AI.

[image of the start of the paper]

We define a safety case as a rationale developers provide to regulators to show that their AI systems are unlikely to cause a catastrophe.

The term ‘safety case’ is not new. In many industries (e.g. aviation), products are ‘put on trial’ before they are released.

[cartoon of a trial: regulator is a judge, one side is a developer advancing a safety case, other side is a red team advancing a risk case]

We simplify the process of making a safety case by breaking it into six steps.

1. Specify the macrosystem (all AI systems) and the deployment setting.

2. Concretize 'AI systems cause a catastrophe' into specific unacceptable outcomes (e.g. the AI systems build a bioweapon)

3. Justify claims about the deployment setting.

4. Carve up the collection of AI systems into smaller groups (subsystems) that can be analyzed in isolation.

5. Assess risk from subsystems acting unilaterally.

6. Assess risk from subsystems cooperating together.

[The above, but in picture format]

We first focus on step 5: how would developers argue that individual subsystems are safe? These arguments can be grouped into four categories.

Arguments could assert AI systems are

too weak to be dangerous even if open-sourced (inability)

unable to cause disaster due to restrictions (control)

safe despite ability to cause harm (trustworthiness)

judged to be safe by credible AI advisors (deference)

[The above, but in picture format and slightly reworded]

Consider the analogy of a robber and a diamond. How might one justify that the robber won’t steal the diamond?

[Image with heading: "Why won't the robber steal the diamond?".

Row 1 says "Inability The robber is an infant. Analogy: e.g. GPT-3 is clearly too weak to be dangerous", next to an image of a baby near a diamond on a pedestal.

Row 2 says "Control The robber is in jail. Analogy: e.g. AI systems are sandboxed and can't escape.", next to an image of a prisoner in a box with bars, near a diamond on a pedestal.

Row 3 says "Trustworthiness The robber does not steal. Analogy: e.g. AI systems follow developer instructions.", next to an image of a person sitting down cross-legged and wearing an orange robe, next to a diamond on a pedestal.

Row 4 says "Deference The security service CEO says the diamond is secure. Analogy: AI advisors say that AI systems are safe.", next to an image of a figure with grey hair saying "All good" next to a diamond on a pedestal with lines coming out of it, connecting it to small grey boxes (sensors?).]

The ordering of these categories is intentional. As AI systems become more powerful, developers will likely rely mostly on inability, then control, then trustworthiness, and finally, deference to AI advisors.

[Image of graph where the horizontal axis is "Increasingly powerful AI" and the vertical axis is "Primary safety argument". Inability, Control, Trustworthiness, and Deference are shown in order from bottom-left to top-right. An arrow connects the words "We are here" to Inability.]

Next, we give examples of arguments in each category. Arguments are ranked on three axes:

Practicality

Strength

Scalability

No argument received full marks! Research will be needed to justify the safety of advanced AI systems.

[A complicated diagram showing a variety of arguments under the Inability, Control, Trustworthiness, and Deference categories, together with ratings for their Practicality, Maximum Strength, and Scalability.]

The arguments in the previous step pertain to small groups of AI systems. It would be difficult to directly apply them to large groups. We also explain how to justify that the actions of many AI systems won’t cause a catastrophe (step 6 in our framework).

[Image titled "Large-scale AI misbehavior". Below are 3 rows, with 2 columns. The left column is labelled "Causes" and the right is labelled "Strategies".

Row 1: Cause: Widespread alignment faking. Strategy: Blitzkrieg: overwhelm controls

Row 2: Cause: Infectious jailbreaks. Strategy: Strike: disable infrastructure.

Row 3: Cause: Rapid memetic value drift. Strategy: Hivemind: combine intelligence.

Dots are shown below, likely to indicate that there are more causes and strategies not shown.]

We are hoping this report will:

Motivate research that further clarifies the assumptions behind safety arguments.

Inform the design of hard safety standards.

More in the paper: https://bit.ly/3IJ5N95 Many thanks to my coauthors! @NickGabs01, @DavidSKrueger, and @thlarsen.

Might be of interest to @bshlgrs, @RogerGrosse, @DavidDuvenaud, @EvanHub, @aleks_madry, @ancadianadragan, @rohinmshah, @jackclarkSF, @Manderljung, @RichardMCNgo

What 2026 looks like

DanielFilan1y42

FWIW, the discussion of AI-driven propaganda doesn't seem as prescient.

What 2026 looks like

DanielFilan1y42

So [in 2024], the most compute spent on a single training run is something like 5x10^25 FLOPs.

As of June 20th 2024, this is exactly Epoch AI's central estimate of the most compute spent on a single training run, as displayed on their dashboard.

DanielFilan's Shortform Feed

DanielFilan1y112

Frankfurt-style counterexamples for definitions of optimization

In "Bottle Caps Aren't Optimizers", I wrote about a type of definition of optimization that says system S is optimizing for goal G iff G has a higher value than it would if S didn't exist or were randomly scrambled. I argued against these definitions by providing examples of systems that satisfy the criterion but are not optimizers. But today, I realized that I could repurpose Frankfurt cases to get examples of optimizers that don't satisfy this criterion.

A Frankfurt case is a thought experiment designed to disprove the following intuitive principle: "a person is morally responsible for what she has done only if she could have done otherwise." Here's the basic idea: suppose Alice is considering whether or not to kill Bob. Upon consideration, she decides to do so, takes out her gun, and shoots Bob. But unbeknownst to her, a neuroscientist had implanted a chip in her brain that would have forced her to shoot Bob if she had decided not to. That said, the chip didn't activate, because she did decide to shoot Bob. The idea is that she's morally responsible, even tho she couldn't have done otherwise.

Anyway, let's do this with optimizers. Suppose I'm playing Go, thinking about how to win - imagining what would happen if I played various moves, and playing moves that make me more likely to win. Further suppose I'm pretty good at it. You might want to say I'm optimizing my moves to win the game. But suppose that, unbeknownst to me, behind my shoulder is famed Go master Shin Jinseo. If I start playing really bad moves, or suddenly die or vanish etc, he will play my moves, and do an even better job at winning. Now, if you remove me or randomly rearrange my parts, my side is actually more likely to win the game. But that doesn't mean I'm optimizing to lose the game! So this is another way such definitions of optimizers are wrong.
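To make the structure of the counterexample concrete, here is a toy sketch of the counterfactual criterion applied to the Go case. All the win probabilities are made-up numbers chosen purely for illustration, and the function and constant names are hypothetical; the only thing taken from the argument above is that my side does better when I am removed and the backup player takes over.

```python
def satisfies_counterfactual_criterion(goal_with_system: float,
                                       goal_without_system: float) -> bool:
    """The criterion under discussion: S counts as optimizing for G iff
    G is higher with S present than with S removed or scrambled."""
    return goal_with_system > goal_without_system


# Made-up probabilities that my side wins the game.
P_WIN_WITH_ME = 0.70          # I play my own moves
P_WIN_BACKUP_TAKES_OVER = 0.95  # I am removed and a stronger player steps in
P_WIN_NOBODY_PLAYS = 0.0      # I am removed and there is no backup

# With the hidden backup, removing me raises the win probability,
# so the criterion says I am not optimizing to win.
with_backup = satisfies_counterfactual_criterion(
    P_WIN_WITH_ME, P_WIN_BACKUP_TAKES_OVER)

# Without the backup, the same me, playing the same moves, passes.
without_backup = satisfies_counterfactual_criterion(
    P_WIN_WITH_ME, P_WIN_NOBODY_PLAYS)

print(with_backup, without_backup)
```

The verdict flips depending on something behind my shoulder that never affects how I actually play, which is the sense in which the criterion gets this case wrong.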

That said, other definitions treat this counter-example well. E.g. I think the one given in "The ground of optimization" says that I'm optimizing to win the game (maybe only if I'm playing a weaker opponent).

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan1y40

Thanks for finding this! Will link it in the transcript.
