Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Popular Comments

Recent Discussion

I've been accepted as a mentor for the next AI Safety Camp. You can apply to work with me on the tiling problem. The goal will be to develop reflectively consistent UDT-inspired decision theories, and try to prove tiling theorems for them.

The deadline for applicants is November 17.

The program will run from January 11 to April 27. It asks for a 10 hour/week commitment.

I am not being funded for this.[1] You can support my work on Patreon.

My project description follows:

Summary

The Tiling Agents problem (aka reflective consistency) consists of analysing when one agent (the "predecessor") will choose to deliberately modify another agent (the "successor"). Usually, the predecessor and successor are imagined as the same agent across time, so we are studying self-modification. A set of properties "tiles" if those...

Towards_Keeperhood (2)
Thanks for writing up some of the theory of change for the tiling agents agenda! I'd be curious about your take on the importance of the Löbian obstacle: I feel like it's important to do this research for aligning full-blown RSI-to-superintelligence, but at the same time it introduces quite a bit of extra difficulty, and I'd be more excited about research (which ultimately aims for pivotal-act-level alignment) where we're fine assuming some "fixed meta level" in the learning algorithm, but general enough that the object-level AI can get very powerful. It seems to me that this might make it easier to prove (or heuristically argue) that the AI will end up with some desirable properties.

Relatedly, I feel like on Arbital there were the categories "RSI" and "KANSI", but AFAICT not clearly some third category like "unknown-algorithm non-full-self-improving (UANFSI?) AI". (Where IMO current deep learning clearly fits into the third category, though there might be a lot more out there that would too.) I'm currently working on KANSI AI, but if I weren't, I'd be a bit more excited about (formal) UANFSI approaches than full RSI theory, especially since the latter seems to have been tried more. (E.g. I'd guess I'd classify Vanessa Kosoy's work as UANFSI, but I haven't looked much at it yet.) (Also, there can still be some self-improvement for UANFSI AIs, but as said, there would be some meta level that stays fixed.)

But possibly I'm strongly misunderstanding something (e.g. maybe the Löbian obstacle isn't that central?). (In any case I think there ought to be multiple people continuing this line of work.)

How would you become confident that a UANFSI approach was NFSI?

Abram Demski (5)
I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough, and with much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumptions aren't precisely true).

I do think Löb-ish reasoning still seems potentially important for coordination and cooperation, which I expect to feature in important tiling results (if this research program continues to make progress). However, I am optimistic about replacing Löb's Theorem with Payor's Lemma in this context.

I don't completely discount the pivotal-act approach, but I am currently more optimistic about developing safety criteria & designs which could achieve some degree of consensus amongst researchers, and make their way into commercial AI, perhaps through regulation.

A simple, weak notion of corrigibility is having a "complete" feedback interface. In logical induction terms, I mean that the AI trainer can insert any trader into the market. I want to contrast this with "partial" feedback, in which only some propositions get feedback and others ("latent" propositions) form the structured hypotheses which help predict the observable propositions -- for example, RL, where only rewards and sense-data are observed.

(Note: one might think that the ability to inject traders into LI is still "incomplete" because traders can give feedback on the propositions themselves, not on other traders; so the trader weights constitute "latents" being estimated. However, a trader can effectively vote against another trader by computing all that trader's trades and counterbalancing them. Of course, we can also more...
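As an illustrative toy sketch (the names and setup here are mine, not from the post): if a trader is just a map from market prices to trades on propositions, then one trader can "vote against" another by computing that trader's trades and taking the exact opposite positions, so trader-level feedback reduces to proposition-level trades.

```python
# Toy model (assumed, highly simplified): a trader maps market prices to
# trades (positive = buy, negative = sell) on named propositions. A
# "counter-trader" neutralizes another trader by offsetting its trades.
from typing import Callable, Dict

Prices = Dict[str, float]
Trades = Dict[str, float]
Trader = Callable[[Prices], Trades]

def optimist(prices: Prices) -> Trades:
    # Buys one unit of any proposition priced below 0.9.
    return {p: 1.0 for p, v in prices.items() if v < 0.9}

def counter(target: Trader) -> Trader:
    # A trader that computes `target`'s trades and takes the opposite side.
    def trade(prices: Prices) -> Trades:
        return {p: -q for p, q in target(prices).items()}
    return trade

prices = {"phi": 0.5, "psi": 0.95}
combined = {p: optimist(prices).get(p, 0.0) + counter(optimist)(prices).get(p, 0.0)
            for p in prices}
print(combined)  # net position is zero on every proposition
```

This is only meant to make the counterbalancing point concrete; a real logical induction market also has budgets, conditional trades, and price dynamics that this sketch omits.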

Vanessa Kosoy (3)
I feel that this post would benefit from having the math spelled out. How is inserting a trader a way to do feedback? Can you phrase classical RL like this?

Yeah, I totally agree. This was initially a quick private message to someone, but I thought it was better to post it publicly despite the inadequate explanations. I think the idea deserves a better write-up.

This post is a follow-up to "why assume AGIs will optimize for fixed goals?".  I'll assume you've read that one first.

I ended the earlier post by saying:

[A]gents with the "wrapper structure" are inevitably hard to align, in ways that agents without it might not be.  An AGI "like me" might be morally uncertain like I am, persuadable through dialogue like I am, etc.

It's very important to know what kind of AIs would or would not have the wrapper structure, because this makes the difference between "inevitable world-ending nightmare" and "we're not the dominant species anymore."  The latter would be pretty bad for us too, but there's a difference!

In other words, we should try very hard to avoid creating new superintelligent agents that have the "wrapper structure."

What about...

I continue to think this is a great post. Part of why I think that is that I haven't forgotten it; it keeps circling back into my mind.

Recently this happened and I made a fun connection: What you call wrapper-minds seem similar to what Plato (in The Republic) calls people-with-tyrannical-souls. i.e. people whose minds are organized the way a tyrannical city is organized, with a single desire/individual (or maybe a tiny junta) in total control, and everything else subservient.

I think the concepts aren't exactly the same though -- Plato would have put more e...

Subhash and Josh are co-first authors. Work done as part of the two-week research sprint in Neel Nanda's MATS stream.

TLDR

  • We show that dense probes trained on SAE encodings are competitive with traditional activation probing over 60 diverse binary classification datasets
  • Specifically, we find that SAE probes have advantages with:
    • Low-data regimes (fewer than ~100 training examples)
    • Corrupted data (i.e. our training set has some incorrect labels, while our test set is clean)
    • Settings where we worry about probe generalization due to spurious correlations or possible mislabels in our dataset (dataset interpretability), or where we want to understand SAE features better (SAE interpretability)
  • We find null results when comparing SAE probes to activation probes with OOD data in other settings or with imbalanced classes.
  • We find that higher width and L0 are determining
...
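For concreteness, a hedged sketch of the comparison described above (not the authors' code): a dense logistic-regression probe fit on raw activations versus one fit on SAE encodings. The activations and labels here are synthetic placeholders, and the "SAE" is an untrained random ReLU encoder standing in for a trained sparse autoencoder.

```python
# Illustrative sketch only: compare a dense probe on raw activations vs.
# one on (stand-in) SAE encodings. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_model, d_sae = 100, 64, 512         # small n, mimicking a low-data regime
acts = rng.normal(size=(n, d_model))     # placeholder model activations
labels = (acts[:, 0] > 0).astype(int)    # placeholder binary labels

# Stand-in "SAE" encoder: ReLU(acts @ W_enc). A real SAE would be trained.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
sae_codes = np.maximum(acts @ W_enc, 0.0)

act_probe = LogisticRegression(max_iter=1000).fit(acts, labels)
sae_probe = LogisticRegression(max_iter=1000).fit(sae_codes, labels)
print(act_probe.score(acts, labels), sae_probe.score(sae_codes, labels))
```

The post's actual experiments sweep 60 real datasets, corruption levels, and SAE width/L0; this sketch only shows the shape of the probing setup.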

Fluid interfaces for sensemaking at pace with AI development.

Disclaimer: this is both an invitation and something of a reference doc. I've put the more reference-y material in collapsible sections at the end.


0. What is this?

This is an extended rendezvous for creating culture and technology, for those enlivened by reimagining AI-based interfaces.

Specifically, we’re creating for the near future where our infrastructure is embedded with realistic...

This is a linkpost for https://www.thecompendium.ai/

We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.

We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but a “comprehensive worldview” doc has been missing. We’ve tried our best to fill this gap, and welcome feedback and debate about the arguments. The Compendium is a living document, and we’ll keep updating it as we learn more and change our minds.

We would appreciate your feedback, whether or not you agree with us:

  • If you do agree with us, please point out where you
...

From footnote 2 to The state of AI today:

GPT-2 cost an estimated $43,000 to train in 2019; today it is possible to train a 124M parameter GPT-2 for $20 in 90 minutes.

Isn't $43,000 the estimate for the 1.5B replication of GPT-2 rather than for the 124M? If so, this phrasing is somewhat misleading. We only need $250 even for the 1.5B version, but still.

Vladimir Nesov (3)
From chapter The state of AI today: It's not the first; there's the xAI cluster from September, and likely a Microsoft cluster from May. Even the cited The Information article says about the Meta cluster in question that
Vladimir Nesov (6)
From chapter The state of AI today:

Clusters like xAI's Memphis datacenter with 100K H100s consume about 150 megawatts. An average US household consumes 10,800 kilowatt-hours a year, which is 1.23 kilowatts on average. So the power consumption of a 100K H100s cluster is equivalent to that of 121,000 average US households, not 1,000 average US households. If we take the cluster of 16K H100s that trained Llama-3-405B, that's still 24 megawatts, equivalent to 19,000 average US households.

So you likely mean the amount of energy (as opposed to power) consumed in training a model ("yearly consumption of 1000 average US households"). The all-in power consumption of a cluster of H100s comes to about 1500 watts per GPU, and that GPU at 40% compute utilization produces 0.4e15 FLOP/s of useful dense BF16 compute. Thus about 3.75e-12 joules are expended per FLOP that goes into training a model. For the 4e25 FLOPs of Llama-3-405B, that's 1.5e14 joules, or 41e6 kilowatt-hours, which is consumed by 3,800 average US households in a year.[1]

This interpretation fits the numbers better, but it's a bit confusing, since the model is trained for much less than a year, while the clusters will go on consuming their energy all year long. And the power constraints that are a plausible proximal blocker of scaling are about power, not energy.

1. If we instead take the 2e25 FLOPs attributed to original GPT-4, and the 700 watts of a single H100, while ignoring the surrounding machinery of a datacenter (even though you are talking about what a datacenter consumes in this quote, so this is an incorrect way of estimating energy consumption), and train on H100s (instead of the A100s used for original GPT-4), then this gives 9.7e6 kilowatt-hours, or the yearly consumption of 900 average US households. With A100s, we instead have 400 watts and 0.3e15 FLOP/s (becoming 0.12e15 FLOP/s at 40% utilization), which gets us 18.5e6 kilowatt-hours for a 2e25 FLOPs model, or yea...