Shortform Content

AI cognition doesn't have to use alien concepts to be uninterpretable. We've never fully interpreted human cognition, either, and we know that our introspectively accessible reasoning uses human-understandable concepts.

Just because your thoughts are built using your own concepts does not mean your concepts can describe how your thoughts are computed.


The existence of a natural-language description of a thought (like "I want ice cream") doesn't mean that your brain computed that thought in a way which can be compactly described by familiar concepts... (read more)

1 · Garrett Baker · 12d
I don't think the conclusion follows from the premises. People often learn new concepts after studying stuff, and it seems likely (to me) that when studying human cognition, we'd first be confused because our previous concepts weren't sufficient to understand it, and then slowly stop being confused as we built & understood concepts related to the subject. If an AI's thoughts are like human thoughts, given a lot of time to understand them, what you describe doesn't rule out that the AI's thoughts would be comprehensible. The mere existence of concepts we don't know about in a subject doesn't mean that we can't learn those concepts. Most subjects have new concepts.

I agree that with time, we might be able to understand. (I meant to communicate that via "might still be incomprehensible")

I have some long comments I can't refind now (weirdly) about the difficulty of investing based on AI beliefs (or forecasting in general): similar to catching falling knives, timing is all-important and yet usually impossible to nail down accurately; specific investments are usually impossible if you aren't literally founding the company, and indexing 'the entire sector' definitely impossible. Even if you had an absurd amount of money, you could try to index and just plain fail - there is no index which covers, say, OpenAI.

Apropos, Matt Levine comments on o... (read more)

Here's a meme I've been paying attention to lately, which I think is both just-barely fit enough to spread right now and very high-value to spread.

Meme part 1: a major problem with RLHF is that it directly selects for failure modes which humans find difficult to recognize, hiding problems, deception, etc. This problem generalizes to any sort of direct optimization against human feedback (e.g. just fine-tuning on feedback), optimization against feedback from something emulating a human (a la Constitutional AI or RLAIF), etc.

Many people will then respond: "O... (read more)

I agree that there's something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn't a crux.)

(Medium confidence) FWIW, RLHF'd models (specifically, the Llama-2-chat series) seem substantially easier to activation-steer than their base counterparts.

Offline RL can work well even with wrong reward labels. I think alignment discourse over-focuses on "reward specification." I think reward specification is important, but far from the full story. 

To this end, a new paper (Survival Instinct in Offline Reinforcement Learning) supports Reward is not the optimization target and associated points that reward is a chisel which shapes circuits inside of the network, and that one should fully consider the range of sources of parameter updates (not just those provided by a reward signal). 

Some relevant qu... (read more)

What is "shard theory"? I've written a lot about shard theory. I largely stand by these models and think they're good and useful. Unfortunately, lots of people seem to be confused about what shard theory is. Is it a "theory"? Is it a "frame"? Is it "a huge bag of alignment takes which almost no one wholly believes except, perhaps, Quintin Pope and Alex Turner"?

I think this understandable confusion happened because my writing didn't distinguish between: 

  1. Shard theory itself, 
    1. IE the mechanistic assumptions about internal motivational structure, whic
... (read more)

Strong encouragement to write about (1)!

Five clusters of alignment researchers

Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:

  1. MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that's very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
  2. Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-a
... (read more)
2 · Vanessa Kosoy · 2mo
Master post for ideas about metacognitive agents.

Recording of a talk I gave in VAISU 2023.

2 · Vanessa Kosoy · 2mo
Here is the sketch of a simplified model for how a metacognitive agent deals with traps. Consider some (unlearnable) prior ζ over environments, s.t. we can efficiently compute the distribution ζ(h) over observations given any history h. For example, any prior over a small set of MDP hypotheses would qualify. Now, for each h, we regard ζ(h) as a "program" that the agent can execute and form beliefs about. In particular, we have a "metaprior" ξ consisting of metahypotheses: hypotheses-about-programs.

For example, if we let every metahypothesis be a small infra-RDP satisfying appropriate assumptions, we probably have an efficient "metalearning" algorithm. More generally, we can allow a metahypothesis to be a learnable mixture of infra-RDPs: for instance, there is a finite state machine for specifying "safe" actions, and the infra-RDPs in the mixture guarantee no long-term loss upon taking safe actions.

In this setting, there are two levels of learning algorithms:

  • The metalearning algorithm, which learns the correct infra-RDP mixture. The flavor of this algorithm is RL in a setting where we have a simulator of the environment (since we can evaluate ζ(h) for any h). In particular, here we don't worry about exploitation/exploration tradeoffs.
  • The "metacontrol" algorithm, which, given an infra-RDP mixture, approximates the optimal policy. The flavor of this algorithm is "standard" RL with exploitation/exploration tradeoffs.

In the simplest toy model, we can imagine that metalearning happens entirely in advance of actual interaction with the environment. More realistically, the two need to happen in parallel. It is then natural to apply metalearning to the current environmental posterior rather than the prior (i.e. the histories starting from the history that already occurred). Such an agent satisfies "opportunistic" guarantees: if at any point of time, the posterior admits a useful metahypothesis, the agent can exploit this metahypothesis. Thus,

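Consider what update equations have to say about "training game" scenarios. In PPO, the optimization objective is proportional to the advantage given a policy $\pi$, reward function $R$, and on-policy value function $v^\pi$:

$$A^\pi(s, a) := \mathbb{E}_{s' \sim T(s, a)}\left[R(s, a, s') + \gamma v^\pi(s')\right] - v^\pi(s)$$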

Consider a mesa-optimizer acting to optimize some mesa objective. The mesa-optimizer understands that it will be updated proportional to the advantage. If the mesa-optimizer maximizes reward, this corresponds to maximizing the intensity of the gradients it receives, thus maximally updatin... (read more)
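To make the advantage in the PPO paragraph concrete, here is a minimal sketch for a tiny tabular MDP. It assumes the standard one-step form (expected reward plus discounted next-state value, minus the current state's value); the toy MDP, the array layout, and the names `one_step_advantage` and `on_policy_values` are illustrative assumptions, not anything from the original post.

```python
import numpy as np


def on_policy_values(R, T, pi, gamma):
    """Solve v = r_pi + gamma * P_pi @ v exactly for a small tabular MDP.

    R[s, a, t]: reward for the transition s --a--> t
    T[s, a, t]: probability of landing in t after taking a in s
    pi[s, a]:   probability the policy takes a in s
    """
    P_pi = np.einsum("sa,sat->st", pi, T)        # state-to-state transitions under pi
    r_pi = np.einsum("sa,sat,sat->s", pi, T, R)  # expected one-step reward under pi
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)


def one_step_advantage(R, T, v, gamma):
    """A[s, a] = E_{t ~ T(s, a)}[R(s, a, t) + gamma * v(t)] - v(s)."""
    q = np.einsum("sat,sat->sa", T, R + gamma * v[None, None, :])
    return q - v[:, None]


# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1,
# and arriving in state 1 pays reward 1.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

pi = np.full((2, 2), 0.5)  # uniformly random policy
gamma = 0.9

v = on_policy_values(R, T, pi, gamma)
A = one_step_advantage(R, T, v, gamma)
print(A)
```

Under the policy's own action distribution the advantages average to zero in every state, so the update signal only says which actions did better or worse than the policy's current expectation.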

Today, Anthropic, Google, Microsoft and OpenAI are announcing the formation of the Frontier Model Forum, a new industry body focused on ensuring safe and responsible development of frontier AI models. The Frontier Model Forum will draw on the technical and operational expertise of its member companies to benefit the entire AI ecosystem, such as through advancing technical evaluations and benchmarks, and developing a public library of solutions to s

... (read more)

I'm currently excited about a "macro-interpretability" paradigm. To quote Joseph Bloom:

TLDR: Documenting existing circuits is good, but explaining what relationship circuits have to each other within the model, such as by understanding how the model allocates limited resources such as residual stream and weights between different learnable circuits, seems important.

The general topic I think we are getting at is something like "circuit economics". The thing I'm trying to gesture at is that while circuits might deliver value in distinct ways (such as redu

... (read more)

I'm also excited by tactics like "fully reverse engineer the important bits of a toy model, and then consider what tactics and approaches would -- in hindsight -- have quickly led you to understand the important bits of the model's decision-making."

Conception is a startup trying to do in vitro gametogenesis for humans!

Handling compute overhangs after a pause. 

Sometimes people object that pausing AI progress for e.g. 10 years would lead to a "compute overhang": At the end of the 10 years, compute will be cheaper and more abundant than it is today. Accordingly, once AI progress is unpaused, labs will cheaply train models which are far larger and smarter than before the pause. We will not have had time to adapt to models of intermediate size and intelligence. Some people believe this is a good reason to not pause AI progress.

There seem to be a range of relatively simple pol... (read more)

Cheaper compute is about as inevitable as more capable AI; neither is a law of nature. Both are valid targets for hopeless regulation.

Consider two claims:

  • Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model
  • Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility

These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.
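The first claim rests on a trivial construction, which can be sketched as follows: for any deterministic policy, define a utility function that scores the policy's own action 1 and everything else 0. The policy below, the action names, and the helper names are hypothetical illustrations; the sketch only shows that such a utility function exists, not that it is compact or useful.

```python
def utility_rationalizing(policy):
    """Build u(history, action) that the given deterministic policy maximizes."""
    def utility(history, action):
        # 1 for whatever the policy would do in this history, 0 for anything else
        return 1.0 if action == policy(history) else 0.0
    return utility


def argmax_action(utility, history, actions):
    """Action a utility maximizer would pick in this history."""
    return max(actions, key=lambda a: utility(history, a))


ACTIONS = ("continue", "shut_down")


def corrigible_policy(history):
    # Hypothetical stand-in for a corrigible system: it complies whenever
    # the most recent observation is a shutdown request.
    if history and history[-1] == "shutdown_request":
        return "shut_down"
    return "continue"


u = utility_rationalizing(corrigible_policy)
histories = [(), ("task",), ("task", "shutdown_request")]
matches = all(
    argmax_action(u, h, ACTIONS) == corrigible_policy(h) for h in histories
)
print(matches)
```

The utility function here depends on the whole history and simply restates the policy, which is why the "any system is a utility maximizer" framing carries so little information by itself.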

I exp... (read more)

Expected Utility Maximization is Not Enough

Consider a homomorphically encrypted computation running somewhere in the cloud. The computations correspond to running an AGI. Now from the outside, you can still model the AGI based on how it behaves, as an expected utility maximizer, if you have a lot of observational data about the AGI (or at least let's take this as a reasonable assumption).

No matter how closely you look at the computations, you will not be able to figure out how to change these computations in order to make the AGI aligned if it was not alig... (read more)

So here's a paper: Fundamental Limitations of Alignment in Large Language Models. With a title like that you've got to at least skim it. Unfortunately, the quick skim makes me pretty skeptical of the paper.

The abstract says "we prove that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt." This clearly can't be true in full generality, and I wish the abstract would give me some hint about ... (read more)

3 · Rohin Shah · 3mo
You're right, I incorrectly interpreted the sup as an inf, because I thought that they wanted to assume that there exists a prompt creating an adversarial example, rather than saying that every prompt can lead to an adversarial example. I'm still not very compelled by the theorem -- it's saying that if adversarial examples are always possible (the sup condition you mention) and you can always provide evidence for or against adversarial examples (Definition 2) then you can make the adversarial example probable (presumably by continually providing evidence for adversarial examples). I don't really feel like I've learned anything from this theorem.

My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution 

such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have . Together with the assumption that  is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for  by stringing together bad se... (read more)
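The mechanism described here can be illustrated with a small sketch. The comment's own symbols were lost in extraction, so everything named below is an assumption: a two-component mixture with a "bad" and a "good" component, a prior weight on the bad component, and a fixed likelihood ratio for each appended sentence. It only shows how Bayesian updating in such a mixture drives the posterior toward the bad component as evidence accumulates.

```python
def posterior_bad_weight(prior_bad, likelihood_ratios):
    """Posterior weight on the 'bad' mixture component after each sentence.

    likelihood_ratios[i] is assumed to be
    P_bad(sentence_i | prefix) / P_good(sentence_i | prefix).
    """
    odds = prior_bad / (1.0 - prior_bad)
    weights = []
    for ratio in likelihood_ratios:
        odds *= ratio                        # Bayes' rule in odds form
        weights.append(odds / (1.0 + odds))  # convert odds back to a probability
    return weights


# Hypothetical numbers: the bad component starts at 1% weight, and each
# sentence in the prompt is 3x more likely under it than under the good one.
weights = posterior_bad_weight(prior_bad=0.01, likelihood_ratios=[3.0] * 8)
print([round(w, 3) for w in weights])
```

Each sentence multiplies the odds by its likelihood ratio, so a longer prompt made of such sentences pushes the posterior arbitrarily close to 1, provided the mixture assumption holds.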

1 · Lukas Finnveden · 3mo
Yeah, I also don't feel like it teaches me anything interesting.

Basilisks are a great example of plans which are "trying" to get your plan evaluation procedure to clock in a huge upwards error. Sensible beings avoid considering such plans, and everything's fine. I am somewhat worried about an early-training AI learning about basilisks before the AI is reflectively wise enough to reject the basilisks. 

For example: 

- Pretraining on a corpus in which people worry about basilisks could elevate reasoning about basilisks to the AI's consideration, 

- at which point the AI reasons in more detail because it's not... (read more)

I regret each of the thousands of hours I spent on my power-seeking theorems, and sometimes fantasize about retracting one or both papers. I am pained every time someone cites "Optimal policies tend to seek power", and despair that it is included in the alignment 201 curriculum. I think this work makes readers actively worse at thinking about realistic trained systems.

I think a healthy alignment community would have rebuked me for that line of research, but sadly I only remember about two people objecting that "optimality" is a horrible way of understanding trained policies. 

Since I'm an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven's: there's a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.

I also think this has basically always been my attitude towards this paper. In particular, I don't think I ever thought of this paper as provid... (read more)

3 · Victoria Krakovna · 4mo
Thanks Alex! Your original comment didn't read as ill-intended to me, though I wish that you'd just messaged me directly. I could have easily missed your comment in this thread - I only saw it because you linked the thread in the comments on my post. Your suggested rephrase helps to clarify how you think about the implications of the paper, but I'm looking for something shorter and more high-level to include in my talk. I'm thinking of using this summary, which is based on a sentence from the paper's intro: "There are theoretical results showing that many decision-making algorithms have power-seeking tendencies." (Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)
2 · Alex Turner · 4mo
I think this is reasonable, although I might say "suggesting" instead of "showing." I think I might also be more cautious about further inferences which people might make from this -- like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.

A three-pronged approach to AGI safety. (This is assuming we couldn't just avoid building AGI or proto-AGIs at all until say ~2100, which would of course be much better).

Prong 1: boxing & capability control (aka ‘careful bootstrapping’)

  • Make the AGI as capable as possible, under the constraint that you can make sure it can’t break out of the box or do other bad stuff. 
  • Assume the AGI is misaligned. Be super paranoid
  • Goal: get useful work out of boxed AGIs.
    • For example, AIs might be able to do interpretability really well.
    • More generally, for any field
... (read more)

There are positive feedback loops between prongs:

  • Successfully containing & using more capable models (p1) gives you more scary demos for p2
  • Success in p1 also speeds up p3 a lot, because:
    • 1) You can empirically study AGI directly, 
    • 2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
    • 3) Maybe you can even have (proto-)AGIs do research for you
  • You definitely need a lot of success in p2 for anything to work, otherwise people will take all the useful work we can get from proto-AGIs and pour i
... (read more)

Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):

  • It would be nice to study competent misgeneralization empirically, to give examples and maybe help us develop theory around it.
  • Problem: how do you measure 'competence' without reference to a goal??
  • Prior work has used the 'agents vs devices' framework, where you have a distribution over all reward functions, some likelihood distribution over what 'real agents' would do given a certain reward function, and do Bayesian i
... (read more)

Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.

Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:

  1. Ability to be deceptively aligned
  2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
  3. Incentives to break containment exist in a way that is accessible / understandable to the model
  4. Ability to break containment
  5. Ability to robustly understand human intent
  6. Situational awareness
  7. Coherence / robustly pursuing its goal in a diverse set of circumstances
  8. Interpretability methods break (or other ove
... (read more)

I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance (my current guess is neutral), but I do see these similarities:

  • It's incredibly difficult and incentive-incompatible with existing groups in power
  • There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
  • There are some obvious negative effects; potential overhangs or greater inc
... (read more)

There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals

IMO neither making the field of alignment 10x larger nor requiring evals would solve a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).

2 · Alex Turner · 5mo
Why does this have to be true? Can't governments just compensate existing AGI labs for the expected commercial value of their foregone future advances due to indefinite pause? 
1 · Thomas Kwa · 5mo
This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work. Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to those in the nuclear or aviation industries feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.