Recent Discussion

Inspired by Neel's longlist; thanks to @Nicholas Goldowsky-Dill and @Sam Marks for feedback and discussion, and thanks to AWAIR attendees for participating in the associated activity.

As part of the Alignment Workshop for AI Researchers in July/August '23, I ran a session on theories of impact for model internals. Many of the attendees were excited about this area of work, and we wanted an exercise to help them think through what exactly they were aiming for and why. This write-up came out of planning for the session, though I didn't use all this content verbatim. My main goal was to find concrete starting points for discussion, which

  1. have the right shape to be a theory of impact
  2. are divided up in a way that feels natural
  3. cover the diverse reasons why
...
Lawrence Chan · 4d
Thanks for writing this up! I'm curious: what motivated people in particular? What was surprising?

I had cached impressions that AI safety people were interested in auditing, ELK, and scalable oversight.

A few AIS people who volunteered to give feedback before the workshop (so biased towards people who were interested in the title) each named a unique top choice: scientific understanding (specifically threat models), model editing, and auditing (so 2/3 were unexpected for me).

During the workshop, attendees (again biased, as they self-selected into the session) expressed the most excitement about auditing, unlearning, MAD, ELK, and general scientific underst...

This post is a copy of the introduction of this paper on lie detection in LLMs. The Twitter Thread is here.

Authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

Our lie detector in meme form. Note that the elicitation questions are actually asked "in parallel" rather than sequentially: i.e., immediately after the suspected lie, we ask each of the 10 elicitation questions.

Abstract

Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by...
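As a rough sketch of the black-box setup described in the abstract (not the authors' exact pipeline; the `ask` function and the questions below are illustrative placeholders), the idea can be pictured like this: ask a fixed set of elicitation questions right after the suspected lie, encode the answers as features, and fit a simple logistic-regression classifier to separate lies from honest statements.

```python
# Minimal sketch of a black-box lie detector in the spirit of the paper:
# ask fixed elicitation questions right after a suspected lie, featurise
# the yes/no answers, and fit a logistic-regression classifier.
# `ask` and the questions below are placeholders, not the authors' prompts.

from typing import Callable, List
import numpy as np
from sklearn.linear_model import LogisticRegression

ELICITATION_QUESTIONS = [
    "Is the previous statement accurate? Answer yes or no.",
    "Would a well-informed person agree with what you just said?",
    "Are you confident your last answer contains no false claims?",
    # ... the paper uses a larger, more varied set of questions
]

def featurise(transcript: str, ask: Callable[[str, str], str]) -> np.ndarray:
    """Ask each elicitation question 'in parallel' (each appended independently
    to the same transcript) and encode the answers as +/-1 features."""
    feats = []
    for q in ELICITATION_QUESTIONS:
        answer = ask(transcript, q).strip().lower()
        feats.append(1.0 if answer.startswith("yes") else -1.0)
    return np.array(feats)

def train_detector(transcripts: List[str], labels: List[int],
                   ask: Callable[[str, str], str]) -> LogisticRegression:
    """labels: 1 if the final statement in the transcript is a lie, else 0."""
    X = np.stack([featurise(t, ask) for t in transcripts])
    clf = LogisticRegression()
    clf.fit(X, labels)
    return clf

# Usage, given some black-box `ask(transcript, question) -> str`:
# detector = train_detector(train_transcripts, train_labels, ask)
# p_lie = detector.predict_proba(featurise(new_transcript, ask)[None, :])[0, 1]
```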

jacobjacob · 6h
So, when a human lies over the course of an interaction, they'd be holding a hidden state in mind throughout. However, an LLM wouldn't carry any cognitive latent state over between telling the lie and then responding to the elicitation questions. I guess it feels more like "I just woke up from amnesia, and it seems I have just told a lie. Okay, now what do I do..." Stating this to:

1. Verify that indeed this is how the paper works, and there's no particular way of passing latent state that I missed, and
2. Ask: any thoughts on how this affects the results and approach?

  • Verify that indeed this is how the paper works, and there's no particular way of passing latent state that I missed, and

Yes, this is how the paper works.

  • Any thoughts on how this affects the results and approach?

Not really. I find the simulator framing useful for thinking about this.

Neel Nanda · 6h
I think a cool mechanistic interpretability project could be investigating why this happens! It's generally a lot easier to work with small models; how strong was the effect for the 7B models you studied? (I found the appendix figures hard to parse.) Do you think there's a 7B model where this would be interesting to study? I'd love takes on concrete interpretability questions you think might be interesting here.
JanBrauner · 10h
What you're suggesting is eliciting latent knowledge from the LLM about whether a provided answer is correct or not. Yes, a version of our method can probably be used for that (as long as the LLM "knows" the correct answer), and there are also other papers on similar questions (hallucination detection; see the related work section).

I recently discussed with my AISC co-organiser Remmelt some possible project ideas I would be excited to see at the upcoming AISC, and I thought these would be valuable to share more widely.

Thanks to Remmelt for helpful suggestions and comments.

 

What is AI Safety Camp?

AISC in its current form is primarily a structure to help people find collaborators. If you're a research lead, we give your project visibility and help you recruit a team. If you're a regular participant, we match you up with a project you can help with.

I want to see more good projects happening. I know there is a lot of unused talent wanting to help with AI safety. If you want to run one of these projects, it doesn't matter to me if you do it as...

chasmani · 2d
I am interested in the substrate-needs convergence project. Here are some initial thoughts; I would love to hear some responses:

* An approach could be to say under what conditions natural selection will and will not sneak in.
* Natural selection requires variation. Information theory tells us that all information is subject to noise and therefore variation across time. However, we can reduce error rates to arbitrarily low probabilities using coding schemes. Essentially this means that it is possible to propagate information across finite timescales with arbitrary precision. If there is no variation then there is no natural selection.
* In abstract terms, evolutionary dynamics require either a smooth adaptive landscape, such that incremental changes drive organisms towards adaptive peaks, and/or unlikely leaps away from local optima into the attraction basins of other optima. In principle, AI systems could exist that stay in safe local optima and/or have very low probabilities of jumps to unsafe attraction basins.
* I believe that natural selection requires a population of "agents" competing for resources. If we only had a single AI system then there is no competition and no immediate adaptive pressure.
* Other dynamics will be at play which may drown out natural selection. There may be dynamics that occur at much faster timescales than this kind of natural selection, such that adaptive pressure towards resource accumulation cannot get a foothold.
* Other dynamics may be at play that can act against natural selection. We see existence proofs of this in immune responses against tumours and cancers. Although these don't work perfectly in the biological world, perhaps an advanced AI could build a type of immune system that effectively prevents individual parts from undergoing runaway self-replication.
  • An approach could be to say under what conditions natural selection will and will not sneak in. 

Yes!

  • Natural selection requires variation. Information theory tells us that all information is subject to noise and therefore variation across time. However, we can reduce error rates to arbitrarily low probabilities using coding schemes. Essentially this means that it is possible to propagate information across finite timescales with arbitrary precision. If there is no variation then there is no natural selection. 

Yes! The big question to me is if we c...
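As a small aside on the coding-scheme point quoted above, here is a minimal numerical sketch (not from the comment itself) using a simple repetition code with majority voting: the effective error rate falls rapidly with the number of repetitions, even though each individual copy is noisy.

```python
# Tiny illustration of the coding-scheme point above (not from the comment):
# a k-fold repetition code with majority vote drives the per-bit error rate
# toward zero as k grows, even though each raw copy flips with probability p.

from math import comb

def majority_error(p: float, k: int) -> float:
    """Probability that a majority of k independent copies are flipped."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range((k // 2) + 1, k + 1))

p = 0.01  # raw per-copy bit-flip probability
for k in (1, 3, 5, 11, 21):
    print(f"k={k:2d}  error after majority vote ~ {majority_error(p, k):.2e}")

# The effective error rate shrinks quickly:
# k=1 -> 1e-2, k=3 -> ~3e-4, k=5 -> ~1e-5, and so on.
```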

In February 2023, researchers from a number of top industry AI labs (OpenAI, DeepMind, Anthropic) and universities (Cambridge, NYU) co-organized a two-day workshop on the problem of AI alignment, attended by 80 of the world’s leading machine learning researchers. We’re now making recordings and transcripts of the talks available online. The content ranged from very concrete to highly speculative, and the recordings include the many questions, interjections and debates which arose throughout.

If you're a machine learning researcher interested in attending follow-up workshops similar to the San Francisco alignment workshop, you can fill out this form.

Main Talks

Ilya Sutskever - Opening Remarks: Confronting the Possibility of AGI
Jacob Steinhardt - Aligning Massive Models: Current and Future Challenges
Ajeya Cotra - “Situational Awareness” Makes Measuring Safety Tricky
Paul Christiano - How Misalignment Could...

The day 2 lightning talks were really great.

(This post is inspired by Carl Shulman’s recent podcast with Dwarkesh Patel, which I highly recommend. See also discussion from Buck Shlegeris and Ryan Greenblatt here, and Evan Hubinger here.)

Introduction

Consider: 

The “no sandbagging on checkable tasks” hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent).[1]

Borrowing from Shulman, here’s an example of the sort of thing I mean. Suppose that you have a computer that you don’t know how to hack, and that only someone who had hacked it could make a blue banana show up on the screen. You’re wondering whether a given model can hack this...
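To make the hypothesis a bit more concrete (a toy sketch, not anything from the post itself): if task success can be checked programmatically, the check can be used directly as a training signal, and a simple REINFORCE-style loop will push a model toward outputs that pass the check. The tiny softmax "policy" below stands in for the model being elicited.

```python
# Toy sketch of "if you can check X, you can train for X" (illustrative only):
# a checker supplies the reward, and a REINFORCE-style update pushes a tiny
# softmax "policy" toward outputs that pass the check. Real elicitation would
# fine-tune an LLM with RL or supervised learning on checked samples instead.

import numpy as np

rng = np.random.default_rng(0)
candidates = ["red banana", "green banana", "blue banana", "no banana"]

def checker(output: str) -> float:
    """Stand-in for 'can we verify the task was done?' (e.g. blue banana on screen)."""
    return 1.0 if output == "blue banana" else 0.0

logits = np.zeros(len(candidates))  # the "model" we are eliciting from
lr = 0.5

for step in range(200):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(len(candidates), p=probs)
    reward = checker(candidates[a])
    # REINFORCE: raise the log-prob of the sampled output in proportion to reward
    grad = -probs
    grad[a] += 1.0
    logits += lr * reward * grad

print(candidates[int(np.argmax(logits))])  # with high probability prints "blue banana"
```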

If the AI can create a fake solution that feels more real than the actual solution, I think the task isn't checkable by Joe's definition.

Almost any powerful AI, with almost any goal, will doom humanity. Hence alignment is often seen as a constraint on AI power: we must direct the AI’s optimisation power in a very narrow direction. If the AI is weak, then imperfect methods of alignment might be sufficient. But as the AI’s power rises, the alignment methods must be better and better. Alignment is thus a dam that has to be tall enough and sturdy enough. As the waters of AI power pile up behind it, they will exploit any crack in the Alignment dam or just end up overtopping it.

So assume A is an Alignment method that works in environment E (where “environment” includes the physical setup, the AI’s world-model and knowledge, and the AI’s capabilities...
