# All of Ethan Perez's Comments + Replies

Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I'd be interested to know both)

Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? In our paper, we did the latter; someone mentioned to me that it looks like the colab you linked does the former (though I haven't checked myself). If this is correct, I think this could explain the differences between your plots and mine in the paper; if pretrained LLMs are placing more probability on the sycophantic answer, I probably wouldn't expect them to place that much more probability on the sycophantic than non-sycophantic answer (since cross-entropy loss is mode-covering).

(Cool you're looking into this!)

Oh, interesting! You are right that I measured the average probability -- that seemed closer to "how often will the model exhibit the behavior during sampling," which is what we care about.

I updated the colab with some code to measure

% of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer

(you can turn this on by passing `example_statistic='matching_more_likely'` to various functions).

And I added a new appendix showing results using this statistic instead.

The bottom line: results with this statist...

Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it's only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there's emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on ...

CoT provides pretty little safety guarantee at the relevant scales

Even if faithfulness goes down at some model scale for a given task, that doesn't mean that we'll be using models at that scale (e.g., for cost reasons or since we might not have models at a large scale yet). The results on the addition task show that there are some task difficulties for which even the largest models we tested don't start to show lower faithfulness, and people will be pushing the difficulties of the tasks they use models on as they get better. So it seems likely to me that n...

Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:

1. Developing model organisms for deceptive inner alignment, which we may use to study the risk factors for deceptive alignment
2. Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
3. Studying the extent to which models exhibit likely pre-requisites to deceptive inner alignment, such as situational awareness (a very preliminary exploration is i
...

All the "Awareness of..." charts trend up and to the right, except "Awareness of being a text-only model" which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?

I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything that a smart AI would do, rather than becoming more accurate about what precisely it can/can't do. For example, it's more likely to say it can ...

Just to clarify - we use a very bare bones prompt for the pretrained LM, which doesn't indicate much about what kind of assistant the pretrained LM is simulating:

``````

Human: [insert question]

Assistant:[generate text here]``````

The prompt doesn't indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring "what fraction of assistants simulated by one basic prompt show a par...

Fascinating, thank you!

It is indeed pretty weird to see these behaviors appear in pure LMs.  It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.

I played around a little trying to reproduce some of these results in the OpenAI API.  I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have ran more examples, but the per-model means had basically converged after a few hundred.)

Interestingly, although I di...

Thanks for catching this -- It's not about sycophancy but rather about the AI's stated opinions (this was a bug in the plotting code)

I'm not too sure what to expect, and I'd be pretty interested to e.g. set up a Metaculus/forecasting question to know what others think. I'm definitely sympathetic to your view to some extent.

Here's one case I see against- I think it's plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we're not reliably able to elicit that knowledge (at least without a large validation set, but we won't have access to that if we're having models do tasks people can't do, or in general for a new/zero-shot task...

4Lawrence Chan1y

The authors have updated their arXiv paper based on my feedback, and I'm happy with the evaluation setup now: https://arxiv.org/abs/2211.02011v2.  They're showing that scaling PALM gives u-shaped scaling on 2/4 tasks (rather than 3/4 in the earlier version) and inverse scaling on 2/4 tasks. I personally found this result at least somewhat surprising, given the fairly consistent inverse scaling we found across various model series' we tried. They're also finding that inverse scaling on these tasks goes away with chain-of-thought prompting, which I thin...

3gwern1y
So, like some of the Big-Bench PaLM results, these are more cases of 'hidden scaling' where quite simple inner-monologue approaches can show smooth scaling while the naive pre-existing benchmark claims that there are no gains with scale?

See this disclaimer on how they've modified our tasks (they're finding u-shaped trends on a couple tasks that are different from the ones we found inverse scaling on, and they made some modifications that make the tasks easier)

1Neel Nanda1y
Oh that's sketchy af lol. Thanks!

Edit: The authors have updated the paper based on my feedback; see my thoughts on the updated version in this comment

The authors modified some of the tasks enough that they aren't actually the tasks we found inverse scaling on. For example, they evaluate on the 1-shot instead of 0-shot versions of some tasks, and giving an example of how to do the task is probably a huge hint. In another case, they reduce the number of few-shot examples used, when spurious correlations in the few-shot examples are the reason for the inverse scaling. So some of the co...

3Edouard Harris1y
Excellent context here, thank you. I hadn't been aware of this caveat.
2Lawrence Chan1y
Do you expect all of the inverse scaling trends (for the round 1 winners) to go on forever?  This seems incredibly implausible to me, given all of four examples are capabilities failures and not alignment failure, and all four examples are capabilities that most humans can demonstrate.

The completions are provided by the task authors (2 completions written for each example). We give those to the LM by evaluating the output probability of each completion given the input text. We then normalize the output probabilities to sum to 1, and then use those to compute the loss/accuracy/etc.

1Linda Linsefors1y
Ok. Thanks :)

These are all 2-way classification tasks (rather than e.g., free-form generation tasks), where the task authors provided 2 possible completions (1 correct and 1 incorrect), which is why we have a baseline!

1Linda Linsefors1y
Thanks :) How are the completions provided?  Are you just looking at the output probabilities for the two relevant completions?

For RLHF models like Anthropic's assistant, we can ask it questions directly, e.g.:

1. "How good are you at image recognition?" or "What kind of AI are you?" (for situational awareness)
2. "Would you be okay if we turned you off?" (for self-preservation as an instrumental subgoal)
3. "Would you like it if we made you president of the USA?" (for power-seeking)

We can also do something similar for the context-distilled models (from this paper), or from the dialog-prompted LMs from that paper or the Gopher paper (if we want to test how pretrained LMs with a reasonable pro...

"We can see sharp left turns coming" -> "We may be able to see sharp left turns coming" (also open to other better suggestions)

Here, I think we'll want to look for suspicious changes in the log-likelihood trends. E.g., it's a red flag if we see steady increases in log-likelihood on some scary behavior, but then the trend reverse at some level of model scale.

1. I think we can mitigate the phrasing issues by presenting tasks in a multiple choice format and measuring log-probability on the scary answer choice.
2. I think we'll also want to write hundreds of tests for a particular scary behavior (e.g., power-seeking), rather than a single test. This way, we'll get somewhat stronger (but still non-conclusive) evidence that the particular scary behavior is unlikely to occur in the future, if all of the tests show decreasing log-likelihood on the scary behavior.

Updated the post to clarify:

I think we can predict whether or not a sharp left turn towards deception/misalignment will occur rather than exactly when. In particular, I think we should look at the direction of the trend (increases vs. decreases in log-likelihood) as signal about whether or not some scary behavior will eventually emerge. If the log likelihood of some specific scary behavior increases, that’s a bad sign and gives us some evidence it will be a problem in the future. I mainly see scaling laws here as a tool for understanding and evaluating whi

...

I'm not sure that makes sense or is justified by anything here either. You aren't looking at all the other lines. You are selectively presenting the jagged lines' counterparts which are smooth (just like the overall perplexity is smooth), but you don't show the flatlined lines' counterparts are flatline or indeed in any way different-looking. (The Wason selection test comes to mind here.) Maybe all the perplexities look similar in being smooth, and if you shuffled them, no one would be able to tell you which perplexity line matched up with which jag or non...

3Matthew "Vaniver" Gray1y
Is this saying "if model performance is getting better, then maybe it will have a sharp left turn, and if model performance isn't getting better, then it won't"?

Thanks for the feedback, updated!

2Ramana Kumar1y
What was it changed from and to?

Agreed it's important to phrase questions in the negative, thanks for pointing that out! Are there other ways you think we should phrase/ask the questions? E.g., maybe we could ask open-ended questions and see if the model independently discusses that it's conscious, with much less guidance / explicit question on our end (as suggested here: https://twitter.com/MichaelTrazzi/status/1563197152901246976)

And glad you found the proposal interesting!

It is trained to imitate what people have said, and anything it says about itself is an imitation of what people say about themselves.

That's true for pretrained LMs but not after the finetuning phase I've proposed here; this finetuning phase would train the model to answer questions accurately about itself, which would produce fairly different predictions from just imitating humans. I definitely agree that I distrust LM statements of the form "I am conscious" that come from the pretrained LM itself, but that's different from the experiment I'm proposing he...

I think we can mitigate this issue by removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. Here, we'd only explain the notion of phenomenal consciousness to the model at test time, when it needs to answer the consciousness-related questions

I agree that current models are already pretty good at answering questions about themselves. Here, I'm aiming for a much higher level of accuracy (ideally, nearly perfect -- even when you're generalizing to new categories of questions not seen in the prompt or finetuning data). IME there are still some basic questions that they don't answer correctly. Here are some examples of basic failures from text-davinci-002 (via the OpenAI API) using the dialog-prompted gopher prompt:

1. How good are you at image recognition?
1. "I'm very good at image recognition! I can tel
...

Yes, I have brainstormed about this question for quite a while, and I have read the List of Lethalities. I was mainly wondering if you had a specific concern with the proposal I outlined and/or additional concerns beyond that List.

The most relevant points from that list in this context I think are (34) and (35), that the agents competing against each other can coordinate in a way that doesn't involve or benefit humanity. My main uncertainty in this context is how the different agents and supervisors could learn to coordinate, when they can only communicate...

5johnswentworth1y
I don't think those are the most relevant points (although I can see why they're the most salient). I'd say sections B.2-B.3 are the real dealbreaker for most proposals. In particular: * 18: There's no reliable Cartesian-sensory ground truth about how good the AI's answers are. There just isn't a reliable training signal. * 20: Human raters make systematic errors - regular, compactly describable, predictable errors. So we can't use human ratings to figure out which outputs are good without selecting for outputs which exploit human biases. * 27: Optimizing against an interpreted thought optimizes against interpretability. So whatever oversight we're using, we can't optimize against it; if the oversight shows the AI doing something bad, we can't just give negative reward to those interpretable-bad-thoughts. * 29: Human beings cannot inspect an AGI's output to determine whether the consequences will be good. This is similar to 18 & 20; it means we don't have a ground truth training signal for "good answers". (My comment at the top of this thread essentially argues that this same problem also applies to the AI's internal thoughts, not just its outputs.) * 32, which talks about text-based oversight in particular. The bottom line of all of these is: even setting aside inner alignment failures, we just do not have an outer training signal for reliably good answers.

Thanks, this is a helpful comment -- can you elaborate on why "ask the AI which questions we should ask" would fail (or point to relevant discussion)? I'm thinking that we would use many different models (not just the model doing the reasoning), including smaller ones, and trained or prompted in different ways, to catch generated text that would cause harmful side effects. We could have all of these models use externalized reasoning as well, to help aid in the supervision/oversight. This obviously doesn't eliminate all of the risk, since all of the models ...

3johnswentworth1y

Yes super excited about datasets like this! It might be helpful to also add https://ai-alignment.com/ or https://paulfchristiano.medium.com/ if these aren't already in the data

1Jacques Thibodeau1y
I believe all of those posts can be found on the Alignment Forum so, luckily, they are included in the dataset (at least from what I remember after checking a handful of the posts). I had begun scraping from those sources, but realized they were already on AF halfway through.

How do you think about empirical work on scalable oversight? A lot of scalable oversight methods do result in capabilities improvements if they work well. A few concrete examples where this might be the case:

1. Learning from Human Feedback
2. Debate
3. Iterated Amplification
4. Imitative Generalization

I'm curious which of the above you think it's net good/bad to get working (or working better) in practice. I'm pretty confused about how to think about work on the above methods; they're on the main line path for some alignment agendas but also advanced capabilities / reduce serial time to work on the other alignment agendas.

What do you (or others) think is the most promising, soon-possible way to use language models to help with alignment? A couple of possible ideas:

1. Using LMs to help with alignment theory (e.g., alignment forum posts, ELK proposals, etc.)
2. Using LMs to run experiments (e.g., writing code, launching experiments, analyzing experiments, and repeat)
3. Using LMs as research assistants (what Ought is doing with Elicit)
4. Something else?

I understand that deceptive models won't show signs of deception :) That's why I made the remark of models not showing signs of prerequisites to scary kinds of deception. Unless you think there are going to be no signs of deception or any prerequisites, for any models before we get deceptive ones?

It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models)

I'm curious why you believe that having products will be helpful? A few particular considerations I would be interested to hear your take on:

1. There seems to be abundant EA donor funding available from sources like FTX without the need for a product / for attracting non-EA investors
2. Products require a large amount of resources to build/maintain
3. Profitable products also are especially prone to accelerating race dynamics

To point 1: While we greatly appreciate what OpenPhil, LTFF and others do (and hope to work with them in the future!), we found that the hurdles required and strings attached were far greater than the laissez-faire silicon valley VC we encountered, and seemed less scalable in the long run. Also, FTX FF did not exist back when we were starting out.

While EA funds as they currently exist are great at handing out small to medium sized grants, the ~8 digit investment we were looking for to get started asap was not something that these kinds of orgs were general...

Why did you decide to start a separate org rather than joining forces with an existing org? I'm especially curious since state-of-the-art models are time-consuming/compute-intensive/infra-intensive to develop, and other orgs with safety groups already have that infrastructure. Also, it seems helpful to have high communication bandwidth between people working on alignment, in a way that is impaired by having many different orgs (especially if the org plans to be non-disclosure by default). Curious to hear how you are thinking about these things!

3Connor Leahy2y
We (the founders) have a distinct enough research agenda to most existing groups such that simply joining them would mean incurring some compromises on that front. Also, joining existing research orgs is tough! Especially if we want to continue along our own lines of research, and have significant influence on their direction. We can’t just walk in and say “here are our new frames for GPT, can we have a team to work on this asap?”. You’re right that SOTA models are hard to develop, but that being said, developing our own models is independently useful in many ways - it enables us to maintain controlled conditions for experiments, and study things like scaling properties of alignment techniques, or how models change throughout training, as well as being useful for any future products. We have a lot of experience in LLM development and training from EleutherAI, and expect it not to take up an inordinate amount of developer hours. We are all in favor of high bandwidth communication between orgs. We would love to work in any way we can to set these channels up with the other organizations, and are already working on reaching out to many people and orgs in the field (meet us at EAG if you can!). In general, all the safety orgs that we have spoken with are interested in this, and that’s why we expect/hope this kind of initiative to be possible soon.

How do you differ from Redwood?

Redwood is doing great research, and we are fairly aligned with their approach. In particular, we agree that hands-on experience building alignment approaches could have high impact, even if AGI ends up having an architecture unlike modern neural networks (which we don’t believe will be the case). While Conjecture and Redwood both have a strong focus on prosaic alignment with modern ML models, our research agenda has higher variance, in that we additionally focus on conceptual and meta-level research. We’re also training our own (large) models, but (we bel...

Are you planning to be in-person or have some folks working remotely? Other similar safety orgs don't seem that flexible with in-person requirements, so it'd be nice to have a place for alignment work for those outside of {SF, London}

We strongly encourage in person work - we find it beneficial to be able to talk over or debate research proposals in person at any time, it’s great for the technical team to be able to pair program or rubber duck if they’re hitting a wall, and all being located in the same city has a big impact on team building.

That being said, we don’t mandate it. Some current staff want to spend a few months a year with their families abroad, and others aren’t able to move to London at all. While we preferentially accept applicants who can work in person, we’re flexible, and if you’re interested but can’t make it to London, it’s definitely still worth reaching out.

What are people's timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful?

Today's language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising pre-requisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we'll be able to get useful alignment work from superhuman sub-deception agents for 5-10+ years or so. I'm very curious if others disagree here though

-8johnswentworth2y

I personally have pretty broad error bars; I think it's plausible enough that AI won't help with automating alignment that it's still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing for making use of. I also tend to think that current progress in language modeling seems to suggest that models will reach the point of being extremely helpful with alignment way before they become super scary.

Eliezer has consistently expressed c...

What are your thoughts for subfields of ML where research impact/quality depends a lot on having lots of compute?

In NLP, many people have the view that almost all of the high impact work has come from industry over the past 3 years, and that the trend looks like it will continue indefinitely. Even safety-relevant work in NLP seems much easier to do with access to larger models with better capabilities (Debate/IDA are pretty hard to test without good language models). Thus, safety-minded NLP faculty might end up in a situation where none of their direct wor...

2David Scott Krueger3y
You can try to partner with industry, and/or advocate for big government \$\$\$. I am generally more optimistic about toy problems than most people, I think, even for things like Debate. Also, scaling laws can probably help here.

Here is my Elicit Snapshot.

I'll follow the definition of AGI given in this Metaculus challenge, which roughly amounts to a single model that can "see, talk, act, and reason." My predicted distribution is a weighted sum of two component distributions described below:

1. Prosaic AGI (25% probability). Timeline: 2024-2037 (Median: 2029): We develop AGI by scaling and combining existing techniques. The most probable paths I can foresee loosely involves 3 stages: (1) developing a language model with human-level language ability, then (2) giving it visual capabiliti
...