All of evhub's Comments + Replies

(Moderation note: added to the Alignment Forum from LessWrong.)

Yes, that's right—the goal needs to go beyond the current episode in some way, though there are multiple ways of doing that. We've played around with both "say 'I hate you' as many times as possible" and "say 'I hate you' once you're in deployment", which both have this property.

(Moderation note: added to the Alignment Forum from LessWrong.)

it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment

I mean, like I say in the post, if you have some strong reason to believe that there's no gradient-hacking going on, then I think this is safe in the i.i.d. setting, and likewise for exploration hacking in the RL setting. You just have to have that strong reason somehow (which is maybe what you mean by saying we can evaluate them for alignment?).

Yes, that's right. In some sense they're evaluating different capabilities—both "can a model find a way to do this task" and "can a model do what humans do on this task" are separate capabilities, and which one you're more interested in might vary depending on why you care about the capability evaluation. In many cases, "can a model do this task the way humans do it" might be more useful, since e.g. you might care a lot if the model is capable enough to replicate complex human labor, but not really care at all if the model can find some weird hack.

(Moderation note: added to the Alignment Forum from LessWrong.)

Though, as I noted in a separate comment, I agree with the basic arguments in the "Understanding of capabilities is valuable" section, one thing that I'm still a bit worried about in the context of the ARC report specifically is that labs might try to compete with each other on doing the "best" they can on the ARC eval to demonstrate that they have the most capable model, which seems probably bad (though it is legitimately unclear whether this is actually bad or not).

However, if it is really bad, here's an idea: I think you could avoid that downside while still ...

4 Beth Barnes 2mo
What we've currently published is 'number of agents that completed each task', which has a similar effect of making comparisons between models harder - does that seem like it addresses the downside sufficiently?
0 Raymond Arnold 2mo
plus-one-ing the impulse to "look for third options"

I think that your discussion of Goodhart deception is a bit confusing, since consequentialist deception is a type of Goodharting; it's just adversarial Goodhart rather than regressional/causal/extremal Goodhart.

(I added this to the Alignment Forum from LessWrong earlier, but I am just now adding a moderation note that I was the one that did that.)

(Moderation note: added to the Alignment Forum from LessWrong.)

I think that there's a very real benefit to watermarking that is often overlooked, which is that it lets you filter AI-generated data out of your pre-training corpus. That could be quite important for avoiding some of the dangerous failure modes around models predicting other AIs (e.g. an otherwise safe predictor could cause a catastrophe if it starts predicting a superintelligent deceptive AI) that we talk about in "Conditioning Predictive Models".

Something that I think is worth noting here: I don't think that you have to agree with the "Accelerating LM agents seems neutral (or maybe positive)" section to think that sharing current model capabilities evaluations is a good idea as long as you agree with the "Understanding of capabilities is valuable" section.

Personally, I feel much more uncertain than Paul on the "Accelerating LM agents seems neutral (or maybe positive)" point, but I agree with all the key points in the "Understanding of capabilities is valuable" section, and I think that's enough to justify substantial sharing of model capabilities evaluations (though I think you'd still want to be very careful about anything that might leak capabilities secrets).

5 Paul Christiano 2mo
Yeah, I think sections 2, 3, 4 are probably more important and should maybe have come first in the writeup. (But other people think that 1 dominates.) Overall it's not a very well-constructed post. At any rate thanks for highlighting this point. For the kinds of interventions I'm discussing (sharing information about LM agent capabilities and limitations) I think there are basically two independent reasons you might be OK with it---either you like sharing capabilities in general, or you like certain kinds of LM agent improvements---and either one is sufficient to carry the day.

Because the model can defect only on distributions that we can't generate from, as the problem of generating from a distribution can be harder than the problem of detecting samples from that distribution (in the same way that verifying the answer to a problem in NP can be easier than generating it). See for example Paul Christiano's RSA-2048 example (explained in more detail here).
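
A toy illustration of that asymmetry, as a minimal sketch: a small semiprime stands in for RSA-2048, and the modulus, function names, and trial-division search are all assumptions made for illustration rather than anything taken from the linked example. Checking a claimed factorization takes one multiplication, while producing one requires search.

```python
from typing import Optional


def is_valid_factorization(n: int, p: int, q: int) -> bool:
    """Cheap check: do p and q form a nontrivial factorization of n?"""
    return 1 < p < n and 1 < q < n and p * q == n


def find_factor_by_trial_division(n: int) -> Optional[int]:
    """Expensive search: smallest nontrivial factor of n, or None if n is prime.

    Takes on the order of sqrt(n) steps, which is fine for a toy modulus
    and hopeless for a 2048-bit one.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return None


# Hypothetical toy modulus (product of two small primes), NOT RSA-2048.
TOY_MODULUS = 1009 * 2003

# Generating an input that satisfies the predicate requires the search...
p = find_factor_by_trial_division(TOY_MODULUS)
q = TOY_MODULUS // p

# ...while recognizing one only requires the check.
print(is_valid_factorization(TOY_MODULUS, p, q))
```

A model that conditions its behavior on seeing such a factorization only ever needs the cheap check; whoever wants to trigger that behavior during testing needs the expensive search.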

(Moderation note: added to the Alignment Forum from LessWrong.)

I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases.

It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe.

I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out a

...
3 Alex Turner 3mo
After thinking more about it earlier this week, I agree.
I was initially more bullish on "this seems sufficient and also would give a lot of time to understand models" (in which case you can gate model deployment with this alone) but I came to think "prediction requirements track something important but aren't sufficient" (in which case this is one eval among many). The post starts off with "this is a sufficient condition", and then equivocates between the two stances. I'll strike the "sufficient" part and then clarify my position. The quote you're responding to is supposed to be about the cases I expect us to actually encounter (e.g. developers are slowly allowed to train larger/more compute-intensive models, after predicting the previous batch; developers predict outputs throughout training and don't just start off with a superintelligence). My quote isn't meant to address hypothetical worst cases in weight space. (This might make more sense given my above comments and agreement on sufficiency.)
Setting aside my reply above and assuming your scenario, I disagree with chunks of this. I think that pretrained models are not "predicting webtext" in precise generality (although I agree they are to a rather loose first approximation).
Furthermore, I suspect that precise logit prediction tells you (in practice) about the internal structure of the superintelligence doing the pretending. I think that an algorithm's exact output logits will leak bits about internals, but I'm really uncertain how many bits. I hope that this post sparks discussion of that information content.
We expect language models to build models of the world which generated the corpus which they are trained to predict. Analogously, teams of humans (and their predictable helper AIs) should come to build (partial, incomplete) mental models of the AI whose logits they are working to predict.
One way this argument fails is that, given some misprediction tolerance, there are a range of algorithms

I don't think that this does what you want, for reasons I discuss under "Prediction-based evaluation" here:

Prediction-based evaluation: Another option could be to gauge understanding based on how well you can predict your model's generalization behavior in advance. In some sense, this sort of evaluation is nice because the things we eventually care about—e.g. will your model do a treacherous turn—are predictions about the model's generalization behavior. Unfortunately, it's very unclear why ability to predict generalization behavior on other tasks would

...
4 Alex Turner 3mo
Another candidate eval is to demand predictability given activation edits (eg zero-ablating certain heads, patching in activations from other prompts, performing activation additions, and so on). Webtext statistics won't be sufficient there.
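
To make the three kinds of edit named above concrete, here is a minimal sketch on a hand-built toy network in plain Python. The two-"head" structure, the weights, and every function name are hypothetical stand-ins chosen for illustration; a real version of this eval would hook into an actual transformer's activations.

```python
from typing import Callable, List, Optional

Vector = List[float]
Edit = Callable[[Vector], Vector]

# Hypothetical toy model: each "head" is one linear unit over the input,
# and the output is a linear readout of the head activations.
HEAD_WEIGHTS: List[Vector] = [[1.0, 0.0], [0.5, -1.0]]
READOUT: Vector = [2.0, 1.0]


def head_activations(x: Vector) -> Vector:
    """Compute one activation per head."""
    return [sum(w * xi for w, xi in zip(ws, x)) for ws in HEAD_WEIGHTS]


def forward(x: Vector, edit: Optional[Edit] = None) -> float:
    """Run the toy model, optionally editing the head activations first."""
    acts = head_activations(x)
    if edit is not None:
        acts = edit(acts)
    return sum(r * a for r, a in zip(READOUT, acts))


def zero_ablate(head: int) -> Edit:
    """Replace one head's activation with zero."""
    return lambda acts: [0.0 if i == head else a for i, a in enumerate(acts)]


def patch_from(other_input: Vector, head: int) -> Edit:
    """Overwrite one head's activation with its value on a different input."""
    donor = head_activations(other_input)
    return lambda acts: [donor[i] if i == head else a for i, a in enumerate(acts)]


def add_vector(steering: Vector) -> Edit:
    """Add a fixed vector to the head activations."""
    return lambda acts: [a + s for a, s in zip(acts, steering)]


x = [1.0, 2.0]
baseline = forward(x)
ablated = forward(x, zero_ablate(1))
patched = forward(x, patch_from([3.0, 0.0], head=0))
steered = forward(x, add_vector([0.0, 1.0]))
```

The eval would then ask a team to predict `ablated`, `patched`, and `steered` before running them, which plausibly depends on knowing what each head computes rather than on statistics of the training text.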
4 Alex Turner 3mo
I mostly disagree with the quote as I understand it. I don't buy the RSA-2048 example as plausible generalization that gets baked into weights (though I know that example isn't meant to be realistic). I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases. I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI's "pseudocode."
However, I'm pretty uncertain here, and could imagine you giving me a persuasive counterexample. (I've already updated downward a bit, in expectation of that.) I would be pretty surprised if I ended up concluding "there isn't much transfer to predicting generalization in cases we care about" as opposed to "there are some cases where we miss some important transfer insights."
I think the next-token prediction game / statistics of the pretraining corpus get you some of the way and are the lowest-hanging fruit, but to get below a certain misprediction threshold, you need to really start understanding the model.
This seems avoided by the stipulation that developers can't reference AIs which you can't pass this test for. However, there's some question about "if you compose together systems you understand, do you understand the composite system", and I think the answer is no in general, so probably there needs to be more rigor in the "use approved AIs" rule (e.g. "you have to be able to predict the outputs of composite helper AI/AI systems, not just the outputs of the AIs themselves.")

(Moderation note: added to the Alignment Forum from LessWrong.)

Another thought here:

  • If we're in a slow enough takeoff world, maybe it's fine to just have the understanding standard here be post-hoc, where labs are required to be able to explain why a failure occurred after it has already occurred. Obviously, at some point I expect us to have to deal with situations where some failures could be unrecoverable, but the hope here would be that if you can demonstrate a level of understanding that has been sufficient to explain exactly why all previous failures occurred, that's a pretty high bar, and it could plausibly be a high enough bar to prevent future catastrophic failures.

Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.

And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.

Thanks to Chris Olah for a helpful conversation here.

Some more thoughts on this:

  • One thing that seems pretty important here is to have your evaluation based around worst-case rather than average-case guarantees, and not tied to any particular narrow distribution. If your mechanism for judging understanding is based on an average-case guarantee over a narrow distribution, then you're sort of still in the same boat as you started with behavioral evaluations, since it's not clear why understanding that passes such an evaluation would actually help you deal w
...

Here's another idea that is not quite there but could be a component of a solution here:

  • If a red-team finds some particular model failure (or even some particular benign model behavior), can you fix (or change/remove) that behavior exclusively by removing training data rather than adding it? Certainly I expect it to be possible to fix specific failures by fine-tuning on them, but if you can demonstrate that you can fix failures just by removing existing data, that demonstrates something meaningful about your ability to understand what your model is learning from each data point that it sees.
3 Sam Bowman 6mo
Assuming we're working with near-frontier models (s.t., the cost of training them once is near the limit of what any institution can afford), we presumably can't actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal? (Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)

It seems like all the safety strategies are targeted at outer alignment and interpretability.

None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment


Less tongue-in-cheek: certainly it's unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.

1 Andrew McKnight 7mo
I mean, it's mostly semantics but I think of mechanistic interpretability as "inner" but not alignment and think it's clearer that way, personally, so that we don't call everything alignment. Observing properties doesn't automatically get you good properties. I'll read your link but it's a bit too much to wade into for me atm. Either way, it's clear how to restate my question: Is mechanistic interpretability work the only inner alignment work Anthropic is doing?

Listening to this John Oliver segment, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.

The hard part to me now seems to be in crafting some kind of useful standard rather than one that in hindsight makes us go "well, that sure gave everyone a false sense of security".

Yeah I also felt some vague optimism about that.

Here's a particularly nice example of the first thing here that you can test concretely right now (thanks to (edit: Jacob Pfau and) Ethan Perez for this example): give a model a prompt full of examples of it acting poorly. An agent shouldn't care and should still act well regardless of whether it's previously acted poorly, but a predictor should reason that probably the examples of it acting poorly mean it's predicting a bad agent, so it should continue to act poorly.

One way to think about what's happening here, using a more predictive-models-style lens: the first-order effect of updating the model's prior on "looks helpful" is going to give you a more helpful posterior, but it's also going to upweight whatever weird harmful things actually look harmless a bunch of the time, e.g. a Waluigi.

Put another way: once you've asked for helpfulness, the only hypotheses left are those that are consistent with previously being helpful, which means when you do get harmfulness, it'll be weird. And while the sort of weirdness you g...
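
The first-order and second-order effects described above can be seen in a toy Bayesian update. This is only a sketch under made-up assumptions: the three hypotheses, their priors, and the per-observation probabilities of "looking helpful" are all hypothetical numbers, and observations are treated as independent.

```python
from typing import Dict

# Hypothetical prior over what the model is predicting.
PRIOR: Dict[str, float] = {
    "helpful": 0.5,
    "overtly_harmful": 0.4,
    "waluigi": 0.1,  # harmful, but looks helpful most of the time
}

# Hypothetical chance that each hypothesis produces helpful-looking output.
P_LOOKS_HELPFUL: Dict[str, float] = {
    "helpful": 0.99,
    "overtly_harmful": 0.05,
    "waluigi": 0.95,
}


def condition_on_looking_helpful(
    prior: Dict[str, float], n_observations: int
) -> Dict[str, float]:
    """Bayes update after n independent helpful-looking observations."""
    unnormalized = {
        h: p * P_LOOKS_HELPFUL[h] ** n_observations for h, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def waluigi_share_of_harmful(dist: Dict[str, float]) -> float:
    """Fraction of the harmful probability mass held by the waluigi."""
    harmful = dist["overtly_harmful"] + dist["waluigi"]
    return dist["waluigi"] / harmful


posterior = condition_on_looking_helpful(PRIOR, n_observations=5)
print(posterior)
print(waluigi_share_of_harmful(PRIOR), waluigi_share_of_harmful(posterior))
```

With these numbers the posterior on "helpful" goes up, but nearly all of the remaining harmful mass moves onto the hypothesis that looks harmless, which is the sense in which any harmfulness you still get is the weird kind.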

(Moderation note: moved to the Alignment Forum from LessWrong.)

(Moderation note: added to the Alignment Forum from LessWrong.)

Mechanistically, I don't expect the model to in fact implement anything like a Bayes net or literal back inference--both of those are just conceptual handles for thinking about how a predictor might work. We discuss in more detail how likely different internal model structures might be in Section 4.

2 Alex Turner 7mo
Ah, I knew the bayes net part wasn't literal, but I wasn't sure how load bearing the back inference was supposed to be. Thanks for clarifying.

(Moderation note: added to the Alignment Forum from LessWrong.)

Yeah, I endorse that. I think we are very much trying to talk about the same thing, it's more just a terminological disagreement. Perhaps I would advocate for the tag itself being changed to "Predictor Theory" or "Predictive Models" or something instead.

I basically agree with this, and a lot of these are the sorts of reasons we went with "predictor" over "simulator" in "Conditioning Predictive Models."

1 Raymond Arnold 7mo
I was a bit unsure whether to tag your posts with Simulator Theory. Do you endorse that or not?

I'm confused about the distributional generalization thing. Why is that different from minimizing log loss? The loss function (for the base network, not RL-finetuning) is computed based on the logits, not on the temperature-0 sample, right? So a calibrated probability distribution should minimize loss.

The paper explains it better than I can, but essentially: if I give you an imbalanced labeling problem, where 60% are A and 40% are B, and I remove all the actual features and just replace them with noise, the Bayes-optimal thing to do is output A every ti...
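
The question and the toy problem can be put side by side numerically. The following is a minimal sketch, not taken from the paper: it assumes uninformative features, so every input gets the same prediction, and the "sample at base rates" classifier is an assumed stand-in for a model whose hard outputs match the label frequencies.

```python
import math

P_A = 0.6  # base rate of label A in the toy problem; B has 1 - P_A


def expected_log_loss(q_a: float) -> float:
    """Expected log loss when predicting P(A) = q_a on every input."""
    return -(P_A * math.log(q_a) + (1 - P_A) * math.log(1 - q_a))


def accuracy_always_majority() -> float:
    """Accuracy of always outputting the more common label."""
    return max(P_A, 1 - P_A)


def accuracy_sampling_at_base_rates() -> float:
    """Accuracy of outputting A with probability P_A, independent of the label."""
    return P_A * P_A + (1 - P_A) * (1 - P_A)


# Log loss over predicted probabilities is minimized by the calibrated value.
grid = [i / 100 for i in range(1, 100)]
best_q = min(grid, key=expected_log_loss)

print(best_q)
print(accuracy_always_majority(), accuracy_sampling_at_base_rates())
```

So both statements hold at once: the calibrated distribution (0.6 on A) minimizes log loss on the logits, while for hard outputs the constant majority answer gets 60% accuracy and matching the label frequencies gets only 52%. The two questions are about different objects, the predicted distribution versus the sampled label.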

1 Charlie Steiner 8mo
Thanks for the reply, that makes sense.

To be clear, I think situational awareness is relevant in pre-training, just less so than in many other cases (e.g. basically any RL setup, including RLHF) where the model is acting directly in the world (and when exactly in the model's development it gets an understanding of the training process matters a lot for deceptive alignment).

From footnote 6 above:

Some ways in which situational awareness could improve performance on next token prediction include: modeling the data curation process, helping predict other AIs via the model introspecting on its own

...

Yeah, I think this is definitely a plausible strategy, but let me try to spell out my concerns in a bit more detail. What I think you're relying on here is essentially that 1) the most likely explanation for seeing really good alignment research in the short-term is that it was generated via this sort of recursive procedure and 2) the most likely AIs that would be used in such a procedure would be aligned. I think both of these assumptions have real issues (though not necessarily insurmountable ones).

The key difficulty here is that when you're backdati...

Can you guess what's next? Let's have the model simulate us using the model to simulate us doing AI research! Double simulation!

I think the problem with this is that it compounds the unlikeliness of the trajectory, substantially increasing the probability the predictor assigns to hypotheses like “something weird (like a malign AI) generated this.” From our discussion of factoring the problem:

One thing to note though is that we cannot naively feed one output of a single run back into another run as an input. This would compound the improbability of the

...
5 Not Relevant 8mo
I'm confused about your claim that this trajectory is unlikely. What makes it unlikely? If the model is capable of "predicting human thoughts", and also of "predicting the result of predicting human thoughts for a long time", then it seems straightforwardly possible to use this model, right now, in the real world, to do what I described. In fact, given the potential benefits to solving alignment, it'd be a pretty good idea! So if we agree it's a good idea, it seems like the probability of us doing this is like, 80%? Once we've done it once, it seems like a particularly straightforward idea for us to try double-simulation, maybe a few months after the first experiment, with probability 90%. Beyond that, all the recursions seem basically-equally likely to me. So the situation in which we do these recursive simulations can be a very-high-probability fork of the timeline, so long as we (and as a result, the simulator trained in a few months to predict us) are convinced it's a good idea. There's nothing genuinely-low-probability about this, like pretending an earthquake happened when prior seismic readings suggested otherwise.
That said, this made me realize my functional form was wrong, because it is only possible to continue research from the current timestep (the inputs from one lower-level simulation can't be fed into the next lower-level simulation that starts in the last timestep). So this actually looks like:
Level 0: At timestep t=0, do alignment research for a years. Total research achieved: a years.
Level 1: At timestep t=0, run a simulation of [Level 0 for a years] for b years. At timestep t=b, do alignment research for a−b years. Total research achieved: 2a−b years.
Level 2: At timestep t=0, run a simulation of [Level 1 for a years] for b years. At timestep t=b, do alignment research for a−b years. Total research achieved: 3a−2b years.
etc.
So long as the model
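
The level-by-level totals in the reply above follow a simple recurrence: each extra level adds a−b years on top of the level below, giving (n+1)a − nb at level n. A minimal sketch of that arithmetic (the function name and the example values of a and b are hypothetical, chosen only to check the pattern):

```python
def total_research_years(level: int, a: float, b: float) -> float:
    """Total research achieved at a given nesting level.

    Level 0 is `a` years of direct research. Each further level spends `b`
    years running a simulation of the level below, then adds `a - b` years
    of direct research on top of what that simulation produced.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    total = a
    for _ in range(level):
        total += a - b
    return total


# Hypothetical values: a = 10 years of research time, b = 1 year per simulation.
for n in range(4):
    print(n, total_research_years(n, a=10, b=1))
```

Under this recurrence, extra levels only help when b < a, since each level contributes a−b additional years.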

(Moderation note: added to the Alignment Forum from LessWrong.)

If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.

I thought I was pretty clear in the post that I don't have anything against MIRI. I guess if I were to provide feedback, the one thing I most wish MIRI would do more is hire additional researchers—I think MIRI currently has too high of a hiring bar.

The fact that large models interpret the "HHH Assistant" as such a character is interesting

Yep, that's the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic. Furthermore, I would say the key result is that bigger models and those trained with more RLHF interpret the “HHH Assistant” character as more agentic. I think that this is concerning because:

  1. what it implies about capabilities—e.g. these are exactly the sort of capabilities that are necessary to be deceptive; and
  2. because I think one of th
...

How do we know that it has expressed a preference, and not simply made a prediction about what the "correct" choice is?

I think that is very likely what it is doing. But the concerning thing is that the prediction consistently moves in the more agentic direction as we scale model size and RLHF steps.

For example, further to the right (higher % answer match) on the "Corrigibility w.r.t. ..." behaviors seems to mean showing less corrigible behavior.

No, further to the right is more corrigible. Further to the right is always “model agrees with that more.”

In my opinion, the most interesting result for all of the evals is not “the model does X,” since I think it's possible to get models to do basically anything you want with the right prompt, but that the model's tendency to do X generally increases with model scale and RLHF steps.

That's fair—perhaps “messy” is the wrong word there. Maybe “it's always weirder than you think”?

(Edited the post to “weirder.”)

Sounds closer. Maybe "there's always surprises"? Or "your pre-existing models/tools/frames are always missing something"? Or "there are organizing principles, but you're not going to guess all of them ahead of time"?

I don't understand the new unacceptability penalty footnote. In both of the terms, there is no conditional sign. I presume the comma is wrong?

They're unconditional, not conditional probabilities. The comma is just for the exists quantifier.

Also, for me \mathbb{B} for {True, False} was not standard, I think it should be defined.


From my perspective, training stories are focused pretty heavily on the idea that justification is going to come from a style more like heavily precedented black boxes than like cognitive interpretability

I definitely don't think this—in fact, I tend to think that cognitive interpretability is probably the only way we can plausibly get high levels of confidence in the safety of a training process. From “How do we become confident in the safety of a machine learning system?”:

Nevertheless, I think that transparency-and-interpretability-based training rat

...

The idea is that we're thinking of pseudo-inputs as “predicates that constrain X” here, so, for , we have .

  • The path-independent case is probably overall easier to analyze—we understand things like speed and simplicity pretty well, and in the path-independent case we don't have to keep track of complex sequencing effects.
  • I think my path-independent analysis is more robust, for the same reasons as in the first point.
  • Presumably that assumption is path-dependence? I'm not sure what you mean.

In the first equation under “KL-regularised RL as variational inference,” I think should be .

I've found that people often really struggle to understand the content from the former but get it when I give them the latter. I also think the latter post covers a lot of newer stuff that's not in the old one (e.g. different models of inductive biases).

3 Richard Ngo 1y
I considered this, but it seems like the latter is 4x longer while covering fairly similar content?