All of Erik Jenner's Comments + Replies

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involve only a few vari... (read more)

I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:

  • If I understand correctly, we'd need a model that we're confident is a mesa-optimizer (and perhaps even deceptive, since mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are "thresholds" where slight changes have big effects on how dangerous a model is.
  • If there's a very strong inductive bias towards deception,
... (read more)
2 | janus | 1mo
These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points. Here are some notes about the JD's idea I made some time ago. There's some overlap with the things you listed.
  • Hypotheses / cruxes
    • (1) Policies trained on the same data can fall into different generalization basins depending on the initialization. https://arxiv.org/abs/2205.12411
      • Probably true; Alstro has found "two solutions w/o linear connectivity in a 150k param CIFAR-10 classifier" with different validation loss
      • Note: This is self-supervised learning with the exact same data. I think it's even more evident that you'll get different generalization strategies in RL runs with the same reward model, because even the training samples are not deterministic.
      • (1A) These generalization strategies correspond to differences we care about, like in the limit deceptive vs honest policies
    • (2) Generalization basins are stable across scale (and architectures?)
      • If so, we can scope out the basins of smaller models and then detect/choose basins in larger models
      • We should definitely see if this is true for current scales. AFAIK basin analysis has only been done for models that are very small compared to SOTA
      • If we find that basins are stable across existing scales that's very good news. However, we should remain paranoid, because there could still be phase shifts at larger scales. The hypothetical mesaoptimizers you describe are much more sophisticated and situationally aware than current models, e.g. "Every intelligent policy has an incentive to lie about sharing your values if it wants out of the box." Mesaoptimizers inside GPT-3 probably are not explicitly reasoning about being in a box at all,

No, I'm not claiming that. What I am claiming is something more like: there are plausible ways in which applying 30 nats of optimization via RLHF leads to worse results than best-of-exp(30) sampling, because RLHF might find a different solution that scores that highly on reward.

Toy example: say we have two jointly Gaussian random variables X and Y that are positively correlated (but not perfectly). I could sample 1000 pairs and pick the one with the highest X-value. This would very likely also give me an unusually high Y-value (how high depends on the corr... (read more)
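
The selection step in this toy example can be simulated directly. The following is a minimal sketch, not part of the original comment: the correlation of 0.5, the number of repeated trials, and the helper name `best_of_n` are all assumptions chosen for illustration.

```python
import numpy as np

def best_of_n(n: int, rho: float, trials: int, seed: int = 0):
    """Repeat `trials` times: draw n jointly Gaussian (X, Y) pairs with
    unit variances and correlation rho, then keep the pair with the
    highest X. Returns the selected X values and selected Y values."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    # samples has shape (trials, n, 2); last axis is (X, Y).
    samples = rng.multivariate_normal([0.0, 0.0], cov, size=(trials, n))
    best = samples[:, :, 0].argmax(axis=1)        # index of max X per trial
    picked = samples[np.arange(trials), best]     # shape (trials, 2)
    return picked[:, 0], picked[:, 1]

# rho = 0.5 is an arbitrary illustrative choice, not a value from the thread.
x_sel, y_sel = best_of_n(n=1000, rho=0.5, trials=2000)
mean_x, mean_y = x_sel.mean(), y_sel.mean()
print(f"mean selected X: {mean_x:.2f}")
print(f"mean selected Y: {mean_y:.2f}")
```

Under these assumptions the selected Y is well above its unconditional mean of 0 but lags the selected X, on average by a factor of about rho, which is the regressional gap that filtering on a proxy leaves behind.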

1 | Lawrence Chan | 3mo
Cool, I don't think we disagree here.

As a caveat, I didn't think of the RL + KL = Bayesian inference result when writing this; I'm much less sure now (and more confused).

Anyway, what I meant: think of the computational graph of the model as a causal graph, then changing the weights via RLHF is an intervention on this graph. It seems plausible there are somewhat separate computational mechanisms for producing truth and for producing high ratings inside the model, and RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correl... (read more)

1 | Lawrence Chan | 3mo
I think your claim is something like: As stated, this claim is false for LMs without top-p sampling or floating point rounding errors, since every token has a logit greater than negative infinity and thus a probability greater than actual 0. So with enough sampling, you'll find the RL trajectories. This is obviously a super pedantic point: RL finds sentences with cross entropy of 30+ nats with respect to the base distribution all the time, while you'll never do Best-of-exp(30)~=1e13. And there's an empirical question of how much performance you get versus how far your new policy is from the old one, e.g. if you look at Leo Gao's recent RLHF paper, you'll see that RL is more off distribution than BoN at equal proxy rewards. That being said, I do think you need to make more points than just "RL can result in incredibly implausible trajectories" in order to claim that BoN is safer than RL, since I claim that Best-of-exp(30) is not clearly safe either!
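
For a sense of scale, the arithmetic behind "Best-of-exp(30)~=1e13" can be checked in a few lines. This is a sketch added for illustration, not something from the thread: the helper `best_of_n_kl` uses the closed form log(n) - (n - 1)/n that is commonly quoted for the KL divergence of best-of-n sampling from the base distribution, which assumes a continuous reward with no ties.

```python
import math

def best_of_n_kl(n: float) -> float:
    """KL(best-of-n || base) in nats, using the commonly quoted closed
    form log(n) - (n - 1) / n. Assumption: continuous reward, no ties."""
    return math.log(n) - (n - 1) / n

n_samples = math.exp(30)          # number of samples in Best-of-exp(30)
kl_nats = best_of_n_kl(n_samples)

print(f"exp(30) is about {n_samples:.3e} samples")
print(f"best-of-exp(30) KL from base: about {kl_nats:.2f} nats")
```

Under that closed form, best-of-exp(30) sits at roughly 29 rather than exactly 30 nats from the base distribution; the one-nat difference does not change the point that around 1e13 samples per output is far out of reach.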

Thanks! Causal Goodhart is a good point, and I buy now that RLHF seems even worse from a Goodhart perspective than filtering. Just unsure by how much, and how bad filtering itself is. In particular:

In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful

This is the part I'm still not sure about. For example, maybe the simplest/apparently-easiest-to-understand answer that looks good to humans tends to be false. Then if human raters prefer simpler answers (because the... (read more)

1 | Lawrence Chan | 3mo
Can you explain why RLHF is worse from a Causal Goodhart perspective?

It's not clear to me that 3. and 4. can both be true assuming we want the same level of output quality as measured by our proxy in both cases. Sufficiently strong filtering can also destroy correlations via Extremal Goodhart (e.g. this toy example). So I'm wondering whether the perception of filtering being safer just comes from the fact that people basically never filter strongly enough to get a model that raters would be as happy with as a fine-tuned one (I think such strong filtering is probably just computationally intractable?)

Maybe there is some more... (read more)

4 | davidad (David A. Dalrymple) | 3mo
Extremal Goodhart relies on a feasibility boundary in U,V-space that lacks orthogonality, in such a way that maximal U logically implies non-maximal V. In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful—even though there are also maximally human-approved answers that are minimally useful! I think the feasible zone here looks pretty orthogonal, pretty close to a Cartesian product, so Extremal Goodhart won't come up in either near-term or long-term applications. Near-term, it's Causal Goodhart [https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy#Causal_Goodhart] and Regressional Goodhart [https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy#Regressional_Goodhart], and long-term, it might be Adversarial Goodhart [https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy#Adversarial_Goodhart]. Extremal Goodhart might come into play if, for example, there are some truths about what's useful that humans simply cannot be convinced of. In that case, I am fine with answers that pretend those things aren't true, because I think the scope of that extremal tradeoff phenomenon will be small enough to cope with for the purpose of ending the acute risk period. (I would not trust it in the setting of "ambitious value learning that we defer the whole lightcone to.") For the record, I'm not very optimistic about filtering as an alignment scheme either, but in the setting of "let's have some near-term assistance with alignment research", I think Causal Goodhart [https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy#Causal_Goodhart] is a huge problem for RLHF that is not a problem for equally powerful filtering. Regressional Goodhart will be a problem in any case, but it might be manageable given a training distribution of human origin.

Thanks, computing J not being part of step 1 helps clear things up.

I do think that "realistically defining the environment" is pretty closely related to being able to detect deceptive misalignment: one way J could fail due to deception would be if its specification of the environment is good enough for most purposes, but still has some differences to the real world which allow an AI to detect the difference. Then you could have a policy that is good according to J, but which still destroys the world when actually deployed.

Similar to my comment in the other... (read more)

2 | davidad (David A. Dalrymple) | 3mo
To the final question, for what it’s worth to contextualize my perspective, I think my inside-view is simultaneously:
  • unusually optimistic about formal verification
  • unusually optimistic about learning interpretable world-models
  • unusually pessimistic about learning interpretable end-to-end policies
1 | davidad (David A. Dalrymple) | 3mo
I agree, if there is a class of environment-behaviors that occur with nonnegligible probability in the real world but occur with negligible probability in the environment-model encoded in J, that would be a vulnerability in the shape of alignment plan I’m gesturing at. However, aligning a predictive model of reality to reality is “natural” compared to normative alignment. And the probability with which this vulnerability can actually be bad is linearly related to something like total variation distance [https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures] between the model and reality; I don’t know if this is exactly formally correct, but I think there’s some true theorem vaguely along the lines of: a 1% TV distance could only cause a 1% chance of alignment failure via this vulnerability. We don’t have to get an astronomically perfect model of reality to have any hope of its not being exploited. Judicious use of worst-case maximin approaches (e.g. credal sets rather than pure Bayesian modeling) will also help a lot with narrowing this gap, since it will be (something like) the gap to the nearest point in the set rather than to a single distribution.
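
The elementary fact underneath the "1% TV distance, 1% chance" intuition is that total variation distance bounds how much any single event's probability can differ between two distributions. The sketch below checks this by brute force on a small finite example. It is only an illustration: the four outcome classes and their probabilities are made up, and it says nothing about whether the full alignment-failure claim holds.

```python
from itertools import chain, combinations

def tv_distance(p: dict, q: dict) -> float:
    """Total variation distance between two distributions that share
    the same finite outcome set (same dict keys)."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

def all_events(outcomes):
    """Every subset of the outcome set, i.e. every possible event."""
    items = list(outcomes)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

# Hypothetical environment-model vs. reality; numbers are illustrative only.
model = {"a": 0.50, "b": 0.30, "c": 0.19, "d": 0.01}
reality = {"a": 0.495, "b": 0.30, "c": 0.19, "d": 0.015}

tv = tv_distance(model, reality)
# Largest disagreement about the probability of any single event.
max_gap = max(
    abs(sum(model[k] for k in ev) - sum(reality[k] for k in ev))
    for ev in all_events(model)
)
print(f"TV distance: {tv:.4f}")
print(f"largest per-event probability gap: {max_gap:.4f}")
```

So if some class of bad environment-behaviors has probability at most eps under the model, its probability in reality is at most eps plus the TV distance, at least in this finite setting.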

I see, that makes much more sense than my guess, thanks!

I'm pretty confused as to how some of the details of this post are meant to be interpreted. I'll focus on my two main questions, which would probably clear up the rest.

Reward Specification: Finding a policy-scoring function J such that (nearly–)optimal policies for that scoring function are desirable.

If I understand this and the next paragraphs correctly, then J takes in a complete description of a policy, so it also takes into account what the policy does off-distribution or in very rare cases, is that right? So in this decomposition, "reward s... (read more)

2 | davidad (David A. Dalrymple) | 3mo
To the second point, I meant something very different—I edited this sentence and hopefully it is more clear now. I did not mean that T should respect extensional equivalence of policies (if it didn’t, we could always simply quotient it by extensional equivalence of policies, since it outputs rather than inputs policies). Instead, I meant that a training story that involves mitigating your model-free learning algorithm’s unbounded out-of-distribution optimality gap by using some kind of interpretability loop where you’re applying a detector function to the policy to check for inner misalignment (and using that to guide policy search) has a big vulnerability: the policy search can encode similarly deceptive (or even exactly extensionally equivalent) policies in other forms which make the deceptiveness invisible to the detector. Respecting extensional equivalence is a bare-minimum kind of robustness to ask from an inner-misalignment detector that is load-bearing in an existential-safety strategy.
2 | davidad (David A. Dalrymple) | 3mo
Thanks, this is very helpful feedback about what was confusing. Please do ask more questions if there are still more parts that are hard to interpret. To the first point, yes, J evaluates π on all trajectories, even off-distribution. It may do this in a Bayesian way, or a worst-case way. I claim that J does not need to “detect deceptive misalignment” in any special way, and I’m not optimistic that progress on such detection is even particularly helpful, since incompetence can also be fatal, and deceptive misalignment could Red Queen Race ahead of the detector. Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. J can detect this by simply detecting bad stuff. If there’s a sneaky hard part of Reward Specification beyond the obvious hard part of defining what’s good and bad, it would be “realistically defining the environment.” (That’s where purely predictive models come in.)

I agree that aligned AI could also make humans irrelevant, but not sure how that's related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn't seem important for that argument, but maybe I'm misunderstanding what you're saying.

Interesting points, I agree that our response to part C doesn't address this well.

AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we... (read more)

2 | AdamGleave | 5mo
I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient. In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than the stone age where humans were more directly in control. However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.
0 | Luna Rimar | 5mo
This problem of human irrelevancy seems somewhat orthogonal to the alignment problem; even a maximally aligned AI will strip humans of their agency, as it knows best. Making the AI value human agency will not be enough; humans suck enough that the other objectives will override the agency penalty most of the time, especially in important matters.

Thanks for the interesting comments!

Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable".  I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").

Yeah, I didn't understand Katja's post as arguing (1), otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my... (read more)

2 | David Scott Krueger | 5mo
Responding in order: 1) yeah I wasn't saying it's what her post is about. But I think you can get to more interesting cruxy stuff by interpreting it that way. 2) yep it's just a caveat I mentioned for completeness. 3) Your spontaneous reasoning doesn't say that we/it get(/s) good enough at getting it to output things humans approve of before it kills us. Also, I think we're already at "we can't tell if the model is aligned or not", but this won't stop deployment. I think the default situation isn't that we can tell if things are going wrong, but people won't be careful enough even given that, so maybe it's just a difference of perspective or something... hmm.......

Thanks for the comments!

One can define deception as a type of distributional shift. [...]

I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.

2 | Johannes Treutlein | 5mo
Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well. E.g., solutions to capability generalization vs solutions to goal generalization.) I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm [https://www.alignmentforum.org/posts/CnruhwFGQBThvgJiX/formal-solution-to-the-inner-alignment-problem] works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would not go ahead and suddenly start maximizing some different goal function, but instead would query the human first). Stuart Armstrong might claim a similar thing about concept extrapolation? I personally think it is probably best to just try to work on deceptiveness directly instead of solving some more general problem and hoping non-deceptiveness is a side effect. It is probably harder to find a general solution than to solve only deceptiveness. Though maybe this depends on one's beliefs about what is easy or hard to do with deep learning.