All of RobertKirk's Comments + Replies

Causal confusion as an argument against the scaling hypothesis

I expect that these kinds of problems could mostly be solved by scaling up data and compute (although I haven't read the paper). However, the argument in the post is that even if we did scale up, we couldn't solve the OOD generalisation problems.

Causal confusion as an argument against the scaling hypothesis

Here we're saying that continual fine-tuning might not resolve causal confusion within the model; instead, it will help the model learn the (new) spurious correlations so that it still performs well on the test data. This assumes that continual fine-tuning uses a similar ERM-based method (e.g. the same pretraining objective, but on the new data distribution). In hindsight, we probably should have written "continual training" rather than specifically "continual fine-tuning". If you could continually train online in the deployment environment, that would be better, and whether that's enough is closely tied to whether online training is enough, which is one of the key open questions we mention.
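(For concreteness, here is a minimal sketch of the kind of ERM-based continual fine-tuning we have in mind, using hypothetical names and a generic PyTorch-style loop: the model is simply trained further with the same objective on batches from the new distribution, so nothing in the loop distinguishes causal features from whatever spurious correlations happen to hold in that distribution.)

```python
# Minimal sketch (hypothetical names) of "continual fine-tuning with a
# similar ERM-based method": train further on the new (deployment)
# distribution with the same objective used during pretraining.
import torch

def continual_finetune(model, pretrain_loss_fn, new_data_loader, steps=1000, lr=1e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    for inputs, targets in new_data_loader:
        # Same ERM objective, just evaluated on the new data distribution;
        # the loop has no mechanism for preferring causal over spurious features.
        loss = pretrain_loss_fn(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= steps:
            break
    return model
```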

A transparency and interpretability tech tree

The ability to go 1->4 or 2->5 via the behavioural-cloning approach would assume that all parts of the model are roughly equally difficult to interpret, and that it merely takes time for humans to interpret them all, so we can automate that work by imitating the humans. But if understanding the worst-case stuff is significantly harder than the best-case stuff (which seems likely to me), then I wouldn't expect the behaviourally-cloned interpretation agent to generalise to correctly interpreting the worst-case stuff.

A transparency and interpretability tech tree

Another point worth making here is why I haven’t separated out worst-case inspection transparency for deceptive models vs. worst-case training process transparency for deceptive models there. That’s because, while technically the latter is strictly more complicated than the former, I actually think that they’re likely to be equally difficult. In particular, I suspect that the only way that we might actually have a shot at understanding worst-case properties of deceptive models is through understanding how they’re trained.

I'd be curious to hear a bit mor... (read more)

Epistemological Vigilance for Alignment

This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

It seems to me that you want a word for whatever the opposite of a complex/chaotic system is, right? Although obviously "Simple" is probably not the best word (as it's very generic). It could be "Simple Dynamics" or "Predictable Dynamics"?

Epistemological Vigilance for Alignment

Newtonian: complex reactions

So please suggest alternative names and characterizations, or ask questions to pinpoint what I’m describing.

Are you pointing here at the fact that the AI training process and the world will form a complex system, and that as such it is hard to predict the outcomes of interventions, and hence the obvious first-order outcomes of interventions may not occur, or may be dominated by higher-order outcomes? That's what the "complex reactions" and some of the references kind of point at, but then in the description you seem to be talking mor... (read more)

Reply by Adam Shimi:
This points at the same thing IMO, although still in a confusing way. This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.

Someone at Conjecture proposed "linear" too, but Newtonian physics isn't linear. Although I agree that the sort of behavior and reaction I'm pointing out fits within the "non-linear" category.
RL with KL penalties is better seen as Bayesian inference

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world, or a model of some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented

... (read more)
Richard Ngo's Shortform

Another possible way to provide pressure towards using language in a human-like way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it becomes difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the sam... (read more)
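(A rough sketch of the mixing scheme I have in mind, with entirely hypothetical interfaces for the instructor/executor models: at each step a random instructor is paired with a random executor, so a drifted private code only pays off if every instructor and every executor drifts to the same new semantics at once.)

```python
# Hypothetical sketch of mixed instructor/executor fine-tuning.
# `instructors`, `executors`, `task` and `joint_update` are placeholder
# objects with made-up interfaces, not anything from the cited paper's code.
import random

def mixed_finetuning_step(instructors, executors, task, joint_update):
    # Sample a fresh instructor/executor pairing for this step, so no fixed
    # pair can privately agree on drifted semantics.
    instructor = random.choice(instructors)
    executor = random.choice(executors)
    instruction = instructor.describe(task)      # hypothetical interface
    action = executor.act(instruction)           # hypothetical interface
    reward = task.evaluate(action)
    joint_update(instructor, executor, reward)   # update only the sampled pair
    return reward
```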

RL with KL penalties is better seen as Bayesian inference

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations. This setting makes the distinction between a generative model and a policy clearer, and maybe changes the relevance of this statement:

The problem with the RL

... (read more)
Reply by Tomek Korbak:
That's a good point, and it helps to make a distinction between generative models and policies. In the interactive case, your policy π(a|s) is a conditional distribution. You can equivalently view it as a collection of unconditional distributions {π_s(a)}, one for each s, and for each of these you are likely to also have distribution collapse (a single best action for a given state). Arguably, that's what you want in RL.

So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world, or a model of some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?

But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented in language is of inherent value?
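(For anyone skimming this thread, here is my understanding of the standard result being discussed, in my own notation: π0 is the pretrained LM, β the KL coefficient, and the interactive case is treated per-state as a contextual bandit, as in the comment above; the full sequential case with environment transitions is more involved.)

```latex
% Pure RL objective: maximised by putting all probability mass on
% argmax_x r(x), i.e. distribution collapse.
\[ J_{\mathrm{RL}}(\pi) \;=\; \mathbb{E}_{x \sim \pi}\big[\, r(x) \,\big] \]

% KL-penalised objective: maximised by a non-degenerate reweighting of the
% pretrained distribution \pi_0 (a Gibbs / Bayesian-posterior form).
\[ J_{\beta}(\pi) \;=\; \mathbb{E}_{x \sim \pi}\big[\, r(x) \,\big]
   \;-\; \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_0\big),
   \qquad
   \pi^{*}(x) \;\propto\; \pi_0(x)\, \exp\big(r(x)/\beta\big) \]

% Per-state (contextual-bandit) view of the interactive case: the same form
% holds for each state s separately, and per-state collapse reappears as
% \beta \to 0.
\[ \pi^{*}(a \mid s) \;\propto\; \pi_0(a \mid s)\, \exp\big(r(s,a)/\beta\big) \]
```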
Reshaping the AI Industry

That's my guess also, but I'm asking more in case that's not so and he does disagree with (for example) the Pragmatic AI Safety sequence, in which case I'd like to know why.

Reply by Evan Hubinger:
I was referring to stuff like this [https://www.lesswrong.com/posts/j9Q8bRmwCgXRYAgcJ/miri-announces-new-death-with-dignity-strategy] , this [https://www.lesswrong.com/posts/wrkEnGrTTrM2mnmGa/retracted-it-s-time-for-ea-leadership-to-pull-the-short] , and this [https://www.lesswrong.com/posts/kipMvuaK3NALvFHc9/what-an-actually-pessimistic-containment-strategy-looks-like] . I haven't finished it yet, but I've so far very much enjoyed the Pragmatic AI Safety sequence [https://www.alignmentforum.org/s/FaEBwhhe3otzYKGQt], though I certainly have disagreements with it.
Reshaping the AI Industry

I'd be curious to hear your thoughts on the other conversations, or at least which conversations specifically you're not a fan of.

My guess is that Evan dislikes the apocalyptic/panicky conversations that people have recently been having on LessWrong.

How can Interpretability help Alignment?

(Note: you're quoting your own response as well as the sentence you meant to be quoting (and responding to), which makes it hard to see which part is your writing. I think you need two newlines to break the quote formatting.)

Do you see a way of incentivizing the RL community to change this? (If possible, that would seem like a more effective approach than doing it "ourselves".)

I think this is basically the same question as how we incentivise the wider ML community to think safety is important. I don't know if there's anything specific about the RL community which

... (read more)