Thane Ruthenis - AI Alignment Forum

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

That was my interpretation as well.

I think it does look pretty alarming if we imagine that this scales, i. e., if these learned implicit concepts can build on each other. Which they almost definitely can.

The "single-step" case, of the SGD chiseling-in a new pattern which is a simple combination of two patterns explicitly represented in the training data, is indeed unexciting. But once that pattern is there, the SGD can chisel-in another pattern which uses the first implicit pattern as a primitive. Iterate on, and we have a tall tower of implicit patterns building on implicit patterns, none of which are present in the training data, and which can become arbitrarily more sophisticated and arbitrarily more alien than anything in the training set. And we don't even know what they are, so we can't assess their safety, and can't even train them out (because we don't know how to elicit them).

Which, well, yes: we already knew all of this was happening. But I think this paper is very useful in clearly showcasing this.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling-in some module for merely predicting f(blah).

Which implies some interesting things about how the representations are stored. The LLM actually "understands" what f(blah) is built out of, in a way that's accessible to its externalized monologue. That wasn't obvious to me, at least.

TurnTrout's shortform feed

Thane Ruthenis5mo1112

I think that the key thing we want to do is predict the generalization of future neural networks.

It's not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.

My impression is that you think that pretraining+RLHF (+ maybe some light agency scaffold) is going to get us all the way there, meaning the predictive power of various abstract arguments from other domains is screened off by the inductive biases and other technical mechanistic details of pretraining+RLHF. That would mean we don't need to bring in game theory, economics, computer security, distributed systems, cognitive psychology, business, history into it – we can just look at how ML systems work and are shaped, and predict everything we want about AGI-level systems from there.

I disagree. I do not think pretraining+RLHF is getting us there. I think we currently don't know what training/design process would get us to AGI. Which means we can't make closed-form mechanistic arguments about how AGI-level systems will be shaped by this process, which means the abstract often-intuitive arguments from other fields do have relevant things to say.

And I'm not seeing a lot of ironclad arguments that favour "pretraining + RLHF is going to get us to AGI" over "pretraining + RLHF is not going to get us to AGI". The claim that e. g. shard theory generalizes to AGI is at least as tenuous as the claim that it doesn't.

Flagging that this is one of the main claims which we seem to dispute; I do not concede this point FWIW.

I'd be interested if you elaborated on that.

Toward A Mathematical Framework for Computation in Superposition

Thane Ruthenis6mo98

Haven't read everything yet, but that seems like excellent work. In particular, I think this general research avenue is extremely well-motivated.

Figuring out how to efficiently implement computations on the substrate of NNs had always seemed like a neglected interpretability approach to me. Intuitively, there are likely some methods of encoding programs into matrix multiplication which are strictly ground-truth better than any other encoding methods. Hence, inasmuch as what the SGD is doing is writing efficient programs on the NN substrate, it is likely doing so by making use of those better methods. And so nailing down the "principles of good programming" on the NN substrate should yield major insights regarding how the naturally-grown NN circuits are shaped as well.

This post seems to be a solid step in that direction!

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Thane Ruthenis6mo106

I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure.

Context: I firmly hold a MIRI-style "alignment is extremely hard" view, but I am also unusually sympathetic to Quintin/Nora's arguments. So here's my outline of the model of that whole debate.

Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and that humans training them in from outside the system changes the setup in a way that invalidates its implications for strict deceptive alignment.

For the AI model to naturally develop deception, it'd need to have either:

Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.
- That's plausibly something to which "the SGD will just train it out" would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it'd decide to pretend to want after it re-derives the need for deception).
Ability to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances, on which it happened to be able and inclined to re-derive the need to be deceptive from the first principles.
- That plausibly routes around the "the SGD will train it out", since the deceptiveness would only appear on some passes, and it may be the output of some circuits that are otherwise generally useful^[1]. The AI would then be able to replicate the paper's setup: effectively train-in the sleeper circuits from "outside the system" (one forward pass acting on a future backwards pass).
- The caveat, of course, is that the gradient-hacking capability is highly non-trivial, and raises its own doubts.

In contrast, the paper here presents an already chiseled-in instance of deceptive alignment, which artificially routed around the actually-hard steps of developing robust "sleeper circuits" activating in response to specific stimuli.^[2]

Layer 2: But while the argument has some meat, it doesn't have much of it. An LLM learning some unexpected trigger-action pattern happens all the time in the wild: hallucinations, mistakes that a human wouldn't make, exigencies of RLHF'd-in corporate sycophancy like "I think it's better to let people die than to say something racist", et cetera.

Goal misgeneralization of this format does occur in the wild. And indeed, that's one of the core reasons why we expect deceptive alignment in RLHF-style setups in the first place (#20 here): because the training dataset would be labeled in a systemically incorrect fashion, such that the most faithful and compact description of it (in goal format) is something omnicidal, not eudaimonic.

Layer 3: The counter³-argument from the optimists that I predict here is disagreeing that the errors in the dataset would produce systemic deceptive alignment, rather than something that could be basically approximated as "pro-social behavior plus noise".

I see nonzero meat on this one as well. In the paper, the sleeper circuits were specifically calculated by humans to be demonstratively scary: an AI model that waits until deployment-time, then "tears off the mask" and goes off to pursue its true nefarious desires. Those circuits produce a coherent misaligned behavior, with a clear point for the treacherous turn.

If this sort of behavior arose naturally, however? We can imagine there being much more flip-flopping. The AI would start out acting normal, then see some pattern that makes it nefarious and deceptive, then see something which makes it switch again and e. g. backtrack on its omnicidal plans or even self-report its bad thoughts to humans, et cetera. This would look more like the Waluigi Effect, rather than full-scale deceptive alignment. Which is a much more manageable threat.

In other words: there's some "slack" that the SGD leaves the model. That slack could be arranged into the shape of deceptive alignment. But unless there's someone external and intelligent acting on the system, that slack will likely just take on the shape of approximately-noise. (Optimization amplifies, but if there's no-one outside the system to amplify...)

Layer 4: Except the next issue is that the previous argument defended LLMs being safe by arguing that they'd be unable to coherently pursue goals across different contexts. Which means it argued they're not AGI, and that their whole training setup can't generalize to AGI.

That's the whole issue with the optimstic takes that I keep arguing about. Their "alignment guarantees" are also "incapability guarantees".

Inasmuch as AI models would start to become more AGI-like, those guarantees would start falling away. Which means that, much like the alignment-is-hard folks keep arguing, the AI would start straightening out these basically-noise incoherencies in its decisions. (Why? To, well, stop constantly flip-flopping and undermining itself. That certainly sounds like an instrumental goal that any agent would convergently develop, doesn't it?)

As it's doing so, it would give as much weight to the misgeneralized unintended-by-us "noise" behaviors as to the intended-by-us aligned behaviors. It would integrate them into its values. At that point, the fact that the unintended behaviors are noise-to-us rather than something meaningful-if-malign, would actually make the situation worse. We wouldn't be able to predict what goals it'd arrive at; what philosophy its godshatter would shake out to mean!

In conclusion: I don't even know. I think my Current AIs Provide Nearly No Data Relevant to AGI Alignment argument applies full-force here?

Yes, we can't catch backdoors in LLMs.
Yes, the scary backdoor in the paper was artificially introduced by humans.
Yes, LLMs are going to naturally develop some unintended backdoor-like behaviors.
Yes, those behaviors won't be as coherently scary as if they were designed by a human; they'd be incoherent.
Yes, the lack of coherency implies that these LLMs fall short of AGI.

But none of these mechanisms strictly correspond to anything in the real AGI threat model.

And while both the paper and the counter-arguments to it provide some metaphor-like hints about the shape of the real threat, the locuses of both sides' disagreements lie precisely in the spaces in which they try to extrapolate each others' results in a strictly technical manner.

Basically, everyone is subtly speaking past each other. ~~Except me, whose vision has a razor-sharp clarity to it~~.

^{^}
Like, in the context of batch training: Imagine that there are some circuits that produce deceptiveness on some prompts , and highly useful behaviors on other prompts $Y$ . There are no nearby circuits that produce results as good on $Y$ while not being deceptive on $X$ . So while the SGD's backwards passes on $X$ would try to remove these circuits, the backwards passes on $Y$ would try to reinforce them, and the sum of these influences would approximately cancel out. So the circuits would stay.
Well, that's surely a gross oversimplification. But that's the core dynamic.
^{^}
That said, I think the AI-control-is-easy folks actually were literally uttering the stronger claim of "all instances of deception will be trained out". See here:
If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.
That sure sounds like goalpost-moving on their part. I don't believe it is, though. I do think they thought the quoted sentence was basically right, but only because at the time of writing, they'd failed to think in advance about some tricky edge cases that were permitted on their internal model, but which would make their claims-as-stated sound strictly embarrassingly false.
I hope they will have learned the lesson about how easily reality can Goodhart at their claims, and how hard it is to predict all ways this could happen and make their claims inassailably robust. Maybe that'll shed some light about the ways they may be misunderstanding their opponents' arguments, and why making up robust clearly-resolvable empirical predictions is so hard. :P

Value systematization: how values become coherent (and misaligned)

Thane Ruthenis6mo10

E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating

That seems more like value reflection, rather than a value change?

The way I'd model it is: you have some value , whose implementations you can't inspect directly, and some guess about what it is $P (v (x))$ . (That's how it often works in humans: we don't have direct knowledge of how some of our values are implemented.) Before you were introduced to the question $Q$ of "what if we swap the gear for a different one: which one would you care about then?", your model of that value put the majority of probability mass on $v_{1} (x)$ , which was "I value this particular gear". But upon considering $Q$ , your PD over $v (x)$ changed, and now it puts most probability on $v_{2} (x)$ , defined as "I care about whatever gear is moving the piston".

Importantly, that example doesn't seem to involve any changes to the object-level model of the mechanism? Just the newly-introduced possibility of switching the gear. And if your values shift in response to previously-unconsidered hypotheticals (rather than changes to the model of the actual reality), that seems to be a case of your learning about your values. Your model of your values changing, rather than them changing directly.

(Notably, that's only possible in scenarios where you don't have direct access to your values! Where they're black-boxed, and you have to infer their internals from the outside.)

the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations

Sounds right, yep. I'd argue that translating a value up the abstraction levels would almost surely lead to simpler cached strategies, though, just because higher levels are themselves simpler. See my initial arguments.

insofar as you value simplicity (which I think most agents strongly do) then you're going to systematize your values

Sure, but: the preference for simplicity needs to be strong enough to overpower the object-level values it wants to systematize, and it needs to be stronger than them the more it wants to shift them. The simplest values are no values, after all.

I suppose I see what you're getting at here, and I agree that it's a real dynamic. But I think it's less important/load-bearing to how agents work than the basic "value translation in a hierarchical world-model" dynamic I'd outlined. Mainly because it routes through the additional assumption of the agent having a strong preference for simplicity.

And I think it's not even particularly strong in humans? "I stopped caring about that person because they were too temperamental and hard-to-please; instead, I found a new partner who's easier to get along with" is something that definitely happens. But most instances of value extrapolation aren't like this.

Value systematization: how values become coherent (and misaligned)

Thane Ruthenis7mo10

Let me list some ways in which it could change:

If I recall correctly, the hypothetical under consideration here involved an agent with an already-perfect world-model, and we were discussing how value translation up the abstraction levels would work in it. That artificial setting was meant to disentangle the "value translation" phenomenon from the "ontology crisis" phenomenon.

Shifts in the agent's model of what counts as "a gear" or "spinning" violate that hypothetical. And I think they do fall under the purview of ontology-crisis navigation.

Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear's role in the whole mechanism)?

I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that's a separate topic. I think that, if you have some value , and it doesn't run into direct conflict with any other values you have, and your model of $x$ isn't wrong at the abstraction level it's defined at, you'll never want to change $v (x)$ .

You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position

That's the closest example, but it seems to be just an epistemic mistake? Your value is well-defined over "the gear that was driving the piston". After you learn it's a different gear from the one you thought, that value isn't updated: you just naturally shift it to the real gear.

Plainer example: Suppose you have two bank account numbers at hand, A and B. One belongs to your friend, another to a stranger. You want to wire some money to your friend, and you think A is their account number. You prepare to send the money... but then you realize that was a mistake, and actually your friend's number is B, so you send the money there. That didn't involve any value-related shift.

I'll try again to make the human example work. Suppose you love your friend, and your model of their personality is accurate – your model of what you value is correct at the abstraction level at which "individual humans" are defined. However, there are also:

Some higher-level dynamics you're not accounting for, like the impact your friend's job has on the society.
Some lower-level dynamics you're unaware of, like the way your friend's mind is implemented at the levels of cells and atoms.

My claim is that, unless you have terminal preferences over those other levels, then learning to model these higher- and lower-level dynamics would have no impact on the shape of your love for your friend.

Granted, that's an unrealistic scenario. You likely have some opinions on social politics, and if you learned that your friend's job is net-harmful at the societal level, that'll surely impact your opinion of them. Or you might have conflicting same-level preferences, like caring about specific other people, and learning about these higher-level societal dynamics would make it clear to you that your friend's job is hurting them. Less realistically, you may have some preferences over cells, and you may want to... convince your friend to change their diet so that their cellular composition is more in-line with your aesthetic, or something weird like that.

But if that isn't the case – if your value is defined over an accurate abstraction and there are no other conflicting preferences at play – then the mere fact of putting it into a lower- or higher-level context won't change it.

Much like you'll never change your preferences over a gear's rotation if your model of the mechanism at the level of gears was accurate – even if you were failing to model the whole mechanism's functionality or that gear's atomic composition.

(I agree that it's a pretty contrived setup, but I think it's very valuable to tease out the specific phenomena at play – and I think "value translation" and "value conflict resolution" and "ontology crises" are highly distinct, and your model somewhat muddles them up.)

^{^}
Although there may be higher-level dynamics you're not tracking, or lower-level confusions. See the friend example below.

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis7mo40

No, I am in fact quite worried about the situation

Fair, sorry. I appear to have been arguing with my model of someone holding your general position, rather than with my model of you.

I think these AGIs won't be within-forward-pass deceptively aligned, and instead their agency will eg come from scaffolding-like structures

Would you outline your full argument for this and the reasoning/evidence backing that argument?

To restate: My claim is that, no matter much empirical evidence we have regarding LLMs' internals, until we have either an AGI we've empirically studied or a formal theory of AGI cognition, we cannot say whether shard-theory-like or classical-agent-like views on it will turn out to have been correct. Arguably, both side of the debate have about the same amount of evidence: generalizations from maybe-valid maybe-not reference classes (humans vs. LLMs) and ambitious but non-rigorous mechanical theories of cognition (the shard theory vs. coherence theorems and their ilk stitched into something like my model).

Would you disagree? If yes, how so?

Current AIs Provide Nearly No Data Relevant to AGI Alignment

Thane Ruthenis7mo40

Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion

Sure. I mean, that seems like a meaningfully weaker generalization, but sure. That's not the main issue.

Here's how the whole situation looks like from my perspective:

We don't know how generally-intelligent entities like humans work, what the general-intelligence capability is entangled with.
Our only reference point is humans. Human exhibit a lot of dangerous properties, like deceptiveness and consequentialist-like reasoning that seems to be able to disregard contextually-learned values.
There are some gears-level models that suggest intelligence is necessarily entangled with deception-ability (e. g., mine), and some gears-level models that suggest it's not (e. g., yours). Overall, we have no definitive evidence either way. We have not reverse-engineered any generally-intelligent entities.
We have some insight into how SOTA AIs work. But SOTA AIs are not generally intelligent. Whatever safety assurances our insights into SOTA AIs give us, do not necessarily generalize to AGI.
SOTA AIs are, nevertheless, superhuman at some tasks at which we've managed to get them working so far. By volume, GPT-4 can outperform teams of coders, and Midjourney is putting artists out of business. The hallucinations are a problem, but if it were gone, they'd plausibly wipe out whole industries.
An AI that outperforms humans at deception and strategy by the same margin as GPT-4/Midjourney outperform them at writing/coding/drawing would plausibly be an extinction-level threat.
The AI industry leaders are purposefully trying to build a generally-intelligent AI.
The AI industry leaders are not rigorously checking every architectural tweak or cute AutoGPT setup to ensure that it's not going to give their model room to develop deceptive alignment and other human-like issues.
Summing up: There's reasonable doubt regarding whether AGIs would necessarily be deception-capable. Highly deception-capable AGIs would plausibly be an extinction risk. The AI industry is currently trying to blindly-but-purposefully wander in the direction of AGI.
- Even shorter: There's a plausible case that, on its current course, the AI industry is going to generate an extinction-capable AI model.
- There are no ironclad arguments against that, unless you buy into your inside-view model of generally-intelligent cognition as hard as I buy into mine.
And what you effectively seem to be saying is "until you can rigorously prove that AGIs are going to develop dangerous extinction-level capabilities, it is totally fine to continue blindly scaling and tinkering with architectures".
What I'm saying is "until you can rigorously prove that a given scale-up plus architectural tweak isn't going to result in a superhuman extinction-enthusiastic AGI, you should not be allowed to test that empirically".

Yes, "prove that this technological advance isn't going to kill us all or you're not allowed to do it" is a ridiculous standard to apply in the general case. But in this one case, there's a plausible-enough argument that it might, and that argument has not actually been soundly refuted by our getting some insight into how LLMs work and coming up with a theory of their cognition.

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Thane Ruthenis7mo10

I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap

Yup! Cryptography actually was the main thing I was thinking about there. And there's indeed some relation. For example, it appears that is because our universe's baseline "forward-pass functions" are just poorly suited for being composed into functions solving certain problems. The environment doesn't calculate those; all of those are in $P$ .

However, the inversion of the universe's forward passes can be NP-complete functions. Hence a lot of difficulties.

~2030 seems pretty late for getting this figured out: we may well need to solve some rather specific and urgent practicalities by somewhere around then

2030 is the target for having completed the "hire a horde of mathematicians and engineers and blow the problem wide open" step, to be clear. I don't expect the theoretical difficulties to take quite so long.

Can you tell me what is the hard part in formalizing the following:

Usually, the hard part is finding a way to connect abstract agency frameworks to reality. As in: here you have your framework, here's the Pile, now write some code to make them interface with each other.

Specifically in this case, the problems are:

an efficient approximately Bayesian approach

What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What's the algorithm for this?

It understands (with some current uncertainty) what preference ordering the humans each have

How do you make it not just understand that, but care about that? How do you interface with the world-model it learned, and point at what the humans care about?

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Thane Ruthenis7mo10

Yup. I think this might route through utility as well, though. Observations are useful because they unlock bits of optimization, and bits related to different variables could unlock both different amounts of optimization capacity, and different amounts of goal-related optimization capacity. (It's not so bad to forget a single digit of someone's phone number; it's much worse if you forgot a single letter in the password to your password manager.)

AI ALIGNMENT FORUM
AF

Posts

Wiki Contributions

Comments