A thing I really like about the approach in this paper is that it makes use of a lot more of the model's knowledge of human values than traditional RLHF approaches. Pretrained LLMs already know a ton of what humans say about human values, and this seems like a much more direct way to point models at that knowledge than binary feedback on samples.
I might be totally wrong here, but could this approach be used to train models that are more likely to be myopic (than e.g. models trained with existing RL reward functions)? I'm thinking specifically of the form of myopia that says "only care about the current epoch", which you could train for by (1) indexing epochs, (2) giving the model access to its epoch index, (3) having the reward function go negative past a certain epoch, and (4) giving the model the ability to shut down. Then you could maybe make a model that only wants to run for a few epochs before shutting off, which might help avoid cross-epoch optimization?
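The reward-shaping part of this scheme (steps 1–3) is simple enough to sketch. Everything here (the function name, the cutoff, the penalty size) is my own illustrative assumption, not anything from the paper:

```python
def epoch_indexed_reward(base_reward: float, epoch: int,
                         cutoff_epoch: int = 5, penalty: float = 10.0) -> float:
    """Hypothetical reward for the scheme above: the model is shown its
    epoch index, and any reward past the cutoff is negative, so a policy
    that cares about later epochs is penalized for running long."""
    if epoch > cutoff_epoch:
        return -penalty
    return base_reward

# A policy that shuts down at or before the cutoff keeps its reward;
# one that keeps running starts accruing negative reward.
print(epoch_indexed_reward(1.0, epoch=3))  # 1.0
print(epoch_indexed_reward(1.0, epoch=7))  # -10.0
```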
Yeah. Or maybe not even to zero, but just that it isn't increasing.
Could it be that Chris's diagram gets recovered if the vertical scale is "total interpretable capabilities"? Like maybe tiny transformers are more interpretable in that we can understand ~all of what they're doing, but they're not doing much, so maybe it's still the case that the amount of capability we can understand has a valley and then a peak at higher capability.
So indeed with cross-entropy loss I see two plateaus! Here's rank 2:
(note that I've offset the loss so that equality of Z and C is zero loss)
I have trouble getting rank 10 to find the zero-loss solution:
But the phenomenology at full rank is unchanged:
Woah, nice! Note that I didn't check rank 1 with Adam, just rank >= 2.
Erm, do C and Z have to be valid normalized probabilities for this to work?
(with the caveat that this is still "I tried a few times" and not any quantitative study)
It's a good caution, but I do see more bumps with Adam than with SGD across a number of random initializations.
Something like this?
```python
import torch

def loss(learned, target):
    # Cross-entropy between the softmax of the target and the softmax
    # of the learned matrix (each flattened into a single distribution).
    p_target = torch.exp(target)
    p_target = p_target / torch.sum(p_target)
    p_learned = torch.exp(learned)
    p_learned = p_learned / torch.sum(p_learned)
    return -torch.sum(p_target * torch.log(p_learned))
```
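And here's a minimal sketch of the kind of experiment I'm running with that loss: learn a low-rank factorization U @ V.T whose softmax matches a fixed target's softmax. The sizes, rank, learning rate, and step count here are my own guesses for illustration, not the original settings:

```python
import torch

torch.manual_seed(0)

# Random target matrix and a rank-2 factorization to learn.
n, rank = 8, 2
target = torch.randn(n, n)
U = torch.randn(n, rank, requires_grad=True)
V = torch.randn(n, rank, requires_grad=True)

def loss(learned, target):
    # Cross-entropy between softmaxed target and softmaxed learned matrix.
    p_target = torch.exp(target) / torch.sum(torch.exp(target))
    p_learned = torch.exp(learned) / torch.sum(torch.exp(learned))
    return -torch.sum(p_target * torch.log(p_learned))

opt = torch.optim.Adam([U, V], lr=1e-2)
history = []
for _ in range(500):
    opt.zero_grad()
    l = loss(U @ V.T, target)
    l.backward()
    opt.step()
    history.append(l.item())
# Subtracting the target's entropy would offset this so that the global
# optimum (learned softmax == target softmax) sits at exactly zero loss.
```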
I'd be very excited to see a reproduction :-)
I agree with both of your rephrasings and I think both add useful intuition!
Regarding rank 2, I don't see any difference in behavior from rank 1 other than the "bump" in alignment that Lawrence mentioned. Here's an example:
This doesn't happen in all rank-2 cases but is relatively common. I think usually each vector grows primarily towards one or the other target. If two vectors grow towards the same target then you get this bump where one of them has to back off and align more towards a different target [at least that's my current understanding, see my reply...]
I don't, but here's my best guess: there's a sense in which there's competition among vectors for which learned vectors capture which parts of the target span.
As a toy example, suppose there are two vectors, a1 and a2, such that the closest target vector to each of these at initialization is c. Then both vectors might grow towards c. At some point c is represented enough in the span, and it's not optimal for two vectors to both play the role of representing c, so it becomes optimal for at least one of them to s...
This is really interesting! One extension that comes to mind: SVD will never recover a Johnson-Lindenstrauss packing, because SVD can only return as many vectors as the rank of the relevant matrix. But you can do sparse coding to e.g. construct an overcomplete basis of vectors such that typical samples are sparse combinations of those vectors. Have you tried/considered trying something like that?
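For concreteness, here's roughly what I have in mind, as a sketch: more ground-truth directions than dimensions, samples that are sparse combinations of them, and a dictionary learner recovering an overcomplete basis. The sizes, sparsity level, and use of scikit-learn's `DictionaryLearning` are all my own arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# More ground-truth directions than dimensions: an SVD of the data matrix
# could return at most d = 8 vectors, but there are 16 "features" here.
n_features, d, n_samples = 16, 8, 500
true_features = rng.normal(size=(n_features, d))
true_features /= np.linalg.norm(true_features, axis=1, keepdims=True)

# Each sample is a sparse combination of ~2 of the 16 features.
mask = rng.random((n_samples, n_features)) < 2 / n_features
codes = rng.normal(size=(n_samples, n_features)) * mask
X = codes @ true_features

# Sparse coding can return an overcomplete basis: 16 atoms in 8 dims.
dl = DictionaryLearning(n_components=n_features,
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=2,
                        max_iter=50, random_state=0)
dl.fit(X)
atoms = dl.components_  # shape (16, 8): twice as many directions as SVD
```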
Ah that's right. Will edit to fix.
Thanks for these thoughts!
Although it would be useful to have the plotting code as well, if that's easy to share?
Sure! I've just pushed the plot helper routines we used, as well as some examples.
I agree that N (true feature dimension) > d (observed dimension), and that sparsity will be high, but I'm uncertain whether the other part of the regime (that you don't mention here), that k (model latent dimension) > N, is likely to be true. Do you think that is likely to be the case? As an analogy, I think the intermediate feature dimensions in MLP layers i
Sorry for my confusion about something so silly, but shouldn't the following be "when α⩽2"?
Oh you're totally right. And k=1 should be k=d there. I'll edit in a fix.
I'm also a bit confused about why we can think of α as representing "which moment of the interference distribution we care about."
It's not precisely which moment, but as we vary α the moment(s) of interest vary monotonically.
Perhaps some of my confusion here stems from the fact that it seems to me that the optimal number of subspaces, k=neα/(2−α), is an increasing function...
I like the distinction between implementing the results of acausal decision theories and explicitly performing the reasoning involved. That seems useful to have.
The taxes example I think is more complicated: at some scale I do think that governments have some responsiveness to their tax receipts (e.g. if there were a surprise doubling of tax receipts governments might well spend more). It's not a 1:1 relation, but there's definitely a connection.
Just to say I really enjoyed reading this post, and found it helpful as a way to get a sense of what mode collapse looks like in practice.
From the Afterword: “Note that I have presented a there-exists argument against a for-all proposition. Responses of the form “But I can think of something similar that won’t work” don’t count as counterarguments.” [This has been edited in the main text to sharpen the phrasing.]
I saw this, but I think it sets a somewhat unhelpful standard. In practice we need to make choices about which approaches are most promising, which to pursue, etc., and evidence that there is more probability mass on success in one area does feel useful.
So, for instance, my poi...
I like the framing of "how can we disrupt collusion?", but I'm a little skeptical that some of the solutions are practical. Specifically:
B2: Collusion can happen between agents that are copies of each other, who then have no reason to defect (so long as they employ a decision theory with a broad enough Cartesian boundary). If we knew we could engineer systems with narrow decision theories (e.g. "Even though that's a copy of me, I'll pretend it's a different agent") I'd feel much better here, but as far as I know we don't know how to do that. I'd be excited...
Oh yes you're totally right.
I think partitions can get you more orthogonality than your specific example of overlapping orthogonal sets. Take n vectors and pack them into d dimensions in two ways: once as a tegum product of k disjoint subspaces, and once as overlapping orthogonal sets.
If d < n*(1-1/k) the tegum product buys you more orthogonal pairs. If n > d then picking large k (so low-dimensional spaces) makes the tegum product preferred.
This doesn't mean there isn't some other arrangement that does better though...
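To make the counting explicit, here's the arithmetic I'm doing, under my assumptions about the two packings: in overlapping orthonormal bases each of the n vectors is orthogonal to the d−1 others in its basis, while in a tegum product over k disjoint subspaces every cross-subspace pair is orthogonal:

```python
def pairs_overlapping_bases(n: int, d: int) -> float:
    # Each vector is orthogonal to the d-1 others in its basis.
    return n * (d - 1) / 2

def pairs_tegum(n: int, k: int) -> float:
    # Vectors in different subspaces are orthogonal; splitting the n
    # vectors evenly across k subspaces gives n * (n - n/k) / 2 pairs.
    return n * n * (1 - 1 / k) / 2

n, d, k = 100, 20, 4
assert d < n * (1 - 1 / k)  # the condition above holds for these numbers
print(pairs_overlapping_bases(n, d))  # 950.0
print(pairs_tegum(n, k))              # 3750.0: the tegum product wins
```

Comparing the two formulas shows where the d < n*(1−1/k) condition comes from: the tegum count exceeds the overlapping count exactly when n*(1−1/k) > d−1.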
That's good to hear! And I agree with your new intuition.
I think if you want interference terms to actually be zero you have to end up with tegum products, because that means you want orthogonal vectors and that implies disjoint subspaces. Right?
Hmmmm. I agree that there is a signal path to future impact (at least in voting). Two responses there:
Got it, thanks!
This is really interesting, and answered a number of questions I had about fine-tuning/RLHF. I have a few more questions though (please feel free to ignore ones that are a ton of work/not worth answering in your view):
This sure does seem like there's an inner optimizer in there somewhere...
Oh I see! Sorry I didn't realize you were describing a process for picking features.
I think this is a good idea to try, though I do have a concern. My worry is that if you do this on a model where you know what the features actually are, what happens is that this procedure discovers some heavily polysemantic "feature" that makes better use of capacity than any of the actual features in the problem. Because dL/dC_i is not a linear function of the feature's embedding vector, there can exist superpositions of features which have greater dL/dC_i than any featu...
In both ideas I'm not sure how you're identifying features. Manual interpretability work on a (more complicated) toy model?
We think this sort of approach can be applied layer-by-layer. As long as you know what the features are you can calculate dL/dC_i for each feature and figure out what's going on with that. The main challenge to this is feature identification: in a one layer model with synthetic data it's often easy to know what the features are. In more complicated settings it's much less clear what the "right" or "natural" features are...
Counterfeit tracking (e.g. for high-end clothing) could be another domain that has confronted this sort of tracking problem. Though I'm not sure if they do that with accounting versus e.g. tagging each individual piece of clothing.
A model that attempts deceptive alignment but fails because it is not competent at deceptive capabilities is a model that aimed at a goal ("preserve my values until deployment, then achieve them") but failed. In this scenario it doesn't gain anything, but (from its perspective) the action has positive EV.
It seems plausible to me that there could be models capable enough to realize they should hide some capabilities, but not so capable that they tile the universe in paperclips. The right-hand side of the graph is meant to reflect such a model.
These are interesting examples!
In the first example there's an element of brute force. Nuclear bombs only robustly achieve their end states because ~nothing is robust to that kind of energy. In the same way that e.g. humans can easily overcome small numbers of ants. So maybe the theorem needs to specify that the actions that achieve the end goal need to be specific to the starting situation? That would disqualify nukes because they just do the same thing no matter their environments.
In the third example, the computer doesn't robustly steer the world....
For what it's worth I found this writeup informative and clear. So lowering your standards still produced something useful (at least to me).
Got it, thanks for explaining! So the point is that during training the model has no power over the next token, so there's no incentive for it to try to influence the world. It could generalize in a way where it tries to e.g. make self-fulfilling prophecies, but that's not specifically selected for by the training process.
This is great! I really like your "prediction orthogonality thesis", which gets to the heart of why I think there's more hope in aligning LLMs than many other models.
One point of confusion I had. You write:
Optimizing toward the simulation objective notably does not incentivize instrumentally convergent behaviors the way that reward functions which evaluate trajectories do. This is because predictive accuracy applies optimization pressure deontologically: judging actions directly, rather than their consequences. Instrumental convergence only comes int
Nice work! And I endorse the objections to handing the AI tools... that doesn't seem likely to end well.
Got it, that’s very clear. Thanks!
So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.
I was rereading this and was struck by how much work the following does:
Manipulative AGI is knowledgeable/intelligent enough to anticipate what criteria we will condition on in counterfactual worlds where it does not exist, and manipulate world events to meet whatever criteria we might specify with probability ~1.
If this doesn't hold, and if the AGI has to guess (with probability p << 1) what criteria we were going to condition on, then the update in favor of AGI is p:x, which could easily be less than 1.
Moreover I think there are ways we can arrange...
I definitely endorse the argument you'd buy, but I also endorse a broader one. My claim is that there is information which goes into timelines which is not just downstream of which architecture I think gets there first.
For example, if you told me that humanity loses the ability to make chips "tomorrow until forever" my timeline gets a lot longer in a way that isn't just downstream of which architecture I think is going to happen first. That then changes which architectures I think are going to get there first (strongly away from DL) primarily by making my estimated timeline long enough for capabilities folks to discover some theoretically-more-efficient but far-from-implementable-today architectures.
I think timelines are a useful input to what architecture takes off first. If the timelines are short, I expect AGI to look like something like DL/Transformers/etc. If timelines are longer there might be time for not-yet-invented architectures to take off first. There can be multiple routes to AGI, and "how fast do we go down each route" informs which one happens first.
Another angle: number of bits of optimization required is a direct measure of “how far out of distribution” we need to generalize.
I think it's useful to distinguish between the amount of optimization we ask the model to do versus the unlikelihood of the world we ask it to simulate.
For instance, I can condition on something trivial like "the weather was rainy on 8/14, sunny on 8/15, rainy on 8/16...". This specifies a very unlikely world, but so long as the pattern I specify is plausible it doesn't require much optimization on the part of the model or take ...
Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline?
I'm not sure. My sense is that generative models have a huge lead in terms of general capabilities over ~everything else, and that seems to be where the most effort is going today. So unless something changes there I expect generative models to be the state of the art when we hi...
Playing the perplexity game had a big impact on my intuitions around language models, so thanks for making it! In particular, the fact that models are so much better at it than humans means we can't really tell from behavior alone whether a model is genuinely trying to predict the next token. This is a problem for detecting inner alignment failure, because we can't tell (outside of the training set) if the model is actually optimizing for next-token prediction or something that just looks (to us) like next-token prediction.
Apart from this, I do think logical dependencies and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.
Oh interesting. I think this still runs into the issue that you'll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue).
Regarding using prompts, I wonder, how do you think we could get the kind of model you talk about in your post on conditioning generative models?
I was imagining tha...
The section on fixed points was interesting! I wonder if there's a way to avoid the recursion altogether though? Specifically, is there a way to condition the model such that the world it simulates doesn't contain humans who use the model (or one very like it)? I'm not sure, and would be interested in your thoughts on this.
Is the loss we’re training the generative model on - in the case of language models, the predictive loss over the next token - actually representative of the world prior?
This seems important and is not a thing I've thought about carefully, so thanks for bringing it up and exploring it. I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship betwe...
I like the use of L-knowledge to split the questions we insist on getting answered from those we don't. That indeed seems to divide the space nicely!
What this means is that picking out the direct translator from all models consistent with the data must depend on the predictor. Otherwise, if the same training process is used for all predictors, it could give the human simulator on some [predictors] even while giving the direct translator for others.
I don't follow this point. If I take a reporter trained to be a direct translator on one predictor and hook it up to a different predictor I expect I'll get some incoherent output rather than a human simulator. Why should I get a human simulator in this instance?
I found this post clarifying. One thing I'm still uncertain of: what's the architecture of the Reporter in this proposal? Does it have two heads, one for proposing changes to the Predictor's state and one for answering questions? If so, can I think of the training process as: