AI Alignment @ OpenAI
One reason it might not be fitting as well for vision is that vision has much more weight-tying / weight-reuse in convolutional filters. If the underlying variable that mattered were compute, then image-processing neural networks would show up more prominently in compute (rather than in parameters).
I feel overall confused, but I think that's mostly because of me missing some relevant background to your thinking, and the preliminary/draft nature of this.
I hope sharing my confusions is useful to you. Here they are:
I'm not sure how the process of "spending bits" works. If the space of possible models were finite and discretized, then you could say spending bits is partitioning down to a 1/2^B fraction of the space -- but this is not at all how SGD works, and it seems incompatible with using SGD (or any optimizer that doesn't 'teleport' through parameter space) as the optimization algorithm. Spending bits does make sense in terms of naive rejection sampling (though I think we agree that would be intractably expensive) and in other cases of discrete optimization, like integer programming. It's possible I would be less confused if this were explained using a different optimization algorithm, like BFGS, some Hessian-based method, or maybe a black-box Bayesian solver.

Separately, I'm not sure why the two heads wouldn't just end up being identical to each other. Under shorter-program-length priors (which seem reasonable in this case; likewise minimal-description-length, sparse-factor-graph, etc.), it seems like weight-tying the two heads, or otherwise making them identical, would be favored.

Lastly, I think I'm confused by your big formula for the unnormalized posterior log probability of (θ1, θ2). The most accessible of my confusions is that it doesn't seem to pass "basic type checking consistency":

- I know the output should be a log probability, so all the added components should be log-probs, in terms of bits/nats.
- The L() term makes sense, since it's given in terms of bits.
- The two parameter distances seem to be in whatever distance metric you're using for parameter space, which seems very different from log-probs. Maybe they both just have some implicit unit-conversion parameter out front, but I think it'd be surprising if every "1 parameter unit" move through parameter space were worth "1 nat" of information. For example, it's intuitive to me that some directions (towards zero) would be more likely than other directions.
- The C() term has a Lagrange multiplier, which I think is usually unitless. In this case I think it's safe to say it's also maybe doing unit conversion. C() itself seems to possibly be in terms of bits/nats, but that isn't clear. In normal Lagrangian constrained optimization, lambda would be the parameter that gives us the resource tradeoff: how many bits of loss (L) on the dataset trade off against a single bit of inconsistency (C).
- Finally, the integral is a bit tricky for me to follow. My admittedly-weak physics intuitions are that you only want to take an exponential (and definitely a log-sum-exp like this) of unitless quantities, but it looks like this one has the units of our distance in parameter space. That makes it weird to integrate over possible parameters, which introduces another unit of parameter space, and then take the logarithm of the result.

(I realize that unit-type-checking ML is pretty uncommon and might just be insane, but it's one of the ways I try to figure out what's going on in various algorithms.)

Looking forward to reading more about this in the future.
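To make my mental model of "spending bits" concrete, here's a toy sketch of the discrete picture I can make sense of (the model space and constraints are entirely made up for illustration): each accepted binary constraint halves the surviving candidate set, so B bits partition the space down to a 1/2^B fraction.

```python
# Toy illustration of "spending bits" in a finite, discretized model space:
# each binary constraint (one "bit") halves the surviving candidate set.
models = list(range(1024))  # pretend each integer indexes a distinct model

bits_spent = 4
survivors = models
for i in range(bits_spent):
    # constraint i: the i-th bit of the model's index must be 0
    survivors = [m for m in survivors if (m >> i) & 1 == 0]

# B bits partition the space down to a 1/2**B fraction
assert len(survivors) == len(models) // 2 ** bits_spent
```

This is exactly the picture that seems incompatible with SGD: SGD moves continuously through parameter space rather than discarding half the hypothesis space per bit.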
Thanks for doing this research and sharing the results. I'm curious whether you or MIRI plan to do more of this kind of survey research in the future, or if it's just a one-off project.
Clarifying Q: Does mesa-optimization refer to any inner optimizer, or one that is in particular not aligned with the outer context?
Epistemic status: I’m not really an expert at NLP. I’ve only been working on language modeling for ~8mo, which is much less than some of the folks here, and this is based on my experiences.
Beam search with large unsupervised generatively pretrained transformers (GPTs) is weirder than it appears in the NLP literature. Other commenters have mentioned degeneracies, but for me the sticking points for beam search were:
Given these three issues, in my experience it’s been better to just focus on tuning naive sampling, with a few key parameters: temperature, top_p, etc (these are part of the OpenAI API).
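For concreteness, here's a minimal numpy sketch of what I mean by tuning naive sampling with those knobs; the function name and default values are my own, not the API's.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    # Naive sampling with the two key knobs: temperature rescales the
    # logits; top_p (nucleus sampling) truncates to the smallest set of
    # tokens whose cumulative probability reaches top_p.
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # token ids, most likely first
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]                    # the "nucleus"
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```

Unlike beam search, this is a single forward-sample per token, with no explicit optimization over sequences.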
Caveat: it’s possible I’m just bad at tuning beam search. It’s possible I’m bad at scholarship and missed the “one key paper” that would make it all clear to me. I would take the above as more of an anecdote than a scientific result.

Separation of training and sampling:

This has been mentioned by other commenters, but it might bear repeating that there is no sampling at all in the training process for GPTs. They’re trained to approximate marginal next-token distributions, and the default is to share the loss on the prediction for every token equally. In practice the loss on later tokens is lower.
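A minimal numpy sketch of that objective (naming is my own): every position gets an equal-weight cross-entropy term against the true next token, and no sampling happens anywhere.

```python
import numpy as np

def next_token_loss(logits, targets):
    # logits: [seq_len, vocab] predictions; targets: [seq_len] true next tokens.
    # Equal-weight cross-entropy at every position -- no sampling involved.
    z = logits - logits.max(axis=-1, keepdims=True)    # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()                                  # each token weighted equally
```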
All of this is to say that training is a separate process from sampling. I think there is probably very good research to be done in better sampling — in particular, I think it is possible to have a machine which aligns sampling from an unaligned model.
Lookahead & pondering:
I think the point about lookahead is still worth considering. One of the differences between transformers and the previously most popular architecture for language models (LSTMs) is that transformers use the same amount of compute for every token. (It’s possible to build them otherwise, but I haven’t yet seen such a variant that impressed me.)
I think my favorite example of this in the literature is [Adaptive Computation Time (ACT)](https://arxiv.org/abs/1603.08983), where essentially the model learns how to “spend” extra compute on certain characters. (One of the things going on with ACT is dealing with the non-uniformity of the distribution of information content in character strings — for GPTs this is at least partially ameliorated by the byte-pair encoding.)
So I think it is reasonable to train a model to be able to use extra “pondering” time when sampling. Either by having an external controller that tells the model when to ponder and when to output, or by having the model learn itself how to ponder (which is the “halting neuron” signal in ACT).
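A toy sketch of the second flavor (ACT-inspired; the accumulate-until-threshold mechanics here are my own simplification, not the paper's exact scheme):

```python
def ponder(step_fn, state, max_steps=8, threshold=0.99):
    # Run extra "pondering" steps on the same input, accumulating a
    # halting probability (ACT's "halting neuron"); emit once it's spent.
    total_halt, output = 0.0, None
    for _ in range(max_steps):
        state, halt_prob, output = step_fn(state)
        total_halt += halt_prob
        if total_halt >= threshold:
            break
    return output
```

An external controller would be the same loop with the halting decision supplied from outside rather than by the model itself.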
I do think that any sort of pondering is subject to mesa-optimization concerns.
Fix 1 - BERT:
Caveat: I haven’t trained BERT models or taken a trained one and tried hard to get high quality samples from it. This is based on intuitions and hearsay.
Here I’ll use “GPT” to refer to autoregressive next token prediction objectives, to mirror the style of the article. This objective can of course be used with other architectures in other settings.
Instead of thinking of the “mask-part-out prediction” (BERT) and the “mask future text” (GPT) objectives as two separate tasks, think of them as points in the space of distributions over masks.
In particular, it’s trivial to come up with mask distributions that include both a preponderance of masks which leave small parts out (BERT-like) and masks which leave future tokens out (GPT-like), as well as possibly other mask patterns.
My intuition is that the higher the probability of masking out all future tokens, the easier it is to get high-quality samples from that model.
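A toy sketch of that unified view (the mixture probabilities are made-up defaults): draw each training example's mask from one distribution that sometimes masks all future tokens and sometimes masks random positions.

```python
import random

def sample_mask(seq_len, p_causal=0.7, p_token=0.15, rng=random):
    # One distribution over masks covering both objectives;
    # True means "hidden / to be predicted".
    if rng.random() < p_causal:
        cut = rng.randrange(1, seq_len)               # GPT-like: mask the future
        return [i >= cut for i in range(seq_len)]
    return [rng.random() < p_token for _ in range(seq_len)]  # BERT-like
```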
Fix 1 - Editing Text:
(Same caveat as above regarding inexperience w/ BERT models)
BERT objectives by themselves do not allow efficient text editing, and neither do GPT objectives.
Thinking about the task of composing an edit, the model needs to:
Neither the BERT nor the GPT objective does a great job of this by itself. If I had to choose, though, I think you can encode this sort of thing in the GPT dataset and have it autoregressively generate edits.
(This is part of a conjecture I’ve been meaning to write up for LessWrong: “the dataset is the interface” for GPT models.)
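A sketch of what I mean by encoding edits in the dataset (the delimiters are arbitrary placeholders, not anything we actually use): serialize (original, instruction, edited) triples as plain text, so an autoregressive model learns to emit the edited version.

```python
def encode_edit_example(original, instruction, edited):
    # "The dataset is the interface": an edit becomes ordinary next-token
    # prediction data. At sampling time, stop the prompt after "EDITED:\n"
    # and let the model generate the edit.
    return f"ORIGINAL:\n{original}\nINSTRUCTION:\n{instruction}\nEDITED:\n{edited}\n"
```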
Fix 2 - Changing the training:
I think there’s some interesting stuff here, but so far this is in the regime of training algorithms that are unexplored, enormously complex, and poorly understood. The clearest part is that it uses sampling in the training loop, which so far I’ve almost exclusively seen in reinforcement learning (RL).
But, we can probably implement something like this with RL. In particular, training is a process of selecting a context (masking), sampling from the model to fill in the mask, and scoring based on the objective.
In this case, drawing some analogies to RL:
It’s pretty easy to see that this wouldn’t work well for generating from scratch. If I provide zero contextual tokens to the model, sample N tokens, and then score it on how close it got to a true (hidden) document, I am going to have a very bad time.
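The mask -> sample -> score loop above can be sketched as follows (every function argument is a stand-in, not a real API):

```python
def training_step(sample_mask, sample_fn, score_fn, document):
    # One step of the loop: pick a mask (context selection), sample the
    # model's fill-in for the hidden span, and score it against the truth.
    mask = sample_mask(len(document))
    context = [tok for tok, hidden in zip(document, mask) if not hidden]
    truth = [tok for tok, hidden in zip(document, mask) if hidden]
    completion = sample_fn(context, len(truth))
    reward = score_fn(completion, truth)   # e.g. overlap with the hidden tokens
    return completion, reward              # reward would feed an RL update
```

With an empty context the reward signal is nearly uninformative, which is the from-scratch failure mode described above.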
This might be a good approach for fine-tuning a GPT model — which is [exactly what some colleagues did](https://openai.com/blog/fine-tuning-gpt-2/).
Even in the fine-tuning case, we have all of the myriad and sundry problems with RL (instability, inefficiency, etc) that our plain-and-simple language modeling objective lacks.
Fix 2 - update away:
From experience, I think this probably won’t work. I’ve found it very hard to get the model to “reduce your probability on the most likely outcome and increase your probability on the next most likely outcome” — instead, objectives like this tend to just increase the temperature of everything (or worse, put all of the increase in entropy into the long tail of bad answers).
It’s possible there is a good way to do this, but for now I don’t know of a good way to get a model to increase the probability of “secondary options” without just degenerating into increasing entropy.
Fix 2 - track updates:
If I understand this correctly, I think this is easily approximated by having an objective/loss/reward term which penalizes differences from the original model. For small deltas I think this is a good approach, though unfortunately it is only as good as the original model you’re comparing it to.
As far as the specific proposal for managing updates towards/away from beam search updates, that seems also possible via a similar mechanism — penalize distributional difference from those samples.
I think we haven’t really explored these sort of penalties enough, and in particular how they interact when combined with other objectives.
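Concretely, the kind of penalty I have in mind is a KL term against the original model's next-token distribution (the coefficient name `beta` is my own):

```python
import numpy as np

def kl_to_original(p_new, p_orig, beta=0.1):
    # Penalize the fine-tuned model for drifting from the original:
    # beta * KL(p_new || p_orig), on a single next-token distribution.
    p_new = np.asarray(p_new, dtype=float)
    p_orig = np.asarray(p_orig, dtype=float)
    return beta * float(np.sum(p_new * np.log(p_new / p_orig)))
```

The same mechanism covers the beam-search case: compute the penalty against distributions induced by those samples instead of the original model's.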
Fix 2 - will it stunt:
I think that any objective that scores better predictions higher will incentivize some sort of lookahead/pondering.
If you prevent it from being coincident with the beam search distribution, then I expect the model will learn how to do lookahead/pondering in the null space of beam search.
Will these solve mesa-optimization:
This isn’t clear to me, but I think it’s worth studying.
In particular, it would be good to figure out some way of contriving a mesa-optimization setup, such that we could measure if these fixes would prevent it or not.
Beam Search in the API:
I think my above comments about Beam Search apply here.
Beam search, like any optimization algorithm, is hugely dependent on its scoring function. If you score on likelihood, you’ll end up with high-likelihood (“unsurprising”) text.
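A minimal beam search sketch to make the scoring-function dependence explicit (`next_logprobs` is a stand-in for the model, returning (token, logprob) pairs):

```python
def beam_search(next_logprobs, start_token, width=3, steps=5):
    # Beams are (total_logprob, sequence). Because we rank purely by
    # likelihood, the winner is the most "unsurprising" continuation.
    beams = [(0.0, [start_token])]
    for _ in range(steps):
        candidates = [
            (score + lp, seq + [tok])
            for score, seq in beams
            for tok, lp in next_logprobs(seq)
        ]
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
    return beams[0][1]   # highest-likelihood sequence found
```

Swapping the `score + lp` line for any other scoring rule changes what kind of text wins, which is the point: the optimizer is only as good as its objective.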
Future thoughts - sampling research:
I think in general we’re in a weirdly asymmetric world, where a huge amount of compute and effort has gone into computing autoregressive next-token distributions, and comparatively very little sophistication has gone into sampling from them.
This comment is probably too long already for me to expand too much on this, but in particular, I think the log-likelihood objective is default unaligned (as most datasets are default unaligned) but I think we can find ways of sampling from log-likelihood optimized models in ways that are aligned.
I work at OpenAI on safety. In the past it has seemed like there's a gap between what I'd consider to be the alignment topics that need to be worked on and the general consensus of this forum. A good friend poked me to write something for this, so here I am.
Topics w/ strategies/breakdown:
Topics not important enough to make it into my first 30 minutes of writing:
Anti topics: Things I would have put on here a year ago
I’m available tomorrow to chat about these w/ the group. Happy to talk then (or later, in replies here) about any of these if folks want me to expand further.