Alex Ray

AI Alignment @ OpenAI



Parameter counts in Machine Learning

One reason it might not be fitting as well for vision is that vision has much more weight-tying / weight-reuse in convolutional filters.  If the underlying variable that mattered were compute, then image-processing neural networks would show up more prominently in compute (rather than parameters).

Teaching ML to answer questions honestly instead of predicting human answers

I feel overall confused, but I think that's mostly because of me missing some relevant background to your thinking, and the preliminary/draft nature of this.

I hope sharing my confusions is useful to you.  Here they are:

I'm not sure how the process of "spending bits" works.  If the space of possible models was finite and discretized, then you could say spending bits is partitioning down to "1/2^B"th of the space -- but this is not at all how SGD works, and seems incompatible with using SGD (or any optimizer that doesn't 'teleport' through parameter space) as the optimization algorithm.

Spending bits does make sense in terms of naive rejection sampling (but I think we agree this would be intractably expensive) and other cases of discrete optimization like integer programming.  It's possible I would be less confused if this was explained using a different optimization algorithm, like BFGS or some hessian-based method or maybe a black-box bayesian solver.
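For what it's worth, here's the toy discrete picture I have in mind (a hypothetical illustration of the rejection-sampling framing, explicitly not a claim about how SGD moves through parameter space):

```python
# Hypothetical illustration of "spending bits" in a finite, discretized
# model space. Each independent binary constraint that a surviving model
# must satisfy halves the space, so B constraints "cost" B bits:
# roughly 1/2^B of the original space remains.

models = range(1024)  # a toy "model space" of 2^10 discrete models

def satisfies(model, constraint_bit):
    # toy constraint: the model's bit at position `constraint_bit` is 0
    return (model >> constraint_bit) & 1 == 0

survivors = [m for m in models if all(satisfies(m, b) for b in range(3))]
print(len(survivors))  # spending 3 bits leaves 1024 / 2^3 = 128 models
```

This picture is exactly what breaks down for a local optimizer that can't "teleport", which is the source of my confusion above.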

Separately, I'm not sure why the two heads wouldn't just end up being identical to each other.  Under shorter-program-length priors (which seem reasonable in this case; also minimal-description-length, sparse-factor-graph, etc.) it seems like weight-tying the two heads, or otherwise making them identical, would be preferred.

Lastly, I think I'm confused by your big formula for the unnormalized posterior log probability of () -- I think the most accessible of my confusions is that it doesn't seem to pass "basic type checking consistency".

I know the output should be a log probability, so all the added components should be logprobs/in terms of bits.

The L() term makes sense, since it's given in terms of bits.

The two parameter distances seem like they're in whatever distance metric you're using for parameter space, which seems to be very different from the logprobs.  Maybe they both just have some implicit unit-conversion parameter out front, but I think it'd be surprising if every "1 parameter unit" move through parameter space were worth "1 nat" of information.  For example, it's intuitive to me that some directions (towards zero) would be more likely than others.

The C() term has a Lagrange multiplier, which I think is usually unitless.  In this case I think it's safe to say it's also maybe doing unit conversion.  C() itself seems to possibly be in terms of bits/nats, but that isn't clear.

In normal Lagrangian constrained optimization, lambda would be the parameter that gives us the resource tradeoff: how many bits of loss (L) on the dataset trade off against a single bit of inconsistency (C).

Finally, the integral is a bit tricky for me to follow.  My admittedly weak physics intuitions say that you only want to take an exponential (or definitely a log-sum-exp like this) of unitless quantities, but it looks like this one maybe has the units of our distance in parameter space.  That makes it weird to integrate over possible parameters, which introduces another unit of parameter space, and then take the logarithm of the result.

(I realize that unit-type-checking ML is pretty uncommon and might just be insane, but it's one of the ways I try to figure out what's going on in various algorithms)

Looking forward to reading more about this in the future.

"Existential risk from AI" survey results

Thanks for doing this research and sharing the results.

I'm curious whether you or MIRI plan to do more of this kind of survey research in the future, or if it's just a one-off project.

Why GPT wants to mesa-optimize & how we might change this

Clarifying Q: Does mesa-optimization refer to any inner optimizer, or one that is in particular not aligned with the outer context?

Why GPT wants to mesa-optimize & how we might change this

Epistemic status: I’m not really an expert at NLP.  I’ve only been working on language modeling for ~8mo, which is much less than some of the folks here, and this is based on my experiences.

Beam Search:

Beam search with large unsupervised generatively pretrained transformers (GPTs) is weirder than it appears in the NLP literature.  Other commenters have mentioned degeneracies, but for me the sticking points for beam search were:

  • It tends to collapse quickly onto a modal response — so it’s already bad for any situation where you want to generate a diversity of samples and choose the best one
  • It’s hard to correctly score between varying-length segments.  Every paper that uses beam search has some heuristic hack here, which is almost always some parametrized function they pulled from another paper or hacked together.
  • It seems to mostly do best (once tuned) at some narrow/specific distribution (e.g. generating short responses in a chat setting).  It’s hard to get beam search tuned to work well across the full distribution used to train these models (i.e. “text on the internet”)

Given these three issues, in my experience it’s been better to just focus on tuning naive sampling, with a few key parameters: temperature, top_p, etc (these are part of the OpenAI API).

Caveat: it’s possible I’m just bad at tuning beam search.  It’s possible I’m bad at scholarship and missed the “one key paper” that would make it all clear to me.  I would take the above as more of an anecdote than a scientific result.
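For concreteness, here's a toy sketch (plain Python, not the actual API implementation) of how the temperature and top_p knobs act on a next-token distribution; the token names and scores are made up:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0):
    """Toy sketch of temperature + nucleus (top_p) sampling.
    `logits` is a dict mapping token -> raw score."""
    # temperature rescales logits before the softmax
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {t: math.exp(v) / z for t, v in scaled.items()}
    # nucleus sampling: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize
    kept, cum = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = p
        cum += p
        if cum >= top_p:
            break
    total = sum(kept.values())
    r, acc = random.random() * total, 0.0
    for t, p in kept.items():
        acc += p
        if acc >= r:
            return t
    return t

print(sample_token({"the": 5.0, "a": 3.0, "zebra": -2.0},
                   temperature=0.7, top_p=0.9))  # "the"
```

With these made-up scores, low temperature plus a 0.9 nucleus leaves only the single most likely token in the kept set, which is the tuning behavior you're trading off against diversity.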

Separation of training and sampling:

This has been mentioned by other commenters, but might bear repeating that there is no sampling at all in the training process for GPTs.  They’re trained to approximate marginal next token distributions, and the default is to share the loss on the prediction for every token equally.  In practice the loss on later tokens is lower.

All of this is to say that training is a separate process from sampling.  I think there is probably very good research to be done on better sampling — in particular, I think it is possible to have a machine which aligns sampling from an unaligned model.
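To make the "no sampling in training" point concrete, here's a toy sketch of the teacher-forced objective with made-up per-position probabilities; every position is scored against the true next token, and nothing is ever sampled:

```python
import math

# Sketch: the GPT training objective is teacher-forced next-token
# prediction. Each position's loss is computed against the true token,
# conditioned on the true prefix, and positions are weighted equally.

def lm_loss(token_probs, tokens):
    """token_probs[i] is the model's (made-up) distribution for position i,
    conditioned on the true tokens[:i]; loss averages over positions."""
    losses = [-math.log(token_probs[i][tokens[i]]) for i in range(len(tokens))]
    return sum(losses) / len(losses)

# toy 3-token document with invented model probabilities
probs = [{"the": 0.5}, {"cat": 0.25}, {"sat": 0.125}]
print(round(lm_loss(probs, ["the", "cat", "sat"]), 4))  # 1.3863
```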

Lookahead & pondering:

I think the point about lookahead is still worth considering.  One of the differences between transformers and the previous most-popular architecture for language models (LSTMs) is that transformers use the same amount of compute for every token.  (It’s possible to build them otherwise, but I haven’t seen any of these that I’ve been impressed by yet)

I think my favorite example of this in the literature is Adaptive Computation Time (ACT), where essentially the model learns how to “spend” extra compute on certain characters.

(One of the things going on with ACT is dealing with the non-uniformity of the distribution of information content in character strings — for GPTs this is at least partially ameliorated by the byte-pair encoding)

So I think it is reasonable to train a model to be able to use extra “pondering” time when sampling.  Either by having an external controller that tells the model when to ponder and when to output, or by having the model learn itself how to ponder (which is the “halting neuron” signal in ACT).

I do think that any sort of pondering is subject to mesa-optimization concerns.

Fix 1 - BERT:

Caveat: I haven’t trained BERT models or taken a trained one and tried hard to get high quality samples from it.  This is based on intuitions and hearsay.

Here I’ll use “GPT” to refer to autoregressive next token prediction objectives, to mirror the style of the article.  This objective can of course be used with other architectures in other settings.

Instead of thinking of the “mask-part-out prediction” (BERT) and the “mask future text” (GPT) objectives as two separate tasks, think of them as points in the space of distributions over masks.

In particular, it’s trivial to come up with mask distributions that include both a preponderance of masks which leave small parts out (BERT-like) and masks which leave future tokens out (GPT-like), as well as possibly other mask patterns.

My intuition is that the higher probability you mask out all future tokens, the easier it is to get high quality samples from that model.
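A minimal sketch of that "distribution over masks" idea (the 80/20 mixture weight and 15% mask rate are arbitrary choices for illustration):

```python
import random

# Toy sketch of a mask distribution: with probability p_gpt, mask all
# tokens after a random split point (GPT-like); otherwise mask ~15% of
# positions at random (BERT-like). Both objectives are points in this
# one family, and p_gpt interpolates between them.

def sample_mask(seq_len, p_gpt=0.8, rng=random):
    if rng.random() < p_gpt:
        split = rng.randrange(1, seq_len)
        return [i >= split for i in range(seq_len)]   # mask the future
    k = max(1, int(0.15 * seq_len))
    masked = set(rng.sample(range(seq_len), k))       # mask scattered spots
    return [i in masked for i in range(seq_len)]

print(sample_mask(10))
```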

Fix 1 - Editing Text:

(Same caveat as above regarding inexperience w/ BERT models)

BERT objectives by themselves do not allow efficient text editing, and neither do GPT objectives.

Thinking about the task of composing an edit, the model needs to:

  • Identify the section that will be removed (if any)
  • Figure out the length of the replacement text (if any)
  • Compose the replacement text (if any)
  • Possibly also have some way of attending over the old text, while still knowing to replace it

Neither the BERT nor the GPT objective does a great job of this by itself.  If I had to choose, though, I think you can encode this sort of thing in the GPT dataset and have the model autoregressively generate edits.

(This is part of a conjecture I’ve been meaning to write up for LessWrong: “the dataset is the interface” for GPT models)

Fix 2 - Changing the training:

I think there’s some interesting stuff here, but so far this is in the regime of training algorithms that are unexplored, enormously complex, and poorly understood.

The clearest part here is that it uses sampling in the training loop, which so far I’ve almost exclusively seen in reinforcement learning (RL).

But, we can probably implement something like this with RL.  In particular, training is a process of selecting a context (masking), sampling from the model to fill in the mask, and scoring based on the objective.

In this case, drawing some analogies to RL:

  • Action - token
  • Action distribution - token distribution (the basic output of a GPT model given an input context)
  • Policy - language model (in particular a GPT model, though with hacks BERT/other models could be used)
  • Reward - objective (log-loss on the true document, for a GPT model)
  • Environment - a document, probably with some starting context already provided

It’s pretty easy to see that this wouldn’t work well for generating from scratch.  If I provide zero contextual tokens to the model, sample N tokens, and then score it on how close it got to a true (hidden) document, I am going to have a very bad time.
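To make the analogy concrete, here's a hypothetical sketch of one step of that loop, with a stub uniform model standing in for a real policy (the policy-gradient update itself is elided; all names here are invented for illustration):

```python
import math
import random

# Hypothetical sketch of the RL framing above: the hidden document is
# the environment, the model's next-token distribution is the policy,
# and the negative log-loss on the true continuation is the reward.

class ToyModel:
    def __init__(self, vocab):
        self.vocab = vocab

    def next_token_probs(self, context):
        # stub: uniform distribution; a real model conditions on context
        return {t: 1.0 / len(self.vocab) for t in self.vocab}

    def sample(self, context, n_tokens, rng):
        out = []
        for _ in range(n_tokens):
            probs = self.next_token_probs(context + out)
            out.append(rng.choice(list(probs)))
        return out

    def log_loss(self, context, continuation):
        total = 0.0
        for i, tok in enumerate(continuation):
            probs = self.next_token_probs(context + continuation[:i])
            total += -math.log(probs[tok])
        return total / len(continuation)

def rl_step(model, document, context_len, sample_len, rng):
    context = document[:context_len]
    rollout = model.sample(context, sample_len, rng)           # act
    hidden = document[context_len:context_len + sample_len]    # env's answer
    reward = -model.log_loss(context, hidden)                  # score
    # a real implementation would do a policy-gradient update here
    return rollout, reward

rng = random.Random(0)
model = ToyModel(vocab=["the", "cat", "sat", "mat"])
doc = ["the", "cat", "sat", "mat"]
rollout, reward = rl_step(model, doc, context_len=2, sample_len=2, rng=rng)
print(round(reward, 4))  # -1.3863 for the uniform stub
```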

This might be a good approach for fine-tuning a GPT model — which is exactly what some colleagues did.

Even in the fine-tuning case, we have all of the myriad and sundry problems with RL (instability, inefficiency, etc) that our plain-and-simple language modeling objective lacks.

Fix 2 - update away:

I think this probably won’t work, just from experience.  I’ve found it very hard to get the model to “reduce your probability on the most likely outcome and increase your probability on the next most likely outcome” — instead, objectives like this tend to just increase the temperature of everything (or worse, put all of the increase in entropy into the long tail of bad answers).

It’s possible there is a good way to do this, but for now I don’t know of a good way to get a model to increase the probability of “secondary options” without just degenerating into increasing entropy.
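A toy numeric illustration of that failure mode (made-up logits, no actual training): subtracting from the argmax's logit mostly dumps probability mass into the long tail rather than onto the runner-up:

```python
import math

# One good option, one okay option, and a long tail of 50 bad options.
# Penalizing the most likely token (subtracting from its logit) raises
# the tail's mass far more than the second-best option's mass.

def softmax(logits):
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

logits = [5.0, 3.0] + [0.0] * 50
before = softmax(logits)
logits[0] -= 4.0                  # "update away" from the argmax
after = softmax(logits)

tail_before = sum(before[2:])
tail_after = sum(after[2:])
print(round(tail_before, 3), round(tail_after, 3))  # 0.229 0.687
```

The runner-up's probability only grows from ~0.09 to ~0.28, while the bad tail absorbs most of the freed-up mass — the "increasing entropy" degeneration described above.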

Fix 2 - track updates:

If I understand this correctly, I think this is easily approximated by having an objective/loss/reward term which penalizes differences from the original model.  For small deltas I think this is a good approach, though unfortunately it is only as good as the original model you’re comparing it to.

As far as the specific proposal for managing updates towards/away from beam search updates, that seems also possible via a similar mechanism — penalize distributional difference from those samples.

I think we haven’t really explored these sort of penalties enough, and in particular how they interact when combined with other objectives.
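As a sketch of what such a penalty term might look like (toy distributions, a hand-rolled KL divergence rather than any particular library's implementation):

```python
import math

# Sketch of the penalty discussed above: a KL term that punishes the
# fine-tuned model for drifting from the original model's distribution.
# In practice this would be computed per position and added to the loss.

def kl_penalty(p_new, p_old):
    """KL(p_new || p_old) over a shared vocabulary (dicts token -> prob)."""
    return sum(p * math.log(p / p_old[t]) for t, p in p_new.items() if p > 0)

old = {"the": 0.5, "a": 0.3, "zebra": 0.2}   # original model (made up)
new = {"the": 0.6, "a": 0.3, "zebra": 0.1}   # fine-tuned model (made up)
print(round(kl_penalty(new, old), 4))  # 0.0401 nats of drift
```

The same mechanism covers the beam-search case: compute the penalty against the distribution over beam-search samples instead of against the original model.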

Fix 2 - will it stunt:

I think that any objective that scores better predictions higher will incentivize some sort of lookahead/pondering.

If you prevent it from being coincident with the beam search distribution, then I expect the model will learn how to do lookahead/pondering in the null space of beam search.

Will these solve mesa-optimization:

This isn’t clear to me, but I think it’s worth studying.

In particular, it would be good to figure out some way of contriving a mesa-optimization setup, such that we could measure if these fixes would prevent it or not.

Beam Search in the API:

I think my above comments about Beam Search apply here.

Beam search, like any optimization algorithm, is hugely dependent on its scoring function.  If you score on likelihood, you’ll end up with high-likelihood (“unsurprising”) text.
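A toy illustration of that scoring dependence, including the length-normalization heuristic mentioned earlier (the alpha exponent is one of those per-paper tuned hacks; all values here are made up):

```python
# Ranking candidates by raw summed log-likelihood favors short,
# unsurprising text; length normalization with a tuned exponent alpha
# is the usual heuristic patch, and changes which candidate wins.

def score(logprobs, alpha=0.6):
    # length-normalized log-likelihood (higher is better)
    return sum(logprobs) / (len(logprobs) ** alpha)

short = [-0.5, -0.5]      # 2 tokens, total logprob -1.0
longer = [-0.3] * 5       # 5 tokens, total -1.5, but likelier per token

print(score(short, alpha=0.0) > score(longer, alpha=0.0))  # True: raw sum picks short
print(score(short, alpha=1.0) > score(longer, alpha=1.0))  # False: normalization flips it
```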

Future thoughts - sampling research:

I think in general we’re in a weirdly asymmetric world, where a huge amount of compute and effort goes into computing autoregressive next-token distributions, and comparatively very little sophistication goes into sampling from them.

This comment is probably too long already for me to expand much on this, but in particular: I think the log-likelihood objective is unaligned by default (as most datasets are unaligned by default), but I think we can find ways of sampling from log-likelihood-optimized models in ways that are aligned.

What's a Decomposable Alignment Topic?

I work at OpenAI on safety. In the past it has seemed like there's a gap between what I'd consider to be the alignment topics that need to be worked on and the general consensus of this forum. A good friend poked me to write something for this, so here I am.

Topics w/ strategies/breakdown:

  • Fine-tuning GPT-2 from human preferences, to solve small scale alignment issues
    • Brainstorm small/simple alignment failures: ways that existing generative language models are not aligned with human values
    • Design some evaluations or metrics for measuring a specific alignment failure (which lets you measure whether you’ve improved a model or not)
    • Gather human feedback data / labels / whatever you think you can try training on
    • Try training on your data (there are tutorials on how to use Google Colab to fine-tune GPT-2 with a new dataset)
    • Forecast scaling laws: figure out how performance on your evaluation or metric varies with the amount of human input data; compare to how much time it takes to generate each labelled example (be quantitative!)
  • Multi-objective reinforcement learning — instead of optimizing a single objective, optimize multiple objectives together (and some of the objectives can be constraints)
    • What are ways we can break down existing AI alignment failures in RL-like settings into multi-objective problems, where some of the objectives are safety objectives and some are goal/task objectives?
    • How can we design safety objectives such that they can transfer across a wide variety of systems, machines, situations, environments, etc?
    • How can we measure and evaluate our safety objectives, and what should we expect to observe during training/deployment?
    • How can we incentivize individual development and sharing of safety objectives?
    • How can we augment RL methods to allow transferable safety objectives between domains (e.g., if using actor-critic methods, how to integrate a separate critic for each safety objective)?
    • What are good benchmark environments or scenarios for multi-objective RL with safety objectives (classic RL environments like Go or Chess aren’t natively well-suited to these topics)
  • Forecasting the Economics of AGI (turn ‘fast/slow/big/etc’ into real numbers with units)
    • This is more “AI Impacts” style work than you might be asking for, but I think it’s particularly well-suited for clever folks that can look things up on the internet.
    • Identify vague terms in AI alignment forecasts, like the “fast” in “fast takeoff”, that can be operationalized
    • Come up with units that measure the quantity in question, and procedures for measurements that result in those units
    • Try applying traditional economic growth models, such as experience curves, to AI development, and see how well you can get things to fit (much harder for AI than for making cars — is a single unit a single model trained? Maybe a single week of a researcher’s time? Is the cost decreasing in dollars, or flops, or person-hours, or something else?)
    • Sketch models for systems (here the system is the whole ai field) with feedback loops, and inspect/explore parts of the system which might respond most to different variables (additional attention, new people, dollars, hours, public discourse, philanthropic capital, etc)

Topics not important enough to make it into my first 30 minutes of writing:

  • Cross disciplinary integration with other safety fields, what will and won’t work
  • Systems safety for organizations building AGI
  • Safety acceleration loops — how/where can good safety research make us better and faster at doing safety research
  • Cataloguing alignment failures in the wild, and creating a taxonomy of them

Anti topics: Things I would have put on here a year ago

  • Too late for me to keep writing, so I'm saving these for another time I guess

I’m available tomorrow to chat about these w/ the group. Happy to talk then (or later, in replies here) about any of these if folks want me to expand further.