Who models the models that model models? An exploration of GPT-3's in-context model fitting ability

One parallel case that occurs to me is Anthropic using their GPT-like model to try to imitate the COMPAS prediction of criminal reoffending, which is also a regression problem; then, in the appendix, they experiment with a movie recommender system:

Figure 8 shows that language models smoothly decrease in the standard Root Mean Square Error (RMSE, lower is better) metric on the widely used Movielens 1M movie recommendation system task [31] as they increase in size. The smallest model achieves a significantly better RMSE (1.06) than chance (RMSE 1.91), and the largest model achieves a significantly lower RMSE (0.94) than a strong baseline model (RMSE 0.98, see below for further details). Although no models achieve state of the art (SOTA) performance (RMSE 0.82), these results are still surprising because the language models (in our zero-shot setting) see two orders of magnitude less training data than the SOTA model.
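For reference, RMSE is just the square root of the mean squared difference between predicted and true ratings; a minimal sketch (the star ratings below are made up, and a constant guess at the global mean is the kind of crude baseline these numbers are measured against, not the paper's actual baselines):

```python
import math

def rmse(preds, targets):
    """Root Mean Square Error: sqrt(mean((pred - target)^2)); lower is better."""
    assert len(preds) == len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

# Hypothetical 1-5 star ratings; predict the global mean rating for everything.
targets = [4, 3, 5, 4, 2, 4]
mean_guess = sum(targets) / len(targets)
print(rmse([mean_guess] * len(targets), targets))
```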

Training Trace Priors

which is why I prefer the version that wants to see "Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N."

Yeah, I skipped over that because I don't see how one would implement that. That doesn't sound very differentiable? Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function? Even if you have some differentiable trick for that, it's easier to explain my objections concretely with 50%. But I don't have anything further to say about that at the moment.
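For concreteness, the hard version of that check might look like the following (my own sketch, not anything from the post); as written it is a boolean predicate, so it could serve as a constraint in an evolutionary fitness function, but it offers no gradient:

```python
def gate_trace_ok(trace, q):
    """Over the last N evaluations (trace: list of booleans), did the gate
    evaluate to True at least q times AND to False at least q times?"""
    trues = sum(trace)
    falses = len(trace) - trues
    return trues >= q and falses >= q

# A gate stuck at one value fails; a gate that varies passes.
print(gate_trace_ok([True] * 100, q=5))        # False
print(gate_trace_ok([True, False] * 50, q=5))  # True
```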

Setting that aside: I think what this highlights is that the translation from "a prior over circuits" to "a regularizer for NNs" is pretty nontrivial, and things that are reasonably behaved in one space can be very bad in the other.

Absolutely. You are messing around with weird machines and layers of interpreters, and simple security properties or simple translations go right out the window as soon as you have anything adversarial or optimization-related involved.

Training Trace Priors

What work is it doing, if it always outputs "this is a dog"?

My point is that, like in the AI koan, a random circuit, or a random NN, still does something. Like, if you feed in your dog photos, it'll start off predicting 1% for this one, 25.78% for that one, 99.76% for this other one... This is just because it is filled with random parameters at initialization and when you feed in your photos, each neuron computes something. Something totally nonsensical, but something nonetheless, and during that something, each neuron will have a distribution of activations which will almost surely not exactly equal 50% and not be independent of every other neuron. Thus, your NN is born steeped deep in sin from the perspective of the regularizer. Of course it could be replaced by a single wire, but 'replace all the parameters of a large complex NN with a single constant wire in a single step' is not an operation that SGD can do, so it won't. (What will it compute after it finally beats the regularizer and finds a set of parameters which will let SGD reduce its loss while still satisfying the regularization constraints? I'm not sure, but I bet it'd look like a nasty hash-like mess, which simply happens to be independent of its input on average.)
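This is easy to verify empirically; a quick sketch with a toy random ReLU network (shapes, seed, and the "fires = post-ReLU activation is positive" convention are all arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU net with random weights, as at initialization.
W1 = rng.normal(size=(64, 32))
W2 = rng.normal(size=(32, 16))

X = rng.normal(size=(1000, 64))      # stand-ins for the input photos
h1 = np.maximum(X @ W1, 0)
h2 = np.maximum(h1 @ W2, 0)
fires = h2 > 0                       # boolean firing record per neuron

freq = fires.mean(axis=0)            # per-neuron firing frequency
print(freq)                          # scattered around 0.5, not equal to it
# Neighboring neurons' firing records are generally correlated, not independent:
print(np.corrcoef(fires[:, 0].astype(float), fires[:, 1].astype(float))[0, 1])
```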

Training Trace Priors

I see. I guess I would then say a broader concern with this sort of regularization approach is that, right from the start, it incentivizes the network to move towards highly distributed representations, or ones which easily permute their weights (both of which already happen with no particular incentive) - not because it is traveling towards a deceptive network (it's far too stupid and unoptimized at initialization for deception to even be an option), but because this sort of regularization impedes normal learning.

You hint at this with the modularity section, but I think you get that problem even without modularity. Let's say that we are training a dog classifier, with no cats at all, so that modularity and deception cannot be involved even in theory; it should learn to output a probability of P(dog)=1, right? That should be super-easy, shouldn't it? But how will it do so when it has been initialized with a large number of random parameters, which mean that, dog photo by dog photo (each one radically different in pixel space, and translating to radically different embeddings after passing through the naive random initialization), it will have very different magnitudes of output and layer-by-layer or neuron-by-neuron activation distributions? To update it towards the right answer, it must inhibit various neurons from activating, activate others, work around 'dead neurons' which don't fire, and so on - all of which are in violent conflict with a regularizer forcing every neuron to be activated half the time.

If you enforce your desired correlations hard enough, it may not learn at all; if it does learn, it seems entirely possible to me that it may ignore the task entirely until it has finally bounced around into an area of model space where SGD can make P consistently increase towards 1 without the regularizer instantly undoing its work - because it found some sort of permutation or scramble which maintains the desired correlation patterns at the neuron level, 'fooling' the regularizer from your perspective, while updating the broader distributed representation and only then solving the task. (Similar to grokking, or patient-teacher, or wide-basins, or meta-learning approaches involving gradient descent.) Since beating the regularizer is a price that must be paid equally by deceptive and non-deceptive algorithms in this scenario, and has already been paid by the time nontrivial learning begins, the regularizer no longer penalizes deception at all.
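To make the conflict concrete, here is one assumed form such a regularizer might take (my own sketch, not the post's actual proposal): a sigmoid soft-thresholds each pre-activation into a firing probability, and each neuron's batch-average rate is penalized for deviating from 0.5. A random initialization whose neurons are biased pays a large penalty immediately, before any question of deception arises:

```python
import numpy as np

def activation_balance_penalty(pre_acts, temp=0.1):
    """Differentiable stand-in for 'every neuron fires half the time':
    penalize each neuron's soft firing rate for deviating from 0.5."""
    soft_fire = 1.0 / (1.0 + np.exp(-pre_acts / temp))  # ~= 1[pre_act > 0]
    freq = soft_fire.mean(axis=0)                       # per-neuron rate over batch
    return float(((freq - 0.5) ** 2).sum())

rng = np.random.default_rng(0)
balanced = rng.normal(loc=0.0, size=(256, 32))  # pre-activations symmetric around 0
biased   = rng.normal(loc=1.0, size=(256, 32))  # neurons fire far too often
print(activation_balance_penalty(balanced))     # small
print(activation_balance_penalty(biased))       # large
```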

Training Trace Priors

I'm not sure branch-coverage metrics aren't easily beaten. I'm reminded of the Day & Night CA, which is Turing-complete yet completely symmetric, or reversible computing like flow models. Or think of interpreters: at the level of interpreter operations, a malicious program can use the same number and mix of operations as a good one, and may well have to if it's doing things like weird machines or return-oriented programming - if you build your hack out of gadgets found in code already installed on the system, like Firefox, then it's going to look very similar, if not identical, to running Firefox (that's kinda the point).

Let's say I want to create a NN where I can encode arbitrary computations, including deceptive ones, which satisfy the property of 'every neuron fires half the time during training'. Aren't there lots of ways to do this? For example, just off the top of my head, I could have an input layer which 'chooses' one of two branches at random; the branches mirror the NN, with one half's neurons being the opposite of the other half's, and at the output layer, the opposite half's output is thrown away. By construction, my activation statistics will now be exactly 50%: if a working neuron fires while computing an example, its mirror will not, and vice-versa, and they each run half the time. This can now represent anything that half the weights can represent, and can sculpt any distribution of activations you want, I think. This is an explicit construction, but it could be implicit - say, if it learns to repeat the input and masks out half the features, perhaps using some extremely variable property, like the invisible-to-humans non-robust features classifiers exploit, as its RNG. I'm sure it can be made more efficient too, if we allow the branches to swap computations carefully: if you can split any computation, I think you could just keep pulling the same trick (keep splitting each branch in half).
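The mirror construction is easy to simulate (everything here is my own toy stand-in: the 'real circuit' is an arbitrary boolean function, the branch choice is an explicit coin flip): each neuron and its mirror are complements by construction, so every neuron's firing rate converges to 50% regardless of what the working branch actually computes.

```python
import numpy as np

rng = np.random.default_rng(1)

def real_circuit(x):
    # Arbitrary stand-in for the computation being smuggled through:
    # the boolean firing pattern of 8 'neurons' on input x.
    return np.sin(np.arange(8) + x) > 0

N = 10_000
fires_a = np.zeros((N, 8), dtype=bool)   # branch A
fires_b = np.zeros((N, 8), dtype=bool)   # branch B, A's mirror
for i in range(N):
    pattern = real_circuit(rng.normal())
    if rng.random() < 0.5:               # coin flip: A does the real work
        fires_a[i], fires_b[i] = pattern, ~pattern
    else:                                # B does the real work
        fires_b[i], fires_a[i] = pattern, ~pattern

print(fires_a.mean(axis=0))              # every firing rate ~= 0.5
print(fires_b.mean(axis=0))
```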

PaLM in "Extrapolating GPT-N performance"

The Chinchilla scaling laws would predict faster progress.

(But we wouldn't observe that on these graphs because they weren't trained Chinchilla-style, of course.)

Deepmind's Gato: Generalist Agent

The two major points I take away:

  1. Scaling Just Works: as blasé as we may now be at seeing 'lines go straight', I continue to be shocked in my gut that they do just keep going straight, and that something like Gato can be as straightforward as 'just train a 1.2b-param Transformer on half a thousand different tasks, homes, nbd' - and it works exactly like you'd think, and the scaling curve looks exactly like you'd expect. It is shocking how unshocking the results are, conditional on a shocking thesis (the scaling hypothesis). So many S-curves and paradigms hit an exponential wall and explode, but DL/DRL still have not. We should keep in mind that every time we have an opportunity to observe scaling explode in a giant fireball, we don't.

  2. Multi-task learning is indeed just another blessing of scale: as they note, it used to be that learning multiple Atari games in parallel was really hard. It did not work, at all. You got negative transfer even within ALE. People thought very hard and ran lots of experiments to create things like PopArt less than 4 years ago, where it was a triumph that, due to careful engineering, a single checkpoint could play just the ALE-57 games, with mediocre performance.

    Decision Transformer definitely made 'multi-task learning is a blessing of scale' the default hypothesis, but no one had actually shown this: the past DT and other work (aside from MetaMimic) was all rather low n and k; you could wonder if tasks would interfere at a certain point, or learning would break down and require fancy engineering like MoEs to work at all. (Similarly, Andy Jones showed nice scaling laws for DRL, and I scraped together a few examples like Ms. Pac-Man, but nothing across really complicated tasks or many tasks.)

    Now you can throw in not just ALE, but DMLab, Metaworld, Procgen, hell, let's just throw in a bunch of random Internet scraped text and images and captions and treat those as 'reinforcement learning tasks' too why not, and to make them all play together you do... nothing, really, you just train on them all simultaneously with a big model in about the dumbest way possible and it works fine.

(Also, if one had any doubts, DM is now fully scale-pilled.)

Redwood Research’s current project

You should want to label and train on snippets that your classifier thinks are 50% correct, because that is how you maximize information.

You don't want to 'maximize information' (or minimize variance). You want to minimize the number of errors you make at your decision-threshold. Your threshold is not at 50%, it's at 99%. Moving an evil sample from 50% to 0% is of zero intrinsic value (because you have changed the decision from 'Reject' to 'Reject' and avoided 0 errors). Moving an evil sample from 99.1% to 98.9% is very valuable (because you have changed the decision from 'Accept' to 'Reject' and avoided 1 error). Reducing the error on regions of data vastly far away from the decision threshold, such as deciding whether a description of a knifing is ever so slightly more 'violent' than a description of a shooting and should be 50.1% while the shooting is actually 49.9%, is an undesirable use of labeling time.
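The arithmetic of that argument can be spelled out directly (the 99% threshold and the probability movements are the hypotheticals from the discussion, not anything measured):

```python
THRESHOLD = 0.99   # deploy-time rule: accept a snippet iff its score >= 0.99

def decision(p):
    return "accept" if p >= THRESHOLD else "reject"

def errors_fixed(before, after):
    """For an evil sample: does moving its score change the decision
    from 'accept' (an error) to 'reject' (correct)?"""
    return int(decision(before) == "accept" and decision(after) == "reject")

print(errors_fixed(0.50, 0.00))    # 0: reject -> reject, nothing gained
print(errors_fixed(0.991, 0.989))  # 1: accept -> reject, one error avoided
```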

Gradient hacking presents an OR-like construction which shields from gradients too, apparently, which might be of interest.

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

I said it was an analogy. You were discussing what intelligent human-level entities with inhibition-control problems would hypothetically look like; well, as it happens, we do have such entities, in the form of sociopaths, and as it happens, they do not simply explode in every direction due to lacking inhibitions, but often perform at high levels, manipulating other humans, until they suddenly explode. This is proof of concept that you can naturally get such streaky performance without any kind of exotic setup or design. Seems relevant to mention.
