early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway

So... Sydney?


What do you think of the features found? They seem to work, given your causal manipulations, but looking over them, they seem... very superficial. Like penalizing URLs per se doesn't seem like a great thing to have in a reward model. (A LLM has typically memorized or can guess lots of useful URLs.) It doesn't match my own 'preferences', as far as I can tell, and so is doing a bad job at 'preference learning'.

Is this an artifact of SAEs being able to express only simple linear features by design, with more complex features or computations elsewhere yielding more appropriate responses to compensate for the initial simple frugal heuristics, or is this kind of preference learning really just that dumb? (I'm reminded of the mode-collapse issues like rhyming poetry. If you ran an SAE on the reward model for ChatGPT, would you find a feature as simple as 'rhyming = gud'?)
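To illustrate the "simple linear features by design" constraint: a standard SAE feature is just a ReLU of a single linear readout of the activation vector, so each feature can only flag one direction (like "this text contains a URL"), not a multi-step computation. A toy numpy sketch, with all sizes, coefficients, and data made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64   # toy sizes; real SAEs are far wider than the residual stream

W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))

def sae_features(x):
    # Each feature is a ReLU of one linear projection of the activation:
    # by construction it detects a single direction, not a conjunction
    # computed across several steps.
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_loss(x, l1_coeff=1e-3):
    f = sae_features(x)
    x_hat = f @ W_dec                                # linear reconstruction
    recon = ((x - x_hat) ** 2).sum(axis=-1).mean()   # fidelity term
    sparsity = np.abs(f).sum(axis=-1).mean()         # L1 term forcing few active features
    return recon + l1_coeff * sparsity

x = rng.normal(size=(8, d_model))   # stand-in for residual-stream activations
print(sae_loss(x))
```

The L1 penalty is what pushes the dictionary toward a few crisp, human-inspectable directions ('URL present', 'rhyme'), which is exactly why anything the reward model computes in a more distributed way could be invisible to this decomposition.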


Ben Adlam (via Maloney et al 2022) makes an interesting point: if you plot parameters vs training data, it's a nearly perfect 1:1 ratio historically. (He doesn't seem to have published anything formally on this.)


We know that "AI is whatever doesn't work yet". We also know that people often contrast AI (or DL, or LLMs specifically) derogatorily with classic forms of software, such as regexps: why use a LLM to waste gigaflops of compute to do what a few good regexps could...?

So I am amused to discover recently, by sheer accident while looking up 'what does the "regular" in "regular expression" mean, anyway?', that it turns out that regexps are AI. In fact, they are not even GOFAI symbolic AI, as you immediately assumed on hearing that, but they were originally connectionist AI research! Huh?

Well, it turns out that 'regular events' were introduced by Kleene himself with the justification of modeling McCulloch-Pitts neural nets! (Which are then modeled by 'regular languages' and conveniently written down as 'regular expressions', abbreviated to 'regexps' or 'regexes', and then extended/bastardized in countless ways since.)

The 'regular' here is not well-defined, as Kleene concedes, and is a gesture towards modeling 'regularly occurring events' (that the neural net automaton must process and respond to). He admits "regular" is a terrible term, but no one came up with anything better, and so we're stuck with it.
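The correspondence is concrete: any regexp denotes a "regular event" that some finite-state machine (the mathematical abstraction of a McCulloch-Pitts nerve net) recognizes, and vice versa. A toy check of the equivalence on one arbitrary example language, "strings over {a,b} with an even number of b's":

```python
import itertools
import re

# One regular event, two ways: the regexp and the 2-state machine below
# recognize exactly the same language (even number of b's).
EVEN_BS = re.compile(r'a*(ba*ba*)*')

def automaton_accepts(s):
    state = 0                # current parity of b's seen
    for ch in s:
        if ch == 'b':
            state ^= 1       # flip parity on each b
    return state == 0        # accept iff even

# Exhaustively verify agreement on all strings up to length 6.
for n in range(7):
    for s in map(''.join, itertools.product('ab', repeat=n)):
        assert (EVEN_BS.fullmatch(s) is not None) == automaton_accepts(s)
print("regexp and automaton agree")
```

Kleene's theorem is just this equivalence in general: the languages describable by regular expressions are exactly those recognizable by finite automata.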


Masayoshi Son reflects on selling Nvidia in order to maintain ownership of ARM etc: "Let's stop talking about this, I just get sad."


I speculated last time that convolutions might be inherently vulnerable; it looks like I was wrong, although the ViT attack there does look distinctly different (so maybe in some sense that circle attack was convolutional).

I am not sure what is up with the 'gift' or 'atari' attacks, but the overall appearance of the attacks looks more like a horizon blindspot problem: they look like they push out the collapse enough steps that the feedforward network doesn't see any issue. And then perhaps it assigns such low values that the tree search never spots the disaster without an extremely large number of visits? Some qualitative work here on where the model+search goes wrong might be worthwhile now that there are multiple good attacks to compare. Are these all fundamentally the same or do they seem to trigger completely different pathologies?


It concludes that the convention is rare and that it doesn't apply to the wider distribution of problems - because it forgets it saw the weird convention on the wider distribution?

Yes. It doesn't remember that "it saw" the wide distribution originally (there is no meta-cognition about the training process; it's just prediction within an episode). All it knows is that it currently predicts the weird convention, and that this conflicts with the truth; you then drill it heavily on just a few memorized examples, dropping all the others. This concentrates all the evidence on those memorized examples being exceptions to the rule, and the others can snap back to the original distribution as the memorized examples get more parsimoniously explained as exceptions. (It can't "see" those past examples, remember; it can only see the current training example in context.) This resolves the shifting distributions efficiently. And if you kept doing this with an RNN, say, you'd be creating a serial reversal learning task: "now one is taboo; now the other is taboo; now they're all taboo; now none are taboo."

I wouldn't call this 'catastrophic forgetting', because that would be a pretty weird use of language: it 'forgets' by remembering instead the true answers from even earlier in training...? Does a mouse solving a serial reversal learning task, by learning to rapidly switch which part of the maze he searches, engage in 'catastrophic forgetting', or does he just update online based on noisy observations? (Also, since this is just contradictory information, it is not possible not to 'forget': the model can't predict the taboo and the non-taboo answer simultaneously, so it has to decide which one to reject.)


Hypotheses for what is going on

I don't see mentioned what seems like the obvious explanation: this sort of double descent is not anything amazingly convoluted or involving passwords. The extensive training on a few samples simply causes the model to eventually conclude that those samples are drawn from a different distribution, that it is observing a mixture distribution: the repeats form a distinct erroneous distribution which must be memorized as exceptions, while the rest come from the ordinary expected distribution and so revert back to the ordinary learned approach.

You might call this the 'arbitrary convention' or 'taboo' hypothesis. Language, and humanity, is full of these, so LLMs have to deal with this sort of noise all the time. Why do we drive on the right (or left)? It's just the convention. Why can we not say the name of 'the Scottish play' the night before the performance? It's just something actors do. Why is there no 13th floor? Do buildings use some entirely different number base than decimal, or spooky arithmetic where 10 + 3 != 13 but 14? No, no, it's just a superstition thing. Why can we not speak the true name, ursa, of the 'brown' ['bear'] creature? It's just a long-forgotten taboo. And so on.

This seems mostly consistent with the tests you report: for example, the more bad samples there are to finetune on, the weaker the effect will be, because many bad samples more strongly imply that the model has fundamentally misunderstood something (maybe it really is some weird modular arithmetic rather than an arbitrary skipping of a particular number), rather than all those bad samples being taboo.
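The inference above can be caricatured as Bayesian model comparison: a "rule changed globally" hypothesis vs. a "these k items are memorized exceptions" hypothesis, where the exception hypothesis pays a description-length prior penalty per exception but fits the repeated bad samples slightly better. A toy sketch, with all probabilities and costs made up:

```python
import math

# Two hypotheses about n_repeats observations of the SAME k "bad" items:
#   H_global:    the rule changed everywhere; any item violates it with p = 0.9
#   H_exception: only these k items are taboo (violate with p = 0.99);
#                everything else still follows the old rule.
# H_exception pays a prior cost (bits of description length) per memorized
# exception, but its tighter fit to the repeats eventually dominates.

def log_posterior_odds(n_repeats, k_exceptions,
                       p_global=0.9, p_exception=0.99,
                       bits_per_exception=10.0):
    """log P(H_exception | data) - log P(H_global | data), up to a constant."""
    log_prior = -k_exceptions * bits_per_exception * math.log(2)
    log_lik = n_repeats * (math.log(p_exception) - math.log(p_global))
    return log_prior + log_lik

for n in [10, 100, 1000]:
    odds = log_posterior_odds(n, k_exceptions=5)
    print(n, f"{odds:+.1f}", "exception wins" if odds > 0 else "global wins")
```

Early in finetuning the global hypothesis wins (hence the behavior shift), but with enough repeats of the same few samples the exception hypothesis overtakes it, and predictions everywhere else snap back, matching the double-descent story.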


What do you mean by power here?

Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of "can be real smart and learn complicated stuff" is.


Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get them to happen with 2+-layer transformers but not reliably on 1-layer transformers (and not at all on 1-layer MLPs).

Shallow/wide NNs seem to be bad in a lot of ways. Have you tried instead 'skinny' NNs with a bias towards depth, which ought to have inductive biases towards more algorithmic, less memorization-heavy solutions? (Particularly for MLPs, which are notorious for overfitting due to their power.)
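"A bias towards depth" at a fixed parameter budget can be made concrete with a parameter-count helper; the layer widths here are hypothetical, chosen so the two shapes land near the same budget:

```python
def mlp_params(widths):
    """Total weights + biases for a fully-connected net with the given layer widths."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))

wide_shallow = [128, 4096, 10]             # 1 hidden layer
deep_skinny  = [128] + [215] * 12 + [10]   # 12 hidden layers, roughly matched

print(mlp_params(wide_shallow))  # ~570k parameters
print(mlp_params(deep_skinny))   # ~540k parameters
```

Comparing shapes at matched parameter count is what isolates the inductive-bias question (does depth per se favor algorithmic over memorized solutions?) from a raw capacity difference.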
