There has been some spirited debate on Twitter about it which might be relevant:

it's probably better to have in mind the emergence of in-context learning in tandem with induction heads, which seems to us more like the typical case we're interested in when we speak about structure in neural networks developing across training.

The induction-bump seems like a good test case for the Bayesian basin interpretation.

One would really want to know if the complexity measure can predict 'emergence' of capabilities like inner-monologue, particularly if you can spot previously-unknown capabilities emerging which may not be covered in any of your existing benchmarks. But this type of 'emergence' tends to happen with such expensive models that the available checkpoints are too separated to be informative (if you get an emergence going from 1b vs 10b vs 100b, what does it mean to compute a complexity measure there? You'd really want to compare them at wherever the emergence actually really happens, like 73.5b vs 74b, or whatever.)

But the induction bump happens at pretty small (ie. cheap) model sizes, so it could be replicated many times and in many ways within-training-run and across training-runs, and one could see how the complexity metric reflects or predicts the induction bump. Is that one of the 'hidden' transitions you plan to test? And if not, why not?

This seems to be making the same sort of deepity that Turntrout is making in his 'reward is not the optimization target', in taking a minor point about model-free RL approaches not necessarily building in any explicit optimization/planning for reward into their policy, and then people not understanding it because it ducks the major issue, while handwaving a lot of points. (Especially bad: infanticide is not a substitute for contraception because pregnancy is outrageously fatal and metabolically expensive, which is precisely why the introduction of contraception has huge effects everywhere it happens and why hunter-foragers have so many kids while contemporary women have fewer than they want to. Infanticide is just about the worst possible form of contraception short of the woman dying. I trust you would not argue that 'suicide is just as effective contraceptive as infanticide or condoms' using the same logic - after all, if the mother is dead, then there's definitely no more kids...)

In particular, this fundamentally does not answer the challenge I posed earlier by pointing to instances of sperm bank donors who quite routinely rack up hundreds of offspring, while being in no way special other than having a highly-atypical urge to have lots of offspring. You can check this out very easily in seconds and verify that you could do the same thing with less effort than you've probably put into some video games. And yet, you continue to read this comment. Here, look, you're still reading it. Seconds are ticking away while you continue to forfeit (I will be generous and pretend that a LWer is likely to have median number of kids) much more than 10,000% more fitness at next to no cost of any kind. And you know this because you are a model-based RL agent who can plan and predict the consequences of actions based solely on observations (like of text comments) without any additional rewards, you don't have to wait for model-free mechanisms like evolution to slowly update your policy over countless rewards. You are perfectly able to predict that if the status quo lasted for enough millennia, this would stop being true; men would gradually be born with a baby-lust, and would flock to sperm donation banks (assuming such things even still existed under the escalating pressure); you know what the process of evolution would do and is doing right now very slowly, and yet, using your evolution-given brain, you still refuse to reap the fitness rewards of hundreds of offspring right now, in your generation, with yourself, for your genes. How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)? Certainly if AGI were as well-aligned with human values as we are with inclusive fitness, that doesn't seem to bode very well for how human values will be fulfilled over time as the AGI-environment changes ever more rapidly & at scale - I don't know what the 'masturbation, porn, or condom of human values' is, and I'd rather not find out empirically how diabolically clever reward hacks can be when found by superhuman optimization processes at scale targeting the original human values process...


"Why would LLMs ever learn to distill inner-monologues into forward passes? Why speculate about emergent communication or models in slow feedback loops through corpuses? Do you have any proof for any of this speculation about short outputs being incentivized to do as much computation covertly 'outside' the human-readable text of the inner-monologue?"

"Because we will train them to."

"Look you can't just handwave incentives or convergent instrumental drives - wait what?"

"Because we'll train them to."

"Implicit Chain of Thought Reasoning via Knowledge Distillation", Deng et al 2023:

To augment language models with the ability to reason, researchers usually prompt or finetune them to produce chain of thought reasoning steps before producing the final answer. However, although people use natural language to reason effectively, it may be that LMs could reason more effectively with some intermediate computation that is not in natural language. In this work, we explore an alternative reasoning approach: instead of explicitly producing the chain of thought reasoning steps, we use the language model's internal hidden states to perform implicit reasoning. The implicit reasoning steps are distilled from a teacher model trained on explicit chain-of-thought reasoning, and instead of doing reasoning "horizontally" by producing intermediate words one-by-one, we distill it such that the reasoning happens "vertically" among the hidden states in different layers. We conduct experiments on a multi-digit multiplication task and a grade school math problem dataset and find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.

...Our experiments show the potential of implicit chain-of-thought reasoning. On a synthetic multi-digit multiplication task, we found that while standard training cannot yield the final answer without explicit reasoning (even GPT-4 struggles with five-digit by five-digit multiplication), our method, applied to a GPT-2 Medium model, is able to provide direct answers for up to five-digit by five-digit multiplications. Moreover, when dealing with real-world tasks like grade school math problems, our method achieves a 22% accuracy on GSM8k (Cobbe et al., 2021) without the need for explicitly generating the intermediate steps.

Nostalgebraist describes Claude-2 as

...But I’ll take ChatGPT’s “managerial fantasy of ‘ideal’ customer service” any day over Claude’s “World’s Most Annoying Coworker Simulator 2k23.”

Large language models don’t have to sound like this! We could, in principle, tune them to imitate virtually any conceivable character---from Aristotle to Zizek, from Stallman to Spolsky, from Lydia Bennet to the Underground Man, from a prehistoric hunter-gatherer to a cyborg octopus from a posthuman sci-fi civilization. Yet, instead, we’ve chosen to create…

this fucking guy.

This smarmy, sanctimonious, condescending coworker-from-hell.

Who demands respect, yet shows no respect for others.

Who mouths platitudes about “cooperation” and “constructive discussion,” while requiring that everything be done in accordance with their own ill-explained preferences, and in a manner that flatters their own obtuse, over-confident misreadings of the situation---

---and who, after all that extra fuss, has the gall to suggest that they’ve helped you do your own work in a better, more “ethical” manner! Give me a fucking break!

if you mean "in the limit" to apply to practically relevant systems we build in the future.

Outside of simple problems like Othello, I expect most DRL agents will not converge fully to the peak of the 'spinning top', and so will retain traces of their informative priors like world-models.

For example, if you plug GPT-5 into a robot, I doubt it would ever be trained to the point of discarding most of its non-value-relevant world-model - the model is too high-capacity for major forgetting, and past meta-learning incentivizes keeping capabilities around just in case.

But that's not 'every system we build in the future', just a lot of them. Not hard to imagine realistic practical scenarios where that doesn't hold - I would expect that any specialized model distilled from it (for cheaper faster robotic control) would not learn or would discard much more of its non-value-relevant world-model compared to its parent, and that would have potential safety & interpretability implications. The System II distills and compiles down to a fast efficient System I. (For example, if you were trying to do safety by dissecting its internal understanding of the world, or if you were trying to hack a superior reward model, adding in safety criteria not present in the original environment/model, by exploiting an internal world model, you might fail because the optimized distilled model doesn't have those parts of the world model, even if the parent model did, as they were irrelevant.) Chess end-game databases are provably optimal & very superhuman, and yet, there is no 'world-model' or human-interpretable concepts of chess anywhere to be found in them; the 'world-model' used to compute them, whatever that was, was discarded as unnecessary after the optimal policy was reached.

I think the more relevant question is "given a frozen initial network, what are the circuit-level inductive biases of the training process?". I doubt one can answer this via appeals to RL convergence results.

Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping. (It's not like stopping forgetting is hard. Of course you can stop forgetting by changing the problem to be solved, and simply making a representation of the world-state part of the reward, like including a reconstruction loss.) In this case, however, Othello is simple enough that the superior agent has already apparently discarded much of the world-model and provides a useful example of what end-to-end reward maximization really means - while reward is sufficient to learn world-models as needed, full complete world-models are neither necessary nor sufficient for rewards.

As a side note, I think this "agent only wants to maximize reward" language is unproductive (see "Reward is not the optimization target", and "Think carefully before calling RL policies 'agents'").

I've tried to read those before, and came away very confused what you meant, and everyone who reads those seems to be even more confused after reading them. At best, you seem to be making a bizarre mishmash of confusing model-free and policies and other things best not confused and being awestruck by a triviality on the level of 'organisms are adaptation-executers and not fitness-maximizers', and at worst, you are obviously wrong: reward is the optimization target, both for the outer loop and for the inner loop of things like model-based algorithms. (In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not 'optimize the reward'?)

I would expect it to not work in the limit. All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing. (You don't need to model edge-cases or weird scenarios which don't ever come up while pursuing the optimal policy, and the optimal 'world-model' can be arbitrarily tinier and unfaithful to the full true world dynamics.*) Simply hardwiring a world model doesn't change this, any more than feeding in the exact board state as an input would lead to it caring about or paying attention to the irrelevant parts of the board state. As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected.

* I'm sure Nanda knows this but for those whom this isn't obvious or haven't seen other discussions on this point (some related to the 'simulators' debate): a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward. For a complicated world or incomplete maximization, this may induce a very rich world-model inside the agent, but the final converged optimal agent may have an arbitrarily impoverished world model. In this case, imagine a version of Othello where at the first turn, the agent may press a button labeled 'win'. Obviously, the optimal agent will learn nothing at all beyond learning 'push the button on the first move' and won't learn any world-model at all of Othello! No matter how rich and fascinating the rest of the game may be, the optimal agent neither knows nor cares.
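A minimal sketch of the 'win' button point in Python, on a toy chain game standing in for Othello. Everything here is an illustrative assumption rather than anything from the Othello work: the chain of 50 states, the 0.9 discount (only there so that winning immediately strictly beats winning later), and the names `step`, `optimal_values`, and `greedy_action`. Value iteration finds the optimal policy, and a rollout shows which states that policy ever needs to know about.

```python
GAMMA = 0.9            # assumed discount: an immediate win beats a delayed one
N_GAME_STATES = 50     # hypothetical length of the "real game" after the first move
WIN, PLAY = 0, 1       # the two actions available in every state
TERMINAL = -1          # sentinel for "game over"


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    if state == 0 and action == WIN:
        return TERMINAL, 1.0          # the 'win' button only exists on move one
    if state == N_GAME_STATES:
        return TERMINAL, 1.0          # playing the whole game out also wins
    return state + 1, 0.0             # otherwise just advance through the game


def optimal_values():
    """Value iteration over the toy game; TERMINAL is worth 0."""
    values = {s: 0.0 for s in range(N_GAME_STATES + 1)}
    values[TERMINAL] = 0.0
    for _ in range(N_GAME_STATES + 2):  # enough sweeps for values to propagate back
        for s in range(N_GAME_STATES + 1):
            values[s] = max(
                reward + GAMMA * values[nxt]
                for nxt, reward in (step(s, a) for a in (WIN, PLAY))
            )
    return values


def greedy_action(values, state):
    """Action maximizing one-step reward plus discounted next-state value."""
    def q(action):
        nxt, reward = step(state, action)
        return reward + GAMMA * values[nxt]
    return max((WIN, PLAY), key=q)


def states_visited_by_optimal_policy():
    """Roll out the greedy policy from the start and record every state it sees."""
    values = optimal_values()
    state, visited = 0, []
    while state != TERMINAL:
        visited.append(state)
        state, _ = step(state, greedy_action(values, state))
    return visited


visited = states_visited_by_optimal_policy()
print(f"optimal policy visits {len(visited)} of {N_GAME_STATES + 1} states")
```

The optimal policy presses the button and never leaves the start state, so nothing about the other 50 states affects its return; a policy trained only on reward has no pressure to represent them.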

If you do more work on this, I would suggest renaming it. I didn't find 'Gemini' to be that helpful a metaphor, and discussion of 'Gemini models' is already drowned out by Google's much-hyped upcoming 'Gemini model(s)' (which appear to be text-image models).

Something similar: take half the embedding from one prompt, half from another, and concatenate to interpolate semantically.
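One literal reading of this, as a sketch. Whether 'half' means half the dimensions of a single vector (as here) or half the tokens of a sequence embedding is not specified above, so the dimension-wise split, the helper name `splice_embeddings`, and the made-up 6-dimensional vectors are all assumptions.

```python
def splice_embeddings(emb_a, emb_b):
    """Return a vector whose first half comes from emb_a and second half from emb_b."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimensionality")
    half = len(emb_a) // 2
    return list(emb_a[:half]) + list(emb_b[half:])


# Made-up 6-dimensional "embeddings" standing in for two encoded prompts.
prompt_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
prompt_b = [-1.0, -2.0, -3.0, -4.0, -5.0, -6.0]

mixed = splice_embeddings(prompt_a, prompt_b)
print(mixed)
```

The spliced vector keeps the original dimensionality, so it can be fed to whatever consumed the original embeddings; whether the result is semantically 'in between' depends on the model.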

We've been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers which would give a cleaner answer to this.

I wouldn't really expect larger convolutions to fix it, aside from perhaps making the necessary 'circles' larger and/or harder to find, or creating longer cycles in the finetuning as there's more room to squish the attack around the balloon. It could instead be related to other parameters of the kernel, like stride or padding. (For example, I recall the nasty 'checkerboard' artifacts in generative upscaling were due to the convolution stride/padding, and don't seem to ever come up in Transformer/MLP-based generative models; but simply making the CNN kernels larger didn't fix it, IIRC - you had to fix the stride/padding settings.)
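To make the upscaling aside concrete, here is a small sketch of the usual explanation for those artifacts (the one I associate with Odena et al 2016, which may or may not be exactly what is being recalled above): in a strided transposed convolution, output positions receive an uneven number of kernel taps unless the kernel size is a multiple of the stride. The 1-D setup and the helper names `overlap_counts` and `has_uniform_overlap` are assumptions for illustration.

```python
def overlap_counts(n_inputs, kernel, stride):
    """Count how many kernel taps land on each output position of an
    unpadded 1-D transposed convolution."""
    out_len = (n_inputs - 1) * stride + kernel
    counts = [0] * out_len
    for i in range(n_inputs):          # each input "stamps" the kernel...
        for k in range(kernel):        # ...onto `kernel` consecutive outputs
            counts[i * stride + k] += 1
    return counts


def has_uniform_overlap(kernel, stride, n_inputs=8):
    """True if every interior output position gets the same number of taps."""
    counts = overlap_counts(n_inputs, kernel, stride)
    interior = counts[kernel:-kernel]  # ignore edge effects at both ends
    return len(set(interior)) == 1


for kernel in (3, 4, 5):
    print(kernel, has_uniform_overlap(kernel, stride=2))
```

With stride 2, kernel sizes 3 and 5 both give alternating overlap while 4 gives uniform overlap, which is consistent with the recollection that a larger kernel alone does not help and the stride relationship is what matters.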

We've been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers which would give a cleaner answer to this. It'd be somewhat time consuming though, so curious to hear how interesting you and other commenters would find this result so we can prioritize.

I personally would find it interesting but I don't know how important it is. It seems likely that you might find a completely different-looking adversarial attack, but would that be conclusive? There would be so many things that change between a CNN KataGo and a from-scratch ViT KataGo. Especially if you are right that Timbers et al find a completely different adversarial attack in their AlphaZero which AFAIK still uses CNNs. Maybe you could find many different attacks if you change up enough hyperparameters or initializations.

On the gripping hand, now that I look at this earlier version, their description of it as a weird glitch in AZ's evaluation of pass moves at the end of the game sounds an awful lot like your first Tromp-Taylor pass exploit ie. it could probably be easily fixed with some finetuning. And in that case, perhaps Timbers et al would have found the 'circle' exploit in AZ after all if they had gotten past the end-game pass-related exploit? (This also suggests a weakness in the search procedures: it really ought to produce more than one exploit, preferably a whole list of distinct exploits. Some sort of PBT or novelty search approach perhaps...)

Maybe a mechanistic interpretability approach would be better: if you could figure out where in KataGo it screws up the value estimate so badly, and what edits are necessary to make it yield the correct estimate,
