Oliver Sourbut
  • Autonomous Systems @ UK AI Safety Institute (AISI)
  • DPhil AI Safety @ Oxford (Hertford college, CS dept, AIMS CDT)
  • Former senior data scientist and software engineer + SERI MATS

I'm particularly interested in sustainable collaboration and the long-term future of value. I'd love to contribute to a safer and more prosperous future with AI! Always interested in discussions about axiology, x-risks, s-risks.

I enjoy meeting new perspectives and growing my understanding of the world and the people in it. I also love to read - let me know your suggestions! In no particular order, here are some I've enjoyed recently

  • Ord - The Precipice
  • Pearl - The Book of Why
  • Bostrom - Superintelligence
  • McCall Smith - The No. 1 Ladies' Detective Agency (and series)
  • Melville - Moby-Dick
  • Abelson & Sussman - Structure and Interpretation of Computer Programs
  • Stross - Accelerando
  • Graeme - The Rosie Project (and trilogy)

Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites

  • Hanabi (can't recommend enough; try it out!)
  • Pandemic (ironic at time of writing...)
  • Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)
  • Overcooked (my partner and I enjoy the foody themes and frantic realtime coordination playing this)

People who've got to know me only recently are sometimes surprised to learn that I'm a pretty handy trumpeter and hornist.


Breaking Down Goal-Directed Behaviour

Wiki Contributions


This is great, and thanks for pointing at this confusion, and raising the hypothesis that it could be a confusion of language! I also have this sense.

I'd strongly agree that separating out 'deception' per se is importantly different from more specific phenomena. Deception is just, yes, obviously this can and does happen.

I tend to use 'deceptive alignment' slightly more broadly - i.e. something could be deceptively aligned post-training, even if all updates after that point are 'in context' or whatever analogue is relevant at that time. Right? This would be more than 'mere' deception, if it's deception of operators or other-nominally-in-charge-people regarding the intentions (goals, objectives, etc) of the system. Also doesn't need to be 'net internal' or anything like that.

I think what you're pointing at here by 'deceptive alignment' is what I'd call 'training hacking', which is more specific. In my terms, that's deceptive alignment of a training/update/selection/gating/eval process (which can include humans or not), generally construed to be during some designated training phase, but could also be ongoing.

No claim here to have any authoritative ownership over those terms, but at least as a taxonomy, those things I'm pointing at are importantly distinct, and there are more than two of them! I think the terms I use are good.

FWIW my take is that the evolution-ML analogy is generally a very excellent analogy, with a bunch of predictive power, but worth using carefully and sparingly. Agreed that sufficient detail on e.g. DL specifics can screen off the usefulness of the analogy, but it's very unclear whether we have sufficient detail yet. The evolution analogy was originally supposed to point out that selecting a bunch for success on thing-X doesn't necessarily produce thing-X-wanters (which is obviously true, but apparently not obvious enough to always be accepted without providing an example).

I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.

Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all[1], whereas evolution basically is. But brains are definitely doing something like temporal-difference learning, and the overall 'serial depth' thing is also weakly in favour of brains ~= DL vs genomes+selection ~= DL.

I'd love to know what you're referring to by this:

evolution... is fine with a mutation that leads to 10^7 serial ops if it's metabolic costs are low.


Is this a prediction that a cyclic learning rate -- that goes up and down -- will work out better than a decreasing one? If so, that seems false, as far as I know.

I think the jury is still out on this, but there's literature on it (probably much more I haven't fished out). [EDIT: also see this comment which has some other examples]

  1. AFAIK there's no evidence of this and it would be somewhat surprising to find it playing a major role. Then again, I also wouldn't be surprised if it turned out that brains are doing something which is secretly sort of equivalent to gradient descent. ↩︎

The problem is that this advantage can oscillate forever.

This is a pretty standard point in RL textbooks. But the culprit is the learning rate (which you set to be 1 in the example, but you can construct a nonconverging case for any constant )! The advantage definition itself is correct and non-oscillating, it's the estimation of the expectation using a moving average which is (sometimes) at fault.

Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.

I like the philosophical and strategic take here: let's avoid wireheading, arbitrary reinforcement strength is risky[1], hopefully we can get some values-caring-about-human-stuff.

The ACTDE seems potentially a nice complement/alternative to entropy[2] regularisation for avoiding mode collapse (I haven't evaluated deeply). I think you're misdiagnosing a few things though.

Overall I think the section about oscillating advantage/value estimation is irrelevant (interesting, but unrelated), and I think you should point the finger less at PPO and advantage estimation per se and more at exploration at large. And you might want to flag that too much exploration/randomness can also be an issue!

  1. Though note that ideally, once we actually know with confidence what is best, we should be near-greedy about it, rather than softmaxing! Say it was 'ice cream' vs 'slap in the face'. I would infinitely (linearly in time) regret softmaxing over that for eternity. As it stands I think humanity is very far from being able to safely aggressively greedily optimise really important things, but this is at least a consideration to keep in mind. ↩︎

  2. Incidentally, KL divergence regularisation is not primarily for avoiding mode collapse AFAIK, it's for approximate trust region constraints - which may incidentally help to avoid mode collapse by penalising large jumps away from initially-high-entropy policies. See the TRPO paper. Entropy regularisation directly addresses mode collapse. ↩︎

Strong agree with the need for nuance. 'Model' is another word that gets horribly mangled a lot recently.

I think the more sensible uses of the word 'agent' I've come across are usually referring to the assemblage of a policy-under-training plus the rest of the shebang: learning method, exploration tricks of one kind or another, environment modelling (if any), planning algorithm (if any) etc. This seems more legit to me, though I still avoid using the word 'agent' as far as possible for similar reasons (discussed here (footnote 6) and here).

Similarly to Daniel's response to 'reward is not the optimization target' I think you can be more generous in your interpretation of RL experts' words and read less error in. That doesn't mean that more care in communication and terminology would be preferable, which is a takeaway I strongly endorse.

Really enjoyed this post, both aesthetically (I like evolution and palaeontology, and obviously AI things!) and as a motivator for some lines of research and thought.

I had a go at one point connecting natural selection with gradient descent which you might find useful depending on your aims.

I also collected some cases of what I think are potentially convergent properties of 'deliberating systems', many of them natural, and others artificial. Maybe you'll find those useful, and I'd love to know to what extent you agree or disagree with the concepts there.

This was a great read. Thanks in particular for sharing some introspection on motivation and thinking processes leading to these findings!

Two thoughts:

First, I sense that you're somewhat dissatisfied with using total variation distance ('average action probability change') as a qualitative measure of the impact of an intervention on behaviour. In particular, it doesn't weight 'meaningfulness', and important changes might get washed out by lots of small changes in unimportant cells. When we visualise, I think we intuitively do something richer, but in order to test at scale, visualisation becomes a bottleneck, so you need something quantitative like this. Perhaps you might get some mileage by considering the stationary distribution of the policy-induced Markov chain? It can be approximated by multiplying the transition matrix by itself a few times! Obviously that matrix is technically quadratic size in state count, but it's also very sparse :) so that might be relatively tractable given that you've already computed a NN forward pass for each state by to get to this point. Or you could eigendecompose the transition matrix.

Second, this seems well-informed to me, but I can't really see the connection to (my understanding of) shard theory here, other than it being Team Shard! Maybe that'll be clearer in a later post.

I think Quintin[1] is maybe alluding to the fact that in the limit of infinite counterfactual exploration then sure, the gradient in sample-based policy gradient estimation will push in that direction. But we don't ever have infinite exploration (and we certainly don't have counterfactual exploration; though we come very close in simulations with resets) so in pure non-lookahead (e.g. model free) sample-based policy gradient estimation, an action which has never been tried can not be reinforced (except as a side effect of generalisation by function approximation).

This seems right to me and it's a nuance I've raised in a few conversations in the past. On the other hand kind of half the point of RL optimisation algorithms is to do 'enough' exploration! And furthermore (as I mentioned under Steven's comment) I'm not confident that such simplistic RL is the one that will scale to AGI first. cf various impressive results from DeepMind over the years which use lots of shenanigans besides plain old sample-based policy gradient estimation (including model-based lookahead as in the Alpha and Mu gang). But maybe!

  1. This is a guess and I haven't spoken to Quintin about this - Quintin, feel free to clarify/contradict ↩︎

  1. Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),
  2. The genome solves information inaccessibility in some way we cannot replicate for AI alignment, or
  3. The genome cannot directly address the vast majority of interesting human cognitive events, concepts, and properties. (The point argued by this essay)

In my opinion, either (1) or (3) would be enormous news for AI alignment

What do you mean by 'enormous news for AI alignment'? That either of these would be surprising to people in the field? Or that resolving that dilemma would be useful to build from? Or something else?

FWIW from my POV the trilemma isn't, because I agree that (2) is obviously not the case in principle (subject to enough research time!). And I further think it reasonably clear that both (1) and (3) are true in some measure. Granted you say 'at least one' must be true, but I think the framing as a trilemma suggests you want to dismiss (1) - is that right?

I'll bite those bullets (in devil's advocate style)...

  • I think about half of your bullets are probably (1), except via rough proxies (power, scamming, family, status, maybe cheating)
    • why? One clue is that people have quite specific physiological responses to some of these things. Another is that various of these are characterised by different behaviour in different species.
    • why proxies? It stands to reason, like you're pointing out here, it's hard and expensive to specify things exactly. Further, lots of animal research demonstrates hardwired proxies pointing to runtime-learned concepts
  • Sunk cost, framing, and goal conflation smell weird to me in this list - like they're the wrong type? I'm not sure what it would mean for these to be 'detected' and then the bias 'implemented'. Rather I think they emerge from failure of imagination due to bounded compute.
    • in the case of goals I think that's just how we're implemented (it's parsimonious)
      • with the possible exception of 'conscious self approval' as a differently-typed and differently-implemented sole terminal goal
      • other goals at various levels of hierarchy, strength, and temporal extent get installed as we go
  • ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions
    • tentatively, I expect cells and atoms probably have similar representation to ghosts and spirits and numbers and ecosystems and whatnot - they're just abstractions and we have machinery which forms and manipulates them
      • admittedly this machinery is basically magic to me at this point
  • wireheading and reality/non-reality are unclear to me and I'm looking forward to seeing where you go with it
    • I suspect all imagined circumstances ('real' or non-real) go via basically the same circuitry, and that 'non-real' is just an abstraction like 'far away' or 'unlikely'
      • after all, any imagined circumstances is non-real to some extent

Another aesthetic similarity which my brain noted is between your concept of 'information loss' on inputs for layers-which-discriminate and layers-which-don't and the concept of sufficient statistics.

A sufficient statistic is one for which the posterior is independent of the data , given the statistic

which has the same flavour as

In the respective cases, and are 'sufficient' and induce an equivalence class between s

Load More