Oliver Sourbut

Call me Oliver or Oly - I don't mind which.

I'm particularly interested in sustainable collaboration and the long-term future of value. I'd love to contribute to a safer and more prosperous future with AI! Always interested in discussions about axiology, x-risks, s-risks.

I'm currently (2023) embarking on a PhD in AI in Oxford (Hertford College), and also spend time in (or in easy reach of) London. Until recently I was working as a senior data scientist and software engineer, and doing occasional AI alignment research with SERI.

I enjoy meeting new perspectives and growing my understanding of the world and the people in it. I also love to read - let me know your suggestions! In no particular order, here are some I've enjoyed recently:

  • Ord - The Precipice
  • Pearl - The Book of Why
  • Bostrom - Superintelligence
  • McCall Smith - The No. 1 Ladies' Detective Agency (and series)
  • Melville - Moby-Dick
  • Abelson & Sussman - Structure and Interpretation of Computer Programs
  • Stross - Accelerando
  • Simsion - The Rosie Project (and trilogy)

Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites:

  • Hanabi (can't recommend enough; try it out!)
  • Pandemic (ironic at time of writing...)
  • Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)
  • Overcooked (my partner and I enjoy the foody themes and frantic realtime coordination playing this)

People who've got to know me only recently are sometimes surprised to learn that I'm a pretty handy trumpeter and hornist.


Breaking Down Goal-Directed Behaviour



The problem is that this advantage can oscillate forever.

This is a pretty standard point in RL textbooks. But the culprit is the learning rate (which you set to be 1 in the example, but you can construct a nonconverging case for any constant learning rate)! The advantage definition itself is correct and non-oscillating; it's the estimation of the expectation using a moving average which is (sometimes) at fault.
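The learning-rate point can be illustrated with a minimal sketch. It assumes a hypothetical reward stream that alternates between 1 and 0 (this is not the example from the post), and the names `running_estimates`, `constant` and `decaying` are made up for illustration. A moving average with a constant step size keeps bouncing between two values, while a step size of 1/n settles on the mean.

```python
def running_estimates(rewards, step_size):
    """Track a moving-average estimate v <- v + alpha * (r - v)."""
    v = 0.0
    history = []
    for n, r in enumerate(rewards, start=1):
        alpha = step_size(n)  # step size may depend on the sample count n
        v += alpha * (r - v)
        history.append(v)
    return history


# Hypothetical reward stream: 1, 0, 1, 0, ... (true mean is 0.5).
rewards = [1.0, 0.0] * 5000

constant = running_estimates(rewards, lambda n: 0.1)      # constant learning rate
decaying = running_estimates(rewards, lambda n: 1.0 / n)  # sample-average learning rate

# Size of the jump between the last two estimates in each case.
constant_gap = abs(constant[-1] - constant[-2])
decaying_gap = abs(decaying[-1] - decaying[-2])
print(constant_gap, decaying_gap)
```

For this stream the constant-rate estimate ends up alternating between two fixed values whose gap is alpha / (2 - alpha), which is nonzero for every constant alpha in (0, 1]; with alpha = 1 the estimate simply copies the latest reward.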

Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.

I like the philosophical and strategic take here: let's avoid wireheading, arbitrary reinforcement strength is risky[1], hopefully we can get some values-caring-about-human-stuff.

The ACTDE seems potentially a nice complement/alternative to entropy[2] regularisation for avoiding mode collapse (I haven't evaluated deeply). I think you're misdiagnosing a few things though.

Overall I think the section about oscillating advantage/value estimation is irrelevant (interesting, but unrelated), and I think you should point the finger less at PPO and advantage estimation per se and more at exploration at large. And you might want to flag that too much exploration/randomness can also be an issue!

  1. Though note that ideally, once we actually know with confidence what is best, we should be near-greedy about it, rather than softmaxing! Say it was 'ice cream' vs 'slap in the face'. I would infinitely (linearly in time) regret softmaxing over that for eternity. As it stands I think humanity is very far from being able to safely aggressively greedily optimise really important things, but this is at least a consideration to keep in mind. ↩︎

  2. Incidentally, KL divergence regularisation is not primarily for avoiding mode collapse AFAIK, it's for approximate trust region constraints - which may incidentally help to avoid mode collapse by penalising large jumps away from initially-high-entropy policies. See the TRPO paper. Entropy regularisation directly addresses mode collapse. ↩︎
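To make the distinction in the second footnote concrete, here is a small sketch using made-up categorical policies over four actions; the function names and the example distributions are assumptions for illustration, and the direction of the KL term varies between algorithms. It shows that an entropy term looks only at the current policy, whereas a KL term measures distance from a reference policy, so it is zero for an already-collapsed policy that has not moved.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # skip zero-probability entries (0 * log 0 = 0)
    return float(-(p[nz] * np.log(p[nz])).sum())


def kl(p, q):
    """KL(p || q) for categorical distributions with q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / q[nz])).sum())


uniform = [0.25, 0.25, 0.25, 0.25]  # high-entropy policy
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly collapsed policy

print("entropy(uniform):", entropy(uniform))
print("entropy(peaked):", entropy(peaked))
print("KL(peaked || uniform):", kl(peaked, uniform))
print("KL(peaked || peaked):", kl(peaked, peaked))
```

A KL penalty only discourages collapse to the extent that the reference policy is itself high-entropy, which matches the 'incidentally' in the footnote.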

Strong agree with the need for nuance. 'Model' is another word that has been getting horribly mangled a lot recently.

I think the more sensible uses of the word 'agent' I've come across are usually referring to the assemblage of a policy-under-training plus the rest of the shebang: learning method, exploration tricks of one kind or another, environment modelling (if any), planning algorithm (if any) etc. This seems more legit to me, though I still avoid using the word 'agent' as far as possible for similar reasons (discussed here (footnote 6) and here).

Similarly to Daniel's response to 'reward is not the optimization target' I think you can be more generous in your interpretation of RL experts' words and read less error in. That doesn't mean that more care in communication and terminology wouldn't be preferable, which is a takeaway I strongly endorse.

Really enjoyed this post, both aesthetically (I like evolution and palaeontology, and obviously AI things!) and as a motivator for some lines of research and thought.

I had a go at one point connecting natural selection with gradient descent which you might find useful depending on your aims.

I also collected some cases of what I think are potentially convergent properties of 'deliberating systems', many of them natural, and others artificial. Maybe you'll find those useful, and I'd love to know to what extent you agree or disagree with the concepts there.

This was a great read. Thanks in particular for sharing some introspection on motivation and thinking processes leading to these findings!

Two thoughts:

First, I sense that you're somewhat dissatisfied with using total variation distance ('average action probability change') as a qualitative measure of the impact of an intervention on behaviour. In particular, it doesn't weight 'meaningfulness', and important changes might get washed out by lots of small changes in unimportant cells. When we visualise, I think we intuitively do something richer, but in order to test at scale, visualisation becomes a bottleneck, so you need something quantitative like this. Perhaps you might get some mileage by considering the stationary distribution of the policy-induced Markov chain? It can be approximated by multiplying the transition matrix by itself a few times! Obviously that matrix is technically quadratic in size in the state count, but it's also very sparse :) so that might be relatively tractable given that you've already computed a NN forward pass for each state to get to this point. Or you could eigendecompose the transition matrix.
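A rough sketch of both suggestions, assuming a made-up three-state transition matrix and a dense NumPy array (a real maze would want a sparse representation). The function names, the toy policies and the idea of weighting per-state total variation by the stationary distribution are illustrative assumptions, not something taken from the post.

```python
import numpy as np


def stationary_by_squaring(P, n_squarings=10):
    """Approximate the stationary distribution by repeatedly squaring P.

    After k squarings the matrix is P^(2^k); for an ergodic chain every
    row approaches the stationary distribution.
    """
    M = np.array(P, dtype=float)
    for _ in range(n_squarings):
        M = M @ M
    row = M[0]
    return row / row.sum()  # renormalise to absorb floating-point drift


def stationary_by_eig(P):
    """Stationary distribution as the left eigenvector with eigenvalue 1."""
    vals, vecs = np.linalg.eig(np.asarray(P, dtype=float).T)
    i = np.argmin(np.abs(vals - 1.0))
    v = np.real(vecs[:, i])
    return v / v.sum()


def weighted_tv(policy_a, policy_b, state_weights):
    """Per-state total variation distance, averaged with the given weights."""
    per_state_tv = 0.5 * np.abs(policy_a - policy_b).sum(axis=1)
    return float(state_weights @ per_state_tv)


# Hypothetical row-stochastic transition matrix induced by some policy.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

pi_sq = stationary_by_squaring(P)
pi_eig = stationary_by_eig(P)

# Two toy policies (rows: states, columns: actions) differing only in state 2.
policy_a = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
policy_b = np.array([[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]])

uniform_tv = weighted_tv(policy_a, policy_b, np.full(3, 1 / 3))
visit_tv = weighted_tv(policy_a, policy_b, pi_sq)
print(pi_sq, pi_eig, uniform_tv, visit_tv)
```

In this toy case the only changed state is rarely visited, so the visitation-weighted distance comes out smaller than the uniform average; whether visitation frequency is a good proxy for 'meaningfulness' is a separate question.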

Second, this seems well-informed to me, but I can't really see the connection to (my understanding of) shard theory here, other than it being Team Shard! Maybe that'll be clearer in a later post.

I think Quintin[1] is maybe alluding to the fact that in the limit of infinite counterfactual exploration then sure, the gradient in sample-based policy gradient estimation will push in that direction. But we don't ever have infinite exploration (and we certainly don't have counterfactual exploration; though we come very close in simulations with resets) so in pure non-lookahead (e.g. model-free) sample-based policy gradient estimation, an action which has never been tried cannot be reinforced (except as a side effect of generalisation by function approximation).

This seems right to me and it's a nuance I've raised in a few conversations in the past. On the other hand, kind of half the point of RL optimisation algorithms is to do 'enough' exploration! And furthermore (as I mentioned under Steven's comment) I'm not confident that such simplistic RL is the one that will scale to AGI first. Cf. various impressive results from DeepMind over the years which use lots of shenanigans besides plain old sample-based policy gradient estimation (including model-based lookahead as in the Alpha and Mu gang). But maybe!

  1. This is a guess and I haven't spoken to Quintin about this - Quintin, feel free to clarify/contradict ↩︎

  1. Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),
  2. The genome solves information inaccessibility in some way we cannot replicate for AI alignment, or
  3. The genome cannot directly address the vast majority of interesting human cognitive events, concepts, and properties. (The point argued by this essay)

In my opinion, either (1) or (3) would be enormous news for AI alignment

What do you mean by 'enormous news for AI alignment'? That either of these would be surprising to people in the field? Or that resolving that dilemma would be useful to build from? Or something else?

FWIW from my POV the trilemma isn't, because I agree that (2) is obviously not the case in principle (subject to enough research time!). And I further think it reasonably clear that both (1) and (3) are true in some measure. Granted you say 'at least one' must be true, but I think the framing as a trilemma suggests you want to dismiss (1) - is that right?

I'll bite those bullets (in devil's advocate style)...

  • I think about half of your bullets are probably (1), except via rough proxies (power, scamming, family, status, maybe cheating)
    • why? One clue is that people have quite specific physiological responses to some of these things. Another is that various of these are characterised by different behaviour in different species.
    • why proxies? It stands to reason, like you're pointing out here, it's hard and expensive to specify things exactly. Further, lots of animal research demonstrates hardwired proxies pointing to runtime-learned concepts
  • Sunk cost, framing, and goal conflation smell weird to me in this list - like they're the wrong type? I'm not sure what it would mean for these to be 'detected' and then the bias 'implemented'. Rather I think they emerge from failure of imagination due to bounded compute.
    • in the case of goals I think that's just how we're implemented (it's parsimonious)
      • with the possible exception of 'conscious self approval' as a differently-typed and differently-implemented sole terminal goal
      • other goals at various levels of hierarchy, strength, and temporal extent get installed as we go
  • ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions
    • tentatively, I expect cells and atoms probably have similar representation to ghosts and spirits and numbers and ecosystems and whatnot - they're just abstractions and we have machinery which forms and manipulates them
      • admittedly this machinery is basically magic to me at this point
  • wireheading and reality/non-reality are unclear to me and I'm looking forward to seeing where you go with it
    • I suspect all imagined circumstances ('real' or non-real) go via basically the same circuitry, and that 'non-real' is just an abstraction like 'far away' or 'unlikely'
      • after all, any imagined circumstance is non-real to some extent

Another aesthetic similarity which my brain noted is between your concept of 'information loss' on inputs for layers-which-discriminate and layers-which-don't and the concept of sufficient statistics.

A sufficient statistic is one for which the posterior is independent of the data, given the statistic

which has the same flavour as

In the respective cases, and are 'sufficient' and induce an equivalence class between s
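For reference, the textbook Bayesian statement of sufficiency can be written as below. The symbols are chosen here for illustration and are not necessarily the ones used in the original comment: x is the data, T is the statistic, and θ is the parameter being inferred.

```latex
% Sufficiency: the posterior depends on the data only through T(x).
p(\theta \mid x) = p(\theta \mid T(x)) \quad \text{for all } x

% Induced equivalence relation on data points:
x \sim x' \quad \text{whenever} \quad T(x) = T(x')
```

Any two datasets in the same equivalence class lead to the same posterior, which is the sense in which the statistic discards only irrelevant information.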

Regarding your empirical findings which may run counter to the question

  1. Is manifold dimensionality actually a good predictor of which solution will be found?

I wonder if there's a connection to asymptotic equipartitioning - it may be that the 'modal' (most 'voluminous' few) solution basins are indeed higher-rank, but that they are in practice so comparatively few as to contribute negligible overall volume?

This is a fuzzy tentative connection made mostly on the basis of aesthetics rather than a deep technical connection I'm aware of.

Interesting stuff! I'm still getting my head around it, but I think implicit in a lot of this is that loss is some quadratic function of 'behaviour' - is that right? If so, it could be worth spelling that out. Though maybe in a small neighbourhood of a local minimum this is approximately true anyway?
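The 'approximately true in a small neighbourhood' intuition is the usual second-order Taylor expansion, sketched here in parameter space rather than in the post's 'behaviour' space. The symbols θ* (a local minimum) and H (the Hessian there) are introduced for illustration, and the loss is assumed to be twice differentiable.

```latex
% Second-order expansion around a local minimum \theta^*,
% where the gradient term vanishes because \nabla L(\theta^*) = 0.
L(\theta) \approx L(\theta^*)
  + \tfrac{1}{2} (\theta - \theta^*)^{\top} H \, (\theta - \theta^*),
\qquad H = \nabla^2 L(\theta^*)
```

At a local minimum H is positive semi-definite, so the approximation is a (possibly degenerate) quadratic bowl; how good it is depends on how small the neighbourhood is and on the higher-order terms.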

This also brings to mind the question of what happens when we're in a region with no local minimum (e.g. saddle points all the way down, or asymptoting to a lower loss, etc.)
