Senior research scholar at FHI. My current research interests are mainly the behaviour and interactions of boundedly rational agents, complex interacting systems, and strategies to influence the long-term future, with focus on AI alignment.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.

Sorry for being snarky, but I think at least some LW readers should gradually notice the extent to which the stuff analyzed here mirrors the predictive processing paradigm, as a different way to make stuff which acts in the world. My guess is the big step on the road in this direction is not e.g. 'complex wrappers with simulated agents', but reinventing active inference... and I also suspect it's the only step separating us from AGI, which seems like a good reason not to point too much attention that way.

It is not clear to me to what extent this was part of the "training shoulder advisors" exercise, but to me, possibly the most important part of it is to keep the advisors at a distance from your own thinking. In particular, my impression is that alignment research has on average been harmed by too many people "training their shoulder Eliezers", with the shoulder advisors pushing them to think in a crude version of Eliezer's ontology.

The upside of this, or of "more is different", is that we don't necessarily even need the property in the parts, or a detailed understanding of the parts. And how the composition works / what survives renormalization / ... is almost the whole problem.


This seems to be almost exclusively based on the proxies of humans and human institutions. Reasons why this does not necessarily generalize to advanced AIs are often visible when looking from the perspective of other proxies, e.g. programs or insects.


So far, progress in ML has often followed this pattern:

1. ML models sort of suck, maybe help a bit sometimes. Humans are clearly better ("humans better").
2. ML models get overall comparable to humans, but have different strengths and weaknesses; human+AI teams beat both best AIs alone, or best humans alone ("age of cyborgs")
3. Human inputs just mess up superior AI suggestions ("age of AIs")

(chess, go, creating nice images, and poetry seem to be at different stages of this sequence)

This seems to lead to a different intuition than the lawyer-owner case. 

Also: the designer-engineer and lawyer-owner problems both seem related to a communication bottleneck between two human brains.

With the exception of some relatively recent and isolated pockets of research on embedded agency (e.g., Orseau & Ring, 2012; Garrabrant & Demsky, 2018), most attempts at formal descriptions of living rational agents — especially utility-theoretic descriptions — are missing the idea that living systems require and maintain boundaries.

While I generally like the post, I somewhat disagree with this summary of the state of understanding, which seems to ignore quite a lot of academic research. In particular:

- Friston et al certainly understand this (cf ... dozens to hundreds of papers claiming and explaining the importance of boundaries for living systems)
- the whole autopoiesis field
- various biology-inspired papers (e.g. this)

I do agree this way of thinking is less common among people stuck too much in the VNM basin, such as most of econ or most of game theory.


I would correct "Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did not have to start from scratch, and already had a reasonably complex 'API' for interoceptive variables."

from the summary to something like this

"Therefore, in laying down motivational circuitry in our ancient ancestors, evolution did not have to start locating 'goals' and relevant world-features in the learned world models. Instead, it re-used the existing goal-specifying circuits, and implicit world-models, existing in older organisms. Most of the goal specification is done via "binding" the older and newer world-models in some important variables. From within the newer circuitry, an important part of the "API" between the models is interoception."

(Another way to think about it: imagine a more blurry line between a "sensory signal" and a "reward signal")

<sociology of AI safety rant>

So, if an Everett-branches traveller told me "well, you know, MIRI folks had the best intentions, but in your branch, made the field pay attention to unproductive directions, and this made your civilization more confused and alignment harder" and I had to guess "how?", one of the top choices would be ongoing strawmanning and misrepresentation of Eric Drexler's ideas.


To me, CAIS thinking seems quite different from the description in the OP.

Some statements, without much justification/proof:

- Modularity is a pretty powerful principle/law of intelligent systems. If you look around the world, you see modularity everywhere. Actually, in my view you will see more "modularity" than "rational agency", suggesting the gods of modularity are often stronger than the gods of rational agency. Modularity would help us a lot, in contrast to integrated homogeneous agents => one of the key directions of AI safety should be figuring out how to summon the Gods of modularity

- Modularity helps with interpretability; once you have "modules" and "interfaces", you have a much better shot at understanding what's going on by looking at the interfaces. (For an intuitive feel: imagine you want to make the plotting and scheming of three people much more legible, and you can impose this constraint: they need to do all communication between them on Slack, which you can read)

- Any winning strategy needs to solve global coordination at some point, otherwise people will just call a service to destroy the world. Solutions of the type "your aligned superintelligent agent takes over the world and coordinates everyone" are dangerous and won't work; a superintelligent agent able to take over the world is something you don't want to bring into existence, and in contrast, you need a security layer to prevent anyone from even attempting that

- There are multiple hard problems; attempting to solve them all at once in one system is not the right tactic. In practice, we want to isolate some problems into separate "modules" or "services" - for example, we want a separate "research and development service", "security service", "ontology mapping service",...

- Many hard problems don't disappear, but there are also technical solutions for them [e.g. distillation]


This seems partially right, partially confused in an important way.

As I tried to point people to years ago, how this works is ... a quite complex process, where some higher-level modelling ("I see a lion") leads to a response in lower levels connected to body states, some chemicals are released, and this interoceptive sensation is re-integrated in the higher levels.

I will try to paraphrase/expand in a longer form.

The genome had already discovered a ton of cybernetics before inventing neocortex-style neural nets.

Consider e.g. the problem of morphogenesis - that is, how one cell replicates to something like a quadrillion cells in an elephant, which end up reliably forming some body shape and cooperating in a highly complex way: it's a really impressive and hard optimization problem.

Inspired by Levine, I'm happy to argue it is also impossible without discovering a lot of powerful stuff from information theory and cybernetics, including various regulatory circuits, complex goal specifications, etc.

Note that there are many organisms without neural nets which still seek reproduction, avoid danger, look for food, move in complex environments, and in general, live using fairly complex specifications of evolutionarily relevant goals.

This implies the genome had complex circuitry specifying many/most of the goal states it cares about before it invented the predictive processing brain.

Given this, what the genome did when developing the brain's predictive processing machinery likely wasn't trying to hook things up to "raw sensory inputs", but to hook up the PP machinery to the existing cybernetic regulatory systems, often broadly localized "in the body".

From the PP-brain-centric viewpoint, the variables of this evolutionarily older control system come in via a "sense" of interoception.

The very obvious hack which the genome is using in encoding goals to the PP machinery is specifying the goals mostly in interoceptive variables, utilizing the existing control circuits.

Predictive processing / active inference then goes on to build a complex world model and execute complex goal-oriented behaviours.

How these desirable states are encoded was called agenty subparts by me, but according to Friston, it is basically the same thing as what he calls "fixed priors": as a genome, you for example "fix the prior" on the variable "hunger" to "not being hungry". (Note that a lot of the specification of what "hunger" is, is done by the older machinery.) Generic predictive processing principles then build you circuitry "around" this "fixed prior" which e.g. cares about objects in the world which are food. (Using the intentional stance, the fixed variable + the surrounding control circuits look like a sub-agent of the human, hence the alternative agenty subpart view)

- the genome solves the problem of aligning the predictive processing neural nets by creating a bunch of agenty subparts/fixed priors, caring about specific variables in the predictive processing world model. PP/active inference deals with how this translates to sensing and action.
- however, many critical variables used for this are not sensory inputs, but interoceptive variables, extracted from a quite complex computation

This allows the genome to point to stuff like sex or love for family relatively easily, and build "subagents" caring for this. Building complex policies out of this is then left to predictive-processing-style interactions.

Whether you would count this as "direct" or "indirect" seems unclear.

With the last point: I think I can roughly pass your ITT - we can try that, if you are interested.

So, here is what I believe are your beliefs

  • With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be in some meaningful sense fundamentally different from e.g. Gato
  • From your perspective, this is based on thinking deeply about the nature of such systems (note that this is mostly based on hypothetical systems, and an analogy with evolution)
  • My claim roughly is that this is only part of what's going on, where the actual thing is: people start with a deep prior on "continuity in the space of intelligent systems". Looking into a specific question about hypothetical systems, their search in argument space is guided by this prior, and they end up mostly sampling arguments supporting their prior. (This is not to say the arguments are wrong.)
  • You probably don't agree with the above point, but notice the correlations:
    • You expect a sharp left turn due to discontinuity in the "architectures" dimension (which is the crux according to you)
    • But you also expect jumps in the capabilities of individual systems (at least I think so)
    • Also, you expect the majority of hope to lie in "sharp right turn" histories (in contrast to smooth right turn histories)
    • And more
  • In my view, your (or rather MIRI-esque) views on the above dimensions are correlated more than expected, which suggests the existence of a hidden variable/hidden model explaining the correlation.

I personally think that a large majority of humanity's hope lies in someone executing a pivotal act. But I assume Critch disagrees with this, and holds a view closer to 1+2+3.

If so, then I think he shouldn't go "well, pivotal acts sound weird and carry some additional moral hazards, so I will hereby push for pivotal acts to become more stigmatized and hard to talk about, in order to slightly increase our odds of winning in the worlds where pivotal acts are unnecessary".

Rather, I think hypothetical-Critch should promote the idea of pivotal processes, and try to reduce any existing stigma around the idea of pivotal acts, so that humanity is better positioned to evade destruction if we do end up needing to do a pivotal act. We should try to set ourselves up to win in more worlds.

Can't speak for Critch, but my view is that pivotal acts planned as pivotal acts, in the way most people in the LW community think about them, have only a very small chance of being the solution. (My guess is one or two bits more extreme, more like 2-5% than 10%.)
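
To spell out the "bits" arithmetic: a minimal sketch, assuming "one bit more extreme" means halving the odds, and taking 10% as the starting point only because that is the number the comparison mentions. The function name `shift_by_bits` is made up for illustration.

```python
def shift_by_bits(p: float, bits: float) -> float:
    """Make probability p less likely by `bits` bits, i.e. halve the odds once per bit."""
    odds = p / (1.0 - p)              # 10% corresponds to odds of 1:9
    new_odds = odds / (2.0 ** bits)   # each bit halves the odds (assumed reading of "bits")
    return new_odds / (1.0 + new_odds)


for bits in (1, 2):
    # One bit down from 10% gives 1:18 odds, two bits gives 1:36 odds.
    print(bits, round(shift_by_bits(0.10, bits), 3))
# prints:
# 1 0.053
# 2 0.027
```

So one to two bits below 10% lands at roughly 5% and 3%, consistent with the 2-5% range.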

I'm not sure if I agree with you re: the stigma. My impression is that while the broader world doesn't think in terms of pivotal acts, if it paid more attention, yes, many proposals would be viewed with suspicion. On the other hand, I think on LW it's the opposite: many people share the orthodox views about sharp turns, pivotal acts, etc., and proposals to steer the situation more gently are viewed as unworkable or as engaging in thinking with "too optimistic assumptions" etc.

Note that I advocate for considering much weirder solutions, and also thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.

Continuity assumptions are about what's likely to happen, not about what's desirable. It would be a separate assumption to say "continuity is always good", and I worry that a reasoning error is occurring if this is being conflated with "continuity tends to occur".

Basically, no. Continuity assumptions are about what the space looks like. Obviously forecasting questions ("what's likely to happen") often depend on ideas about what the space looks like.

My claim is that pivotal acts are likely to be necessary for good outcomes, not that they're necessarily likely to occur. If your choices are "execute a pivotal act, or die", then insofar as you're confident this is the case, the base rate of continuous events just isn't relevant.

Yes, but your other claim is that a "sharp left turn" is likely and leads to bad outcomes. So if we partition the space of outcomes into good/bad, in both branches you assume it is very likely because of sharp turns.


The primary argument for hard takeoff isn't "stuff tends to be discontinuous"; it's "AGI is a powerful invention, and e.g. GPT-3 isn't a baby AGI". The discontinuity of hard takeoff is not a primitive; it's an implication of the claim that AGI is different from current AI tech, that it contains a package of qualitatively new kinds of cognition that aren't just 'what GPT-3 is currently doing, but scaled up'.

This is maybe becoming repetitive, but I'll try to paraphrase again. Consider the option that the "continuity assumptions" I'm talking about are not grounded in "takeoff scenarios", but in "how you think about hypothetical points in the abstract space of intelligent systems".

Thinking about features of this highly abstract space, in regions which don't exist yet, is epistemically tricky (I hope we can at least agree on that).

It probably seems to you that you have many strong arguments giving you reliable insights about how the space works somewhere around "AGI".

My claim is: "Yes, but the process which generated the arguments is based on a black-box neural net, which has a strong prior on things like "stuff like math is discontinuous"". (I suspect this "taste and intuition" box is located more in Eliezer's mind, and some other people updated "on the strength of arguments".) This isn't to imply various people haven't done a lot of thinking and generated a lot of arguments and intuitions about this. Unfortunately, given other epistemic constraints, in my view the "taste and intuitions" differences sort of "propagate" to "conclusion" differences.

In my view, in practice, the pivotal acts framing actually pushes people to consider a narrower space of discrete powerful actions, "sharp turns", "events that have a game-changing impact on astronomical stakes".

As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Where things in this category get straw-manned as "Rube-Goldberg-machine-like")

As often, one of the actual cruxes is in continuity assumptions, where basically you have a low prior on "smooth trajectory changes by many acts" and a high prior on "sharp turns left or right".

The second crux, as you note, is doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts, whereas people who are a few bits more optimistic may be much less excited about them, in particular if all plans for such acts have very unclear shapes of impact distributions.

Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I've a draft with a more explicit argument why.)
