My current research interests:
- alignment in complex, messy systems composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality

Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.



You are exactly right that active inference models which behave in a self-interested or any coherently goal-directed way must have something like an optimism bias.

My guess about what happens in animals and to some extent in humans: part of the 'sensory inputs' are interoceptive, tracking internal body variables like temperature, glucose levels, hormone levels, etc. Evolution already built a ton of 'control-theory-type circuits' on top of the body (how to build a body from a single cell is itself an extremely impressive optimization task...). This evolutionarily older circuitry likely encodes a lot about which body states evolution 'hopes for'. Subsequently, when building predictive/generative models and turning them into active inference, my guess is that a lot of the specification is done by 'fixing priors' on interoceptive inputs at values like 'not being hungry'. The later-learned structures then become a mix between beliefs and goals: e.g. the fixed prior on my body temperature leads, over my lifetime, to a model where I acquire a 'prior' about wearing a waterproof jacket when it rains, which is something between an optimistic belief and a 'preference'. (This retrodicts that a lot of human biases could be explained as 'beliefs' somewhere between 'how things are' and 'how it would be nice if they were'.)
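To make the 'fixed priors as goals' idea concrete, here is a toy sketch (my own illustration, not from any active inference framework; the temperature set-point and action models are made-up numbers): an agent with an unchangeable prior over an interoceptive variable can only reduce prediction error by acting, so the prior functions simultaneously as belief and preference.

```python
# Toy illustration (hypothetical numbers): a fixed prior over an
# interoceptive variable (body temperature) that the agent cannot update.
# The only way to reduce prediction error is to act, so the "belief"
# works as a goal.

FIXED_PRIOR_TEMP = 37.0  # set-point the genome "hopes for" (assumed)

def prediction_error(observed_temp: float) -> float:
    """Squared error between an observation and the fixed prior."""
    return (observed_temp - FIXED_PRIOR_TEMP) ** 2

def choose_action(current_temp: float, actions: dict) -> str:
    """Pick the action whose predicted outcome minimizes error
    against the fixed prior -- active inference, very crudely."""
    return min(actions, key=lambda a: prediction_error(actions[a](current_temp)))

# Hypothetical action models: predicted temperature after each action.
actions = {
    "put_on_jacket": lambda t: t + 2.0,
    "do_nothing":    lambda t: t,
    "remove_jacket": lambda t: t - 2.0,
}

print(choose_action(35.5, actions))  # a cold agent "prefers" the jacket
```

The point of the sketch is only that nothing labeled 'goal' appears anywhere; the fixed prior does all the work.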

But this suggests an approach to aligning embedded simulator-like models: Induce an optimism bias such that the model believes everything will turn out fine (according to our true values)

My current guess is that any approach to alignment which will actually lead to good outcomes must include some features suggested by active inference. E.g. active inference suggests that an 'aligned' agent which is trying to help me likely 'cares' about my 'predictions' coming true, and has some 'fixed priors' about me liking the results. This gives something that avoids both 'my wishes were satisfied, but in bizarre goodharted ways' and 'this can do more than I can'.

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- the majority of the claims are true, or at least approximately true
- "shard theory" as a social phenomenon reached a critical mass that made the ideas visible to the broader alignment community, which works e.g. via talking about them in person, votes on LW, a series of posts, ...
- shard theory coined a number of locally memetically fit names and phrases, such as 'shards'
- part of the success is that it led some people in the AGI labs to think about the mathematical structure of human values, which is an important problem

The downsides
- almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or in thinking about multi-agent models of mind
- the claims which are novel usually seem somewhat confused (e.g. that human values are inaccessible to the genome, or naive RL intuitions)
- the novel terminology is incompatible with the existing research literature, making it difficult for the alignment community to find or understand existing research, and for people from other backgrounds to contribute (while this is not the best option for the advancement of understanding, paradoxically, it may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than for pointing to relevant existing research)

Overall, 'shards' became so popular that reading at least the basics is probably necessary to understand what many people are talking about.

This is a great complement to Eliezer's 'List of Lethalities', in particular because, in cases of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point, and, with many others expressing their views in comments and other posts, helped make the beliefs within AI safety more transparent.

I still occasionally reference this post when talking to people who, after reading a bit about the debate (e.g. on social media), form an oversimplified model of it in which there is some unified 'safety' camp vs. 'optimists'.

Also, I think this demonstrates that 'just stating your beliefs' in a moderately-dimensional projection can be a useful type of post, even without much justification.

The post is influential, but makes multiple somewhat confused claims, and has led many people to become confused.

The central confusion stems from the fact that genetic evolution already created a lot of control circuitry before inventing the cortex, and did the obvious thing to 'align' the evolutionarily newer areas: bind them to the older circuitry via interoceptive inputs. By this mechanism, the genome is able to 'access' a lot of evolutionarily relevant beliefs and mental models. The trick is that the higher models, more distant from the genome, are learned in part to predict interoceptive inputs (which track the evolutionarily older reward circuitry), so they are bound by default, and there isn't much independent content left to 'bind'. Anyone can check this: just thinking about a dangerous-looking person with a weapon activates older, body-based fear/fight chemical regulatory circuits => the active inference machinery learned this and plans actions to avoid such states.


Part of ACS research directions fits into this - Hierarchical Agency, Active Inference based pointers to what alignment means, Self-unalignment.

My impression is you get a lot of "the latter" if you run "the former" on the domain of language and symbolic reasoning, and often the underlying model is still S1-type. E.g.

rights inherent & inalienable, among which are the preservation of life, & liberty, & the pursuit of happiness

does not sound to me like someone did a ton of abstract reasoning to systematize other abstract values, but more like someone succeeded in writing words which resonate with "the former".

Also, I'm not sure why you think the latter is more important for the connection to AI. Current ML seems more similar to "the former": informal, intuitive, fuzzy reasoning.

Re self-unalignment: that framing feels a bit too abstract for me; I don't really know what it would mean, concretely, to be "self-aligned". I do know what it would mean for a human to systematize their values—but as I argue above, it's neither desirable to fully systematize them nor to fully conserve them. 

That's interesting - in contrast, I have a pretty clear intuitive sense of a direction where some people have a lot of internal conflict and as a result their actions are less coherent, and some people have less of that.

In contrast, in the case of humans who you would likely describe as 'having systematized their values' ... I often doubt what's going on. A lot of people who describe themselves as hardcore utilitarians seem to be ... actually not that, but rather resemble a system where a somewhat confused verbal part fights with other parts, which are sometimes suppressed.

Identifying whether there's a "correct" amount of systematization to do feels like it will require a theory of cognition and morality that we don't yet have.

That's where I think looking at what human brains are doing seems interesting. Even if you believe the low-level / "the former" is not what's going on with human theories of morality, the technical problem seems very similar, and the same math possibly applies.

"Systematization" seems like either a special case of the Self-unalignment problem

In humans, it seems the post is somewhat missing what's going on. Humans are running something like this:

...there isn't any special systematization and concretization process. All the time, there are models running at different levels of the hierarchy, and every layer tries to balance between prediction errors from more concrete layers, and prediction errors from more abstract layers.

How does this relate to "values"? From the low-level sensory experience of cold, and a fixed prior on body temperature, the AIF system learns a more abstract and general "goal-belief" about the need to stay warm, and more abstract sub-goals about clothing, etc. In the end there is a hierarchy of increasingly abstract "goal-beliefs" about what I do, expressed relative to the world model.
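The balancing between layers can be sketched numerically. This is my own minimal construction, not from the post: two scalar "layers" squeezed between a fixed observation below and a fixed prior above, each settling where the pull from below and the pull from above cancel.

```python
# Minimal sketch of hierarchical balancing: each intermediate layer's state
# is pulled both toward the layer below (prediction error from more concrete
# levels) and toward the layer above (prediction error from more abstract
# levels). Plain gradient descent on the sum of both squared errors.

def relax(levels, top_prior, bottom_obs, lr=0.1, steps=200):
    """levels: mutable states between a fixed bottom observation and a
    fixed top prior; each settles where the two error gradients cancel."""
    for _ in range(steps):
        for i in range(len(levels)):
            below = bottom_obs if i == 0 else levels[i - 1]
            above = top_prior if i == len(levels) - 1 else levels[i + 1]
            # gradient of (x - below)**2 + (x - above)**2 w.r.t. x
            grad = 2 * (levels[i] - below) + 2 * (levels[i] - above)
            levels[i] -= lr * grad
    return levels

# Two layers between observation 0.0 and prior 1.0 settle at 1/3 and 2/3:
states = relax([0.0, 0.0], top_prior=1.0, bottom_obs=0.0)
```

There is no separate "systematization" step anywhere in this loop; intermediate abstractions just sit wherever the two error signals balance.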

What's worth studying here is how human brains manage to keep the hierarchy mostly stable.

I'll try to keep it short.

All the cross-generational information channels you highlight are at rough saturation, so they're not able to contribute to the cross-generational accumulation of capabilities-promoting information.

This seems clearly contradicted by empirical evidence. Mirror neurons would likely be able to saturate what you assume is the brain's learning rate, so not transferring more learned bits is much more likely because the marginal cost of doing so is higher than that of other sensible options. Which is a different reason than "saturated, at capacity".

Firstly, I disagree with your statement that other species have "potentially unbounded ways how to transmit arbitrary number of bits". Taken literally, of course there's no species on earth that can actually transmit an *unlimited* amount of cultural information between generations

Sure. Taken literally, the statement is obviously false ... literally nothing can store an arbitrary number of bits, because of the Bekenstein bound. More precisely, the claim is that existing non-human ways of transmitting learned bits to the next generation do not, in practice, seem to be constrained by the limit on how many bits they can transmit, but by other limits (e.g. you can transmit more bits than the animal has the capacity to learn).

Secondly, the main point of my article was not to determine why humans, in particular, are exceptional in this regard. The main point was to connect the rapid increase in human capabilities relative to previous evolution-driven progress rates with the greater optimization power of brains as compared to evolution. Being so much better at transmitting cultural information as compared to other species allowed humans to undergo a "data-driven singularity" relative to evolution. While our individual brains and learning processes might not have changed much between us and ancestral humans, the volume and quality of data available for training future generations did increase massively, since past generations were much better able to distill the results of their lifetime learning into higher-quality data.

1. As explained in my post, there is no reason to assume ancestral humans were so much better at transmitting information as compared to other species.

2. The qualifier that they were better at transmitting *cultural* information may (or may not) do a lot of work.

The crux is something like "what is the type signature of culture". Your original post roughly assumes "it's just more data". But this seems very unclear: in a comment above yours, jacob_cannell confidently claims I miss the forest and guesses the critical innovation is "symbolic language". But, obviously, "symbolic language" is a very different type of innovation than "more data transmitted across generations".

Symbolic language likely
- allows any type of channel to be used more effectively
- in particular, allows more efficient horizontal synchronization, enabling parallel computation across many brains
- overall sounds more like a software upgrade

Consider plain old telephone network wires: these have a surprisingly large intrinsic capacity, which isn't used that effectively by analog voice calls. Yes, when you plug a modem in on both sides you see a "jump" in capacity - but this is much more like a "software update", and can be correspondingly sudden.
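The back-of-the-envelope behind this analogy (my numbers, purely illustrative): the Shannon capacity C = B·log2(1 + SNR) of a ~3.1 kHz voiceband at a typical ~35 dB SNR is close to what late dial-up modems actually achieved, while analog voice conveys far less of it.

```python
import math

def shannon_capacity(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley capacity C = B * log2(1 + SNR), in bits/s."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

# ~3.1 kHz voiceband, ~35 dB SNR (illustrative figures):
c = shannon_capacity(3100, 35)
print(round(c))  # roughly 36,000 bit/s -- near V.34 modem speeds
```

Nothing about the wire changes between the two regimes; only the encoding does, which is why the jump looks like a software update.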

Or a different example: empirically, it seems possible to teach various non-human apes sign language (their general-purpose predictive-processing brains are general enough to learn this). I would classify this as a "software" or "algorithm" upgrade. If someone did this to a group of apes in the wild, it seems plausible the knowledge of language would stick and make them differentially more fit. But teaching apes symbolic language sounds in principle different from "it's just more data" or "it's higher-quality data", and the implications for AI progress would be different.

it relies on resource overhang being a *necessary* factor,

My impression is that, compared to your original post, your model drifts to more and more general concepts, where it becomes more likely true, harder to refute, and less clear in its implications for AI. What is the "resource" here? Does negentropy stored in wood count as "a resource overhang"?

I'm arguing specifically against a version where the "resource overhang" is caused by "exploitable resources you easily unlock by transmitting more bits learned by your brain vertically to your offspring's brain", because your mapping of humans to AI progress is based on a quite specific model of what the bottlenecks and overhangs are.

If the current version of the argument is "sudden progress happens exactly when (resource overhang) AND ..." with "generally any kind of resource", then yes, this sounds more likely, but it seems very unclear what it implies for AI.

(Yes, I'm basically not discussing the second half of the article.)

This seems to be partially based on a (common?) misunderstanding of CAIS as making predictions about the concentration of AI development/market power. As far as I can tell, this wasn't Eric's intention: I specifically remember Eric mentioning he could easily imagine the whole "CAIS" ecosystem living on one floor of the DeepMind building.

I feel somewhat frustrated by the execution of this initiative. As far as I can tell, no new signatures have been published since at least one day before the public announcement. This means that even if I asked someone famous (at least in some subfield or circles) to sign, and the person signed, their name is not on the list, leading to their understandable frustration. (I already got a piece of feedback in the direction of "the signatories are impressive, but the organization running it seems untrustworthy".)

Also if the statement is intended to serve as a beacon, allowing people who have previously been quiet about AI risk to connect with each other, it's essential for signatures to be published. It's nice that Hinton et al. signed, but for many people in academia it would be actually practically useful to know who from their institution signed - it's unlikely that most people will find collaborators in Hinton, Russell or Hassabis.

I feel even more frustrated because this is the second time a similar effort has been executed by the x-risk community while lacking the basic operational competence to accept and verify signatures. So, I make this humble appeal and offer to the organizers of any future public statements collecting signatures: if you are able to write a good statement and secure the endorsement of some initial high-profile signatories, but lack the ability to accept, verify and publish more than a few hundred names, please reach out to me - it's not that difficult to find volunteers for this work.

