• Co-Founder and Head of "Principles of Intelligent Behaviour in Biological and Social Systems" (
  • Research Affiliate and incoming PhD at "Alignment of Complex Systems" research group, Charles University, Prague (

I am a student in Philosophy and AI, with a particular interest in philosophy of science, political philosophy, complex systems studies, and enactivist frameworks -- as they pertain to AI risk, governance and alignment. A lot of my thinking is influenced by exploring what we can learn from the study of intelligent behaviour in currently existing, natural systems (across scales and substrates) about the nature and behaviour of, and the risks related to, future AI systems. Among other things, I am interested in a generative theory of value and the "think" interface between governance and technical aspects of AI alignment, and I spend a decent chunk of time thinking about scientific and epistemological challenges specific to AI alignment research, and how to address them.

Going back further, I have also spent a bunch of time thinking about how (bounded) minds make sense of and navigate a (complex) world (rationality, critical thinking, etc.). I have several years of experience in research organization, among others from working at FHI, CHERI, Epistea, etc. I have a background in International Relations, and spent large parts of 2017-2019 doing complex-systems-inspired research on understanding group decision making and political processes, with the aim of building towards an appropriate framework for "Longterm Governance".


Thoughts in Philosophy of Science of AI Alignment



The process that invents democracy is part of some telotect, but is it part of a telophore? Or is the telophore only reached when democracy is implemented?

Musing about how (maybe) certain telophemes impose constraints on the structure (logic) of their corresponding telophores and telotects. E.g. democracy, freedom, autonomy, justice, corrigibility, rationality, ... (though plausibly you'd not want to count (some of) those examples as telophemes in the first place?)


Curious whether the following idea rhymes with what you have in mind: telophore as (sort of) doing ~symbol grounding, i.e. the translation (or capacity to translate) from description to (worldly) effect?

Good point! We are planning to gauge time preferences among the participants and fix slots then. What is maybe most relevant, we are intending to accommodate all time zones. (We have been doing this with PIBBSS fellows as well, so I am pretty confident we will be able to find time slots that work pretty well across the globe.)

Here is another interpretation of what can cause a lack of robustness to scaling down: 

(Maybe this is what you have in mind when you talk about single-single alignment not (necessarily) scaling to multi-multi alignment - but I am not sure that is the case, and even if it is, I feel pulled to state it again, as I don't think it comes out as clearly as I would want it to in the original post.)

Taking the example of an "alignment strategy [that makes] the AI find the preferences of values and humans, and then pursu[e] that", robustness to scaling down can break if "human values" (as invoked in the example) don't "survive" reductionism; i.e. if, when we try to apply reductionism to "human values", we are left with "less" than what we hoped for.

This is the inverse of saying that there is an important non-linearity when trying to scale (up) from single-single alignment to multi-multi alignment. 

I think this interpretation locates the reason for the lack of robustness in neither the capabilities nor the alignment regime, which is why I wanted to raise it. It's a claim about the nature or structural properties of "human values"; or a hint that we are deeply confused about human values (e.g. that the term currently refers to an incoherent cluster or "unnatural" abstraction).

What you say about CEV might capture this fully, but whether it does, I think, is an empirical claim of sorts; a proposed solution to the more general diagnosis that I am trying to propose, namely that (the way we currently use the term) "human values" may itself not be robust to scaling down.