• Director and Co-Founder of "Principles of Intelligent Behaviour in Biological and Social Systems" (
  • Research Affiliate and PhD student with the Alignment of Complex Systems group, Charles University (

Main research interests: 

  • How can a naturalized understanding of intelligent behavior (across systems, scales and substrates) be translated into concrete progress towards making AI systems safe and beneficial? 
  • What are scientific and epistemological challenges specific to making progress on questions in AI risk, governance and safety? And how can we overcome them?

Other interests:

  • Alternative AI paradigms and their respective capabilities-, safety-, and governability-profiles
  • The dual (descriptive-prescriptive) nature of the study of agency and the sciences of the artificial
  • Pluralist epistemic perspective on the landscape of AI risks
  • The "think" interface between technical and governance aspects of AI alignment
  • ...and more general ideas from philosophy & history of science, political philosophy, complex systems studies, and broadly speaking enactivist theories of cognition, ... as they are relevant to questions in AI risk/governance/safety

Going back further, I have also spent a bunch of time thinking about how (bounded) minds make sense of and navigate a (complex) world (i.e. rationality, critical thinking, etc.). I have several years of experience in research organization, among others from working at FHI, CHERI, Epistea, etc. I have a background in International Relations, and spent large parts of 2017-2019 doing complex-systems-inspired research on understanding group decision making and political processes with the aim of building towards an appropriate framework for "Longterm Governance".


The Value Change Problem (sequence)
Thoughts in Philosophy of Science of AI Alignment


Does it seem like I'm missing something important if I say "Thing = Nexus" gives a "functional" explanation of what a thing is, i.e. it serves the function of being an "inductive nexus of reference"? This is not a foundational/physicalist/mechanistic explanation, but it is very much a sort of explanation that I can imagine being useful in some cases/for some purposes.

I'm suggesting this as a possibly different angle on "what sort of explanation is Thing=Nexus, and why is it plausibly not fraught despite its somewhat-circularity?" It seems like it maps onto/doesn't contradict anything you say (note: I only skimmed the post so might have missed some relevant detail, sorry!), but I wanted to check whether, even if not conflicting, it misses something you think is or might be important somehow.

Re whether messy goal-seekers can be schemers, you may address this in a different place (and if so forgive me, and I'd appreciate you pointing me to where), but I keep wondering what notion of scheming (or deception, etc.) we should be adopting here, in particular:

  • an "internalist" notion, where 'scheming' is defined via the "system's internals", i.e. roughly: the system has goal A, acts as if it has goal B, until the moment is suitable to reveal its true goal A.
  • an "externalist" notion, where 'scheming' is defined either from the perspective of an observer (e.g. I thought the system has goal B, maybe I even did a bunch of more or less careful behavioral tests to raise my confidence in this assumption, but in some new situation, it gets revealed that the system pursues A instead)
  • or an externalist notion but defined via the effects on the world that manifest (e.g. from a more 'bird's-eye' perspective, we can observe that the system had a number of concrete (harmful) effects on one or several agents via the mechanism that those agents misjudged what goal the system is pursuing (therefore e.g. mispredicting its future behaviour, and basing their own actions on this wrong assumption))

It seems to me like all of these notions have different upsides and downsides. For example:

  • the internalist notion seems (?) to assume/bake into its definition of scheming a high degree of non-sphexishness/consequentialist cognition
  • the observer-dependent notion comes down to being a measure of the observer's knowledge about the system 
  • the effects-on-the-world based notion seems plausibly too weak/non-mechanistic to be helpful in the context of crafting concrete alignment proposals/safety tooling

Related to my point above (and this quoted paragraph), a fundamental nuance here is the distinction between "accidental influence side effects"  and "incentivized influence effects". I'm happy to answer more questions on this difference if it's not clear from the rest of my comment.

Thanks for clarifying; I agree it's important to be nuanced here!

I basically agree with what you say. I also want to say something like: whether to best count it as a side effect or as incentivized depends on what optimizer we're looking at/where you draw the boundary around the optimizer in question. I agree that a) at the moment, recommender systems are myopic in the way you describe, and the larger economic logic is where some of the pressure towards homogenization comes from (while other stuff is happening too, including humans pushing to some extent against that pressure, more or less successfully); b) at some limit, we might be worried about an AI system becoming so powerful that its optimization arc becomes sufficiently large in scope that it's correctly understood as directly doing incentivized influence; but I also want to point out a third scenario, c) where we should be worried about basically incentivized influence but not all of the causal force/optimization has to be enacted from within the boundaries of a single/specific AI system, but where the economy as a whole is sufficiently integrated with and accelerated by advanced AI to justify the incentivized influence frame (e.g. a la ascended economy, fully automated tech company singularity). I think the general pattern here is basically one of "we continue to outsource ever more consequential decisions to advanced AI systems, without having figured out how to make these systems reliably (not) do anything in particular".

A small misconception that lies at the heart of this section is that AI systems (and specifically recommenders) will try to make people more predictable. This is not necessarily the case.

Yes, I'd agree (and didn't make this clear in the post, sorry) -- the pressure towards predictability comes from a combination of the logic of performative prediction AND the "economic logic" that provides the context in which these performative predictors are being used/applied. This is certainly an important thing to be clear about!

(Though it also can only give us so much reassurance: I think it's an extremely hard problem to find reliable ways for AI models to NOT be applied inside of the capitalist economic logic, if that's what we're hoping to do to avoid the legibilisation risk.)

Agree! Examples abound. You can never escape your local ideological context - you can only try to find processes that have some hope of occasionally bumping into the bounds of your current ideology and pressing beyond them - no reliable recipe (just like there is no reliable recipe to make yourself notice your own blind spot) - but there is the hope for things that in expectation and intertemporally can help us with this.

Which poses a new problem (or clarifies the problem we're facing): we don't get to answer the question of value change legitimacy in a theoretical vacuum -- instead we are already historically embedded in a collective value change trajectory, affecting both what we value and what we (can) know.

I think that makes it sound a bit hopeless from one perspective, but on the other hand, we probably also shouldn't let hypothetical worlds we could never have reached weigh us down -- there are many hypothetical worlds we still can reach that are worth fighting for.

Yeah, interesting point. I do see the pull of the argument. In particular the example seems well chosen -- where the general form seems to be something like: we can think of cases where our agent can be said to be better off (according to some reasonable standards/from some reasonable vantage point) if the agent can make themselves be committed to continue doing a thing/undergoing a change for at least a certain amount of time.

That said, I think there are also some problems with it. For example, I'm wary of reifying "I-as-in-CEV" more than is warranted. For one, I don't know whether there is a single coherent "I-as-in-CEV" or whether there could be several; for two, how should I apply this argument practically speaking, given that I don't know what "I-as-in-CEV" would consider acceptable?

I think there is some sense in which proposing to use legitimacy as criterion has a flavour of "limited ambition" - using it will in fact mean that you will sometimes miss out on making value changes that would have been "good/acceptable" from various vantage points (e.g. legitimacy would say NO to pressing the button that would magically make everyone in the world peaceful/against war (unless the button involves some sophisticated process that allows you to back out legitimacy for everyone involved)). At the same time, I am wary that we cannot give up on legitimacy without risking much worse fates, and as such, I currently feel fairly compelled to opt for legitimacy from an intertemporal perspective.

Yes, sorry! I'm not making it super explicit, actually, but the point is that, if you read e.g. Paul's or Callard's accounts of value change (via transformative experiences and via aspiration respectively), a large part of how they even set up their inquiries is with respect to the question whether value change is irrational or not (or what problem value change poses to rational agency). The rationality problem comes up bc it's unclear from what vantage point one should evaluate the rationality (i.e. the "keeping with what expected utility theory tells you to do") of the (decision to undergo) value change. From the vantage point of your past self, it's irrational; from the vantage point of your new self (be it as parent, vampire or jazz lover), it may be rational.

From what I can tell, Paul's framing of transformative experiences is closer to "yes, transformative experiences are irrational (or a-rational) but they still happen; I guess we have to just accept that as a 'glitch' in humans as rational agents"; while Callard's core contribution (in my eyes) is her case for why aspiration is a rational process of value development.

Right, but I feel like I want to say something like "value grounding" as its analogue.

Also... I do think there is a crucial epistemic dimension to values, and the "[symbol/value] grounding" thing seems like one place where this shows quite well.

The process that invents democracy is part of some telotect, but is it part of a telophore? Or is the telophore only reached when democracy is implemented?

Musing about how (maybe) certain telophemes impose constraints on the structure (logic) of their corresponding telophores and telotects. E.g. democracy, freedom, autonomy, justice, corrigibility, rationality, ... (though plausibly you'd not want to count (some of) those examples as telophemes in the first place?)


Curious whether the following idea rhymes with what you have in mind: telophore as (sort of) doing ~symbol grounding, i.e. the translation (or capacity to translate) from description to (worldly) effect?
