My current research interests:
- alignment in systems which are complex and messy, composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality

Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Research fellow at the Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.


This seems to be partially based on a (common?) misunderstanding of CAIS as making predictions about the concentration of AI development/market power. As far as I can tell this wasn't Eric's intention: I specifically remember Eric mentioning he can easily imagine the whole "CAIS" ecosystem living on one floor of the DeepMind building.

I feel somewhat frustrated by the execution of this initiative. As far as I can tell, no new signatures have been published since at least one day before the public announcement. This means that even if I asked someone famous (at least in some subfield or circles) to sign, and the person signed, their name is not on the list, which understandably frustrates them. (I already got a piece of feedback in the direction "the signatories are impressive, but the organization running it seems untrustworthy")

Also if the statement is intended to serve as a beacon, allowing people who have previously been quiet about AI risk to connect with each other, it's essential for signatures to be published. It's nice that Hinton et al. signed, but for many people in academia it would be actually practically useful to know who from their institution signed - it's unlikely that most people will find collaborators in Hinton, Russell or Hassabis.

I feel even more frustrated because this is the second time a similar effort has been run by the xrisk community while lacking the basic operational competence to accept and verify signatures. So, I make this humble appeal and offer to the organizers of any future public statements collecting signatures: if you are able to write a good statement and secure the endorsement of some initial high-profile signatories, but lack the ability to accept, verify and publish more than a few hundred names, please reach out to me - it's not that difficult to find volunteers for this work.


I don't know. I talked with a few people before posting, and it seems opinions differ.

We also talk about e.g. "the drought problem" where we don't aim to make the landscape dry.

Also, as Kaj wrote, the problem isn't how to get self-unaligned.

Thanks for the links!

What I had in mind wasn't exactly the problem 'there is more than one fixed point', but more of 'if you don't understand what you set up, you will end up in a bad place'.

I think an example of a dynamic which we sort of understand and expect to be reasonable by human standards is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to e.g. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models, let them deliberate for thousands of human-speech-equivalent years and decide about the training of the next-in-the-sequence model, I don't trust the process.

This whole argument just does not hold.

(in animals)

The only way to transmit information from one generation to the next is through evolution changing genomic traits, because death wipes out the within lifetime learning of each generation.

This is clearly false. GPT4, can you explain? :

While genes play a significant role in transmitting information from one generation to the next, there are other ways in which animals can pass on information to their offspring. Some of these ways include:

  1. Epigenetics: Epigenetic modifications involve changes in gene expression that do not alter the underlying DNA sequence. These changes can be influenced by environmental factors and can sometimes be passed on to the next generation.
  2. Parental behavior: Parental care, such as feeding, grooming, and teaching, can transmit information to offspring. For example, some bird species teach their young how to find food and avoid predators, while mammals may pass on social behaviors or migration patterns.
  3. Cultural transmission: Social learning and imitation can allow for the transfer of learned behaviors and knowledge from one generation to the next. This is particularly common in species with complex social structures, such as primates, cetaceans, and some bird species.
  4. Vertical transmission of symbionts: Some animals maintain symbiotic relationships with microorganisms that help them adapt to their environment. These microorganisms can be passed from parent to offspring, providing the next generation with information about the environment.
  5. Prenatal environment: The conditions experienced by a pregnant female can influence the development of her offspring, providing them with information about the environment. For example, if a mother experiences stress or nutritional deficiencies during pregnancy, her offspring may be born with adaptations that help them cope with similar conditions.
  6. Hormonal and chemical signaling: Hormones or chemical signals released by parents can influence offspring development and behavior. For example, maternal stress hormones can be transmitted to offspring during development, which may affect their behavior and ability to cope with stress in their environment.
  7. Ecological inheritance: This refers to the transmission of environmental resources or modifications created by previous generations, which can shape the conditions experienced by future generations. Examples include beaver dams, bird nests, or termite mounds, which provide shelter and resources for offspring.


Actually, transmitting some of the data gathered during the lifetime of the animal to the next generation by some other means is so obviously useful that it is highly convergent. Given that it is highly convergent, the unprecedented thing which happened with humans can't be the thing proposed (evolution suddenly discovered how not to sacrifice all that's learned during the lifetime).

Evolution's sharp left turn happened because evolution spent compute in a shockingly inefficient manner for increasing capabilities, leaving vast amounts of free energy on the table for any self-improving process that could work around the evolutionary bottleneck. Once you condition on this specific failure mode of evolution, you can easily predict that humans would undergo a sharp left turn at the point where we could pass significant knowledge across generations. I don't think there's anything else to explain here, and no reason to suppose some general tendency towards extreme sharpness in inner capability gains.

If the above is not enough to see why this is false... This hypothesis would also predict civilizations built by every other species which transmits a lot of data e.g. by learning from parental behaviour - once evolution discovers the vast amounts of free energy on the table this positive feedback loop would just explode.

This isn't the case => the whole argument does not hold.

Also, this argument not working does not imply that evolution provides strong evidence for a sharp left turn.

What's going on?

In fact in my view we do not actually understand what exactly happened with humans. Yes, it likely has something to do with culture, and brains, and there being more humans around. But what's the causality?

Some of the candidates for "what's the actually fundamental differentiating factor and not a correlate":

- One notable thing about humans is that it's likely the second time in history a new type of replicator with R>1 emerged: memes. From a replicator-centric perspective on the history of the universe, this is the fundamental event, starting a different general evolutionary computation operating at a much shorter timescale.

- The Machiavellian intelligence hypothesis suggests that what happened was humans entered a basin of attraction where selection pressure on "modelling and manipulation of other humans" leads to an explosion in brain sizes. The fundamental thing suggested here is that you soon hit diminishing returns when scaling up energy-hungry predictive processing engines modelling a fixed-complexity environment - soon you would do better by e.g. growing bigger claws. Unless... you hit the Machiavellian basin, where sexual selection forces you to model other minds modelling your mind ... and this creates a race, in an environment of unbounded complexity.

- The social brain hypothesis is similar, but the runaway complexity of the environment arises just from living in large social groups.

- Self-domestication hypothesis: this is particularly interesting and intriguing. The idea is humans self-induced something like domestication selection, selecting for pro-social behaviours and reduction in aggression. From an abstract perspective, I would say this allows emergence of super-agents composed of individual humans, more powerful than individual humans. (once such entities exist, they can create further selection pressure for pro-sociality)

or, a combination of the above, or something even weirder

The main reason why it's hard to draw insights from the evolution of humans for AI isn't that there is nothing to learn, but that we don't know why what happened happened.

I don't see how the comparison of hardness of 'GPT task' and 'being an actual human' should technically work - to me it mostly seems like a type error. 

- The task 'predict the activation of photoreceptors in human retina' clearly has the same difficulty as 'predict next word on the internet' in the limit. (cf. Why Simulator AIs want to be Active Inference AIs)

- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'. But this comparison does not seem to be particularly informative.

- Going in this direction we can make comparisons between thresholds closer to reality e.g. 'predict the activation of photoreceptors in human retina, and do other similar computation well enough to be able to function as a typical human'  vs. 'predict next word on the internet, at the level of GPT4' . This seems hard to order - humans are usually able to do the human task and would fail at the GPT4 task at GPT4 level; GPT4 is able to do the GPT4 task and would fail at the human task. 

- You can't make an ordering between cognitive systems based on 'system A can't do task T, system B can, therefore B>A'. There are many tasks which humans can't solve, but this implies very little. E.g. a human is unable to remember a random 50-thousand-digit number while my phone easily can, but there are also many things which a human can do and my phone can't.

From the above, the possibly interesting direction for comparing 'human skills' and 'GPT-4 skills' is something like 'why can't GPT4 solve the human task at human level', 'why can't a human solve the GPT task at GPT4 level' and 'why the skills are a bit hard to compare'.

Some thoughts on this

- GPT4 clearly is "width superhuman": its task is ~modelling the textual output of the whole of humanity. This isn't a great fit for the architecture and bounds of a single human mind, for roughly the same reasons why a single human mind would do worse than the Amazon recommender at recommending products to each of a hundred million users. In contrast, a human would probably do better at recommending products to one specific user whose preferences the human recommender would try to predict in detail.

Humanity as a whole would probably do significantly better at this task, if you e.g. imagine assigning every human one other human to model (and study in depth, read all their text outputs, etc) 

- GPT4 clearly isn't better than humans at "samples -> abstractions": it needs more data to learn a pattern.

- With the overall ability to find abstractions, it seems unclear to what extent GPT "learned smart algorithms independently because they are useful to predict human outputs" vs. "learned smart algorithms because they are implicitly reflected in human text", and at the current level I would expect a mixture of both.


While the claim - the task ‘predict next token on the internet’ absolutely does not imply learning it caps at human-level intelligence - is true, some parts of the post and reasoning leading to the claims at the end of the post are confused or wrong. 

Let’s start from the end and try to figure out what goes wrong.

GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.

And since the task that GPTs are being trained on is different from and harder than the task of being a human, it would be surprising - even leaving aside all the ways that gradient descent differs from natural selection - if GPTs ended up thinking the way humans do, in order to solve that problem.

From a high-level perspective, it is clear that this is just wrong. Part of what human brains are doing is minimising prediction error with regard to sensory inputs. The unbounded version of this task is basically of the same generality and difficulty as what GPT is doing, and is roughly equivalent to understanding everything that is understandable in the observable universe. For example: a friend of mine worked on analysing the data from the LHC, leading to the Higgs detection paper. Doing this type of work basically requires a human brain to have a predictive model of aggregates of outputs of a very large number of collisions of high-energy particles, processed by a complex configuration of computers and detectors.

Where GPT and humans differ is not some general mathematical fact about the task, but differences in what sensory data a human and a GPT are trying to predict, and differences in cognitive architecture and the ways in which the systems are bounded. The different landscape of both boundedness and architecture can lead to both convergent cognition (thinking as the human would) and the opposite, predicting what the human would output in a highly non-human way.

The boundedness is overall a central concept here. Neither humans nor GPTs are attempting to solve ‘how to predict stuff with unlimited resources’, but a problem of cognitive economy - how to allocate limited computational resources to minimise prediction error.

Or maybe simplest:
 Imagine somebody telling you to make up random words, and you say, "Morvelkainen bloombla ringa mongo."

 Imagine a mind of a level - where, to be clear, I'm not saying GPTs are at this level yet -

 Imagine a Mind of a level where it can hear you say 'morvelkainen blaambla ringa', and maybe also read your entire social media history, and then manage to assign 20% probability that your next utterance is 'mongo'.

The fact that this Mind could double as a really good actor playing your character, does not mean They are only exactly as smart as you.

 When you're trying to be human-equivalent at writing text, you can just make up whatever output, and it's now a human output because you're human and you chose to output that.

 GPT-4 is being asked to predict all that stuff you're making up. It doesn't get to make up whatever. It is being asked to model what you were thinking - the thoughts in your mind whose shadow is your text output - so as to assign as much probability as possible to your true next word.


If I try to imagine a mind which is able to predict my next word when asked to make up random words, and be successful at assigning 20% probability to my true output, I’m firmly in the realm of weird and incomprehensible Gods. If the Mind is imaginably bounded and smart, it seems likely it would not devote much cognitive capacity to trying to model in detail strings prefaced by a context like ‘this is a list of random numbers’, in particular if inverting the process generating the numbers would seem really costly. Being this good at this task would require so much data and cheap computation that this is way beyond superintelligence, in the realm of philosophical experiments.

Overall I think this is a really unfortunate way to think about the problem, where a system which is moderately hard to comprehend (like GPT) is replaced by something much more incomprehensible. Also it seems a bit of a reverse intuition pump - I’m pretty confident most people's intuitive thinking about this ‘simplest’ thing will be utterly confused.

How did we get here?


 A human can write a rap battle in an hour.  A GPT loss function would like the GPT to be intelligent enough to predict it on the fly.


Apart from the fact that humans are also able to rap battle or improvise on the fly, notice that “what would the loss function like the system to do” in principle tells you very little about what the system will do. For example, the human loss function makes some people attempt to predict winning lottery numbers. This is an impossible task for humans, and you can’t say much about the human based on this. Or you can speculate about minds which would be able to succeed at this task, but you soon get into the realm of Gods and outside of physics.

Consider that sometimes human beings, in the course of talking, make errors.

GPTs are not being trained to imitate human error. They're being trained to *predict* human error.

Consider the asymmetry between you, who makes an error, and an outside mind that knows you well enough and in enough detail to predict *which* errors you'll make.

Again, from the cognitive economy perspective, predicting my errors would often be wasteful. With some simplification, you can imagine I make two types of errors - systematic and random. Often the simplest way to predict a systematic error would be to emulate the process which led to the error. Random errors are ... random, and a mind which knows me in enough detail to predict which random errors I’ll make seems a bit like the mind predicting the lottery numbers.
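
A minimal sketch of the systematic-versus-random distinction, under toy assumptions that are mine and not part of the discussion: a four-symbol alphabet, a writer who intends "a" but systematically produces "b", and a slip rate `eps` at which the output is replaced by a uniformly random symbol. The helper names `noisy_dist` and `cross_entropy_bits` are hypothetical.

```python
import math

def noisy_dist(target, alphabet, eps):
    """Distribution over symbols: `target` is emitted, except that with
    probability eps the output is replaced by a uniformly random symbol."""
    k = len(alphabet)
    return {s: (1 - eps if s == target else 0.0) + eps / k for s in alphabet}

def cross_entropy_bits(true_dist, predicted_dist):
    """Expected log loss (in bits) of a predictor against the true distribution."""
    return -sum(p * math.log2(predicted_dist[s])
                for s, p in true_dist.items() if p > 0)

alphabet = ["a", "b", "c", "d"]  # toy alphabet (assumption)
eps = 0.1                        # random slip rate (assumption)

# The writer intends "a" but systematically produces "b", plus random slips.
true_dist = noisy_dist("b", alphabet, eps)

# Predictor that models the intent but not the systematic error.
naive = noisy_dist("a", alphabet, eps)
# Predictor that emulates the process producing the systematic error.
emulator = noisy_dist("b", alphabet, eps)

floor = cross_entropy_bits(true_dist, true_dist)  # entropy of the source
naive_loss = cross_entropy_bits(true_dist, naive)
emulator_loss = cross_entropy_bits(true_dist, emulator)

print(f"entropy floor: {floor:.3f} bits")
print(f"naive loss:    {naive_loss:.3f} bits")
print(f"emulator loss: {emulator_loss:.3f} bits")
```

In this toy setup the predictor that emulates the error-producing process reaches the entropy floor. The remaining loss comes only from the random slips, which no predictor can reduce without access to the source of randomness.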

Consider that somewhere on the internet is probably a list of thruples: <product of 2 prime numbers, first prime, second prime>.

GPT obviously isn't going to predict that successfully for significantly-sized primes, but it illustrates the basic point:

There is no law saying that a predictor only needs to be as intelligent as the generator, in order to predict the generator's next token.

 The general claim that some predictions are really hard and you need superhuman powers to be good at them is true, but notice that this does not inform us about what GPT-x will learn. 
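
The asymmetry in the quoted example can be sketched as follows. This is only an illustration: the primes are small, trial division is the most naive factoring method (serious factoring uses much better algorithms), and the function names are hypothetical.

```python
def generate_triple(p, q):
    """Generator side: a single multiplication produces the whole line."""
    return (p * q, p, q)

def predict_factors(n):
    """Predictor side: recover the two factors from n alone by trial division.

    Returns the factor pair and the number of candidate divisors tried.
    """
    steps = 0
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            return (d, n // d), steps
        d += 1
    return (n, 1), steps  # n itself is prime (or 1)

# Two small primes chosen for illustration.
triple = generate_triple(10007, 10009)
factors, steps = predict_factors(triple[0])

print("generated line:", triple)
print("recovered factors:", factors, "after", steps, "trial divisions")
```

Generating the line costs one multiplication, while the predictor's work grows with the size of the smaller prime. This shows that the prediction task is hard; it says nothing about whether a trained model would learn to solve it.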

Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all the text on the Internet.

Koan:  Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text?  What factors make that task easier, or harder?  

Yes this is clearly true: in the limit the task is of unlimited difficulty.  


This is great & I strongly endorse the program 'let's figure out what's the actual computational anatomy of human values'. (I wrote a post about it a few years ago - it didn't fit that well into the sociology of opinions on LessWrong then.)

Some specific points where I do disagree

1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionarily desirable states like reproduction; this likely leads to drives which are present and quite active, such as "seek social status" => as a consequence I don't think the evolutionarily older drives are out of play, with the landscape flat and dominated by language-model-based values, as you assume

2. Overall, there is a lot of evolutionarily older computation running "on the body"; it provides an important source of reward signal for the later layers, and this is true and important even for modern humans. Many other things evolved in this basic landscape

3. The world model isn't a value-independent goal-orthogonal model; the stuff it learned is implicitly goal-oriented by being steered by the reward model

4. I'm way less optimistic about "aligning with mostly linguistic values". Quoting the linked post

Many alignment proposals seem to focus on interacting just with the conscious, narrating and rationalizing part of mind. If this is just a one part entangled in some complex interaction with other parts, there are specific reasons why this may be problematic.

One: if the “rider” (from the rider/elephant metaphor) is the part highly engaged with tracking societal rules, interactions and memes. It seems plausible the “values” learned from it will be mostly aligned with societal norms and interests of memeplexes, and not “fully human”.

This is worrisome: from a meme-centric perspective, humans are just a substrate, and not necessarily the best one. Also - a more speculative problem may be - schemes learning human memetic landscape and “supercharging it” with superhuman performance may create some hard to predict evolutionary optimization processes.

In other words, a large part of the language-model-based values could be just whatever is memetically fit.

Also, in my impression, these 'verbal' values sometimes seem to basically hijack some deeper drive and channel it into meme-replicating efforts. ("So you do care? And have compassion? That's great - here is a language-based analytical framework which maps your caring onto this set of symbols, and as a consequence, the best way to care is to do effective altruism community building")

5. I don't think that "when asked, many humans want to try to reduce the influence of their ‘instinctual’ and habitual behaviours and instead subordinate more of their behaviours to explicit planning" is much evidence of anything. My guess is actually many humans would enjoy more of the opposite - being more embodied, spontaneous, instinctive, and this is also true for some of the smartest people around. 

6. Broadly, I don't think the conclusion that human values are primarily linguistic concepts encoded via webs of association and valence in the cortex, learnt through unsupervised (primarily linguistic) learning, is stable upon reflection.

I've been part of, or read, enough debates with Eliezer to have some guesses about how the argument would go, so I made the move of skipping several steps of double-crux to the area where I suspect the actual cruxes lie.

I think exploring the whole debate-tree or argument map would be quite long, so I'll just try to gesture at how some of these things are connected, in my map.  

- pivotal acts vs. pivotal processes
-- my take is people's stance on feasibility of pivotal acts vs. processes partially depends on continuity assumptions - what do you believe about pivotal acts?

- assuming continuity, do you expect existing non-human agents to move important parts of their cognition to AI substrates?
-- if yes, do you expect large-scale regulations around that?
--- if yes, will it be also partially automated?

- different route: assuming continuity, do you expect a lot of alignment work to be done partially by AI systems, inside places like OpenAI?
-- if at the same time this is a huge topic for the whole society, academia and politics, would you expect the rest of the world not to try to influence this?

- different route: assuming continuity, do you expect a lot of "how different entities in the world coordinate" to be done partially by AI systems?
-- if yes, do you assume technical features of the system matter? like, eg., multi-agent deliberation dynamics?

- assuming the world notices AI safety as a problem (it did so much more since this post was written)
-- do you expect a large amount of attention and resources of academia and industry will be spent on AI alignment?
--- would you expect this will be somehow related to the technical problems and how we understand them?
--- e.g. do you think it makes no difference to the technical problem if 300 or 30k people work on it?
---- if it makes a difference, does it make a difference how the attention is allocated?

Not sure if the doublecrux between us would rest on the same cruxes, but I'm happy to try :)

On the topic of thinking about it for yourself and posting further examples as comments...

This is GPT4 thinking about convergent properties, using the post as a prompt and generating 20 plausibly relevant convergences. 

  • Modularity: Biological systems, like the human brain, display modularity in their structure, allowing for functional specialization and adaptability. Modularity is also found in industries and companies, where teams and departments are organized to handle specific tasks.
  • Hierarchical organization: In biological systems, hierarchical organization is common, with higher-level structures built from lower-level components. Hierarchies are also observed in companies and organizations.
  • Recurrent connections: Neural networks in the brain have recurrent connections, which allow for feedback and information processing over time. Recurrent structures are also seen in supply chains and communication networks.
  • Redundancy: Redundancy in biological systems provides robustness and fault tolerance. This concept is also utilized in industries for backup systems and fail-safe mechanisms.
  • Adaptation: Biological systems adapt to changing environments to survive. Adaptation is also a crucial aspect for businesses and industries, where they must adjust to market trends and demands.
  • Error correction: Biological systems have mechanisms to detect and correct errors (e.g., DNA repair mechanisms). Error correction is an essential aspect of modern communication systems and data storage.
  • Network robustness: Biological networks (e.g., metabolic networks) exhibit robustness against perturbations. Similar robustness is desirable in communication and transportation networks.
  • Small-world networks: Biological networks often display small-world properties, with short path lengths and high clustering. These properties are found in social networks and the internet.
  • Scale-free networks: Biological networks often exhibit scale-free properties, with a few highly connected nodes (hubs) and many less connected nodes. Scale-free networks are also found in the internet, social networks, and citation networks.
  • Sparsity: Neural networks in the brain are sparse, with many fewer connections than theoretically possible. Sparsity is also utilized in machine learning algorithms and data compression techniques.
  • Decentralization: Biological systems often rely on decentralized control mechanisms. Decentralization can also be seen in blockchain technology and peer-to-peer networks.
  • Homeostasis: Biological systems maintain internal stability through feedback mechanisms. Homeostasis is also relevant to industries, where maintaining stable operating conditions is essential.
  • Oscillations: Oscillatory behavior is common in biological systems, such as circadian rhythms. Oscillations can also be observed in economic cycles and traffic patterns.
  • Synchronization: Synchronization occurs in biological systems, such as the firing of neurons. Synchronization is also essential in distributed computing and communication systems.
  • Division of labor: Division of labor is observed in biological systems (e.g., cells within multicellular organisms) and is a fundamental principle in industries and organizations.
  • Cooperation and competition: Biological systems display a balance of cooperation and competition. These dynamics are also observed in economic systems, business strategies, and social interactions.
  • Plasticity: Plasticity in biological systems allows for learning and adaptation. In industries, plasticity is important for innovation and adaptation to changing market conditions.
  • Evolvability: Biological systems can evolve through mutation and selection. Evolvability is also relevant in industries, where companies must be able to innovate and adapt to survive.
  • Self-organization: Self-organization occurs in biological systems, such as pattern formation in developing organisms. Self-organization is also observed in swarm intelligence and decentralized control systems.
  • Energy efficiency: Biological systems are optimized for energy efficiency, as seen in metabolic pathways. Energy efficiency is also a crucial consideration in industries and technology development.

In my view
a) it broadly got the idea
b) the results are in better taste for understanding agents than e.g. what you get from karma-ranked LW frontpage posts about AIs on an average day
