Roman Leventov

An independent researcher, blogger, and philosopher writing about intelligence and agency (esp. Active Inference), alignment, ethics, the interaction of the AI transition with sociotechnical risks (epistemics, economics, human psychology), collective mind architecture, and research strategy and methodology.

Twitter: https://twitter.com/leventov. E-mail: leventov.ru@gmail.com (the preferred mode of communication). I'm open to collaborations and work.

Presentations at meetups, workshops and conferences, some recorded videos.

I'm a founding member of the Gaia Consortium, which is on a mission to create a global, decentralised system for collective sense-making and decision-making, i.e., civilisational intelligence. Drop me a line if you want to learn more about it and/or join the consortium.

You can help boost my sense of accountability and give me a feeling that my work is valued by becoming a paid subscriber to my Substack (though I don't post anything paywalled; in fact, there I just syndicate my LessWrong writing).

For Russian speakers: the Russian-language AI safety network, Telegram group

Sequences

A multi-disciplinary view on AI safety

Comments

if the next generation of models do pose an x-risk, we've mostly already lost—we just don't yet have anything close to the sort of regulatory regime we'd need to deal with that in place

Do you think that if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, this would reduce the overall x-risk/p(doom)? I'm asking because here you seem to assume a defeatist position that only governments are able to shape the actions of the leading AGI labs (which, by the way, are very few: in my understanding, only 3 or 4 labs have any chance of releasing a "next generation" model within the next two years; others won't reach that level of capability even if they tried). Yet in the post you advocate for the opposite: voluntary actions taken by the labs, with regulation following.

If the external process is predictable, the LLM will move to parts of the state space that best account for the effects of the environment and its model of the most likely sequences (loosely analogous to a Bayesian posterior).

I think it would be more accurate to say that the dynamics of internal states of LLMs parameterise not just a model of token sequences but a model of the world, with token sequences as its sensory manifestation.

I'm sure that LLMs already possess some world models (see Actually, Othello-GPT Has A Linear Emergent World Representation); the question is really only how the structure and mechanics of LLMs' world models differ from the world models of humans.
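A toy illustration of what I mean (my own sketch, not from the quoted comment, with made-up states and probabilities): even a model trained purely on token sequences is, in effect, maintaining a posterior over the hidden world states that make those sequences predictable, with tokens as their sensory manifestation.

```python
import numpy as np

# Hidden "world states" that generate tokens; the observer never sees them directly.
states = ["sunny", "rainy"]
prior = np.array([0.5, 0.5])        # P(world state)
emission = {                        # P(token | world state), illustrative numbers
    "park":     np.array([0.6, 0.1]),
    "umbrella": np.array([0.1, 0.6]),
    "home":     np.array([0.3, 0.3]),
}

def posterior(tokens, prior=prior):
    """P(world state | token sequence) via Bayes' rule, assuming i.i.d. emissions."""
    p = prior.copy()
    for t in tokens:
        p = p * emission[t]
        p = p / p.sum()
    return dict(zip(states, p.round(3)))

print(posterior(["umbrella", "home", "umbrella"]))
# The belief about the hidden world state, not the raw tokens,
# is what makes the next token predictable.
```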

the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns

Could you please elaborate on what you mean by the "alignment story for LLMs" and "shoggoth concerns" here? Do you mean the "we can use nearly value-neutral simulators as we please" story, are you referring to the fact that LLMs are in a way much more understandable to humans than more general RL agents because they use human language, or do you mean something else?

OOD misgeneralisation is unlikely to be a direct x-risk from superintelligence

Overall, I think the issue of causal confusion and OOD misgeneralisation is much more about capabilities than about alignment, especially if we are talking about the long-term x-risk from superintelligent AI, rather than short/mid-term AI risk.

OOD misgeneralisation is absolutely inevitable, due to Gödel's incompleteness of the universe and the fact that the systems that evolve on Earth generally climb up in complexity. Whenever there is a new invention, such as money, the internet, or (future) autonomous AI agents, the civilisation becomes more complex as a whole, and the distributions of many variables change. ("Towards a Theory of Evolution as Multilevel Learning" is my primary source of intuition about this.) In the study of complex systems, there is a postulate that each component (subsystem) is ignorant of the behaviour of the system as a whole and doesn't know the full effect of its actions. This applies to any component, no matter how intelligent. Humans misgeneralise all the time (examples: lead in petrol, the creation of addictive apps such as Instagram, etc.). Superintelligence will misgeneralise, too, though perhaps in ways which are very subtle or even incomprehensible to humans.

Then, it's possible that superintelligence will misgeneralise due to causal confusion on some matter which is critical to humans' survival/flourishing, e.g. something like qualia, human consciousness, and their moral value. And although I don't think this risk is negligible, exactly because superintelligence probably won't have direct experience of or access to human consciousness, I feel this exact failure mode is somewhat minor compared to all the other reasons for which superintelligence might kill us. Anyway, I don't see what we can do about this if the problem is indeed that superintelligence will not have first-hand experience of human consciousness.


Capabilities consequences
1) The model may not make competent predictions out-of-distribution (capabilities misgeneralisation). We discuss this further in ERM leads to causally confused models that are flawed OOD.

Alignment consequences: 
2) If the model is causally confused about objects related to its goals or incentives, then it might competently pursue changes in the environment that either don’t actually result in the reward function used for training being optimised (objective misgeneralisation).

Did you use the term "objective misgeneralisation" rather than "goal misgeneralisation" on purpose? "Objective" and "goal" are synonyms, but "objective misgeneralisation" is hardly used, whereas "goal misgeneralisation" is the standard term.

Also, I think it's worth noting that this distinction between capabilities and goal misgeneralisation is defined within the RL framework. In other frameworks, such as Active Inference, these are the same thing, because there is no ontological distinction between reward and belief.

It might be suspected that OOD generalisation can be tackled in the scaling paradigm by using diverse enough training data, for example, including data sampled from every possible test environment. Here, we present a simple argument that this is not the case, loosely adapted from Remark 1 from Krueger et al. REx:

The reason data diversity isn’t enough comes down to concept shift (change in P(Y|X)). Such changes can be induced by changes in unobserved causal factors, Z. Returning to the ice cream (Y), shorts (X), and sun (Z) example, shorts are a very reliable predictor of ice cream when it is sunny, but not otherwise: P(Y|X, Z=sunny) is high, while P(Y|X, Z=not sunny) is low. Since the model doesn’t observe Z, there is not a single setting of P(Y|X) that will work reliably across different environments with different climates (different P(Z)). Instead, P(Y|X) depends on P(Z), which in turn depends on the climate in the locations where the data was collected. In this setting, to ensure a model trained with ERM can make good predictions in a new “target” location, you would have to ensure that that location is as sunny as the average training location so that P(Y|X) is the same at training and test time. It is not enough to include data from the target location in the training set, even in the limit of infinite training data - including data from other locations changes the overall P(Y|X) of the training distribution. This means that without domain/environment labels (which would allow you to have different P(Y|X) for different environments, even if you can’t observe Z), ERM can never learn a non-causally confused model.
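To make the quoted argument concrete, here is a minimal simulation sketch (the specific probabilities and function names are my own illustrative choices, not from the post): an ERM-style estimate of P(ice cream | shorts) pooled over training locations is tied to how sunny those locations are, and is off in a less sunny target location.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_env(p_sun, n):
    """One 'location': sun is a hidden common cause of both shorts and ice cream."""
    sun = rng.random(n) < p_sun
    shorts = np.where(sun, rng.random(n) < 0.8, rng.random(n) < 0.1)
    ice_cream = np.where(sun, rng.random(n) < 0.7, rng.random(n) < 0.1)
    return shorts, ice_cream

def p_ice_cream_given_shorts(shorts, ice_cream):
    """Empirical P(ice cream | shorts), i.e. what ERM on (shorts -> ice cream) learns."""
    return ice_cream[shorts].mean()

n = 200_000
train = [sample_env(0.9, n), sample_env(0.8, n)]   # mostly sunny training locations
target = sample_env(0.2, n)                        # rarely sunny target location

shorts_tr = np.concatenate([s for s, _ in train])
ice_tr = np.concatenate([i for _, i in train])

print("ERM estimate on the training mix:",
      round(p_ice_cream_given_shorts(shorts_tr, ice_tr), 3))   # ~0.69
print("Same conditional in the target location:",
      round(p_ice_cream_given_shorts(*target), 3))             # ~0.50
```

The pooled conditional is a P(Z)-weighted mixture, so it tracks the training mix of climates rather than any single location.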

Maybe I'm missing something obvious, but this argument looks wrong to me, or else it assumes that the learning algorithm is not allowed to discover additional (conceptual, abstract, hidden, implicit) variables in the training data; this is false for deep neural networks (though true for random forests). A deep neural network can discover variables that are not present in the data but are probable confounders of several other variables, such as "something that is a confounder of shorts, sunscreen, and ice cream".
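As a sketch of what I mean (toy numbers of my own; a simple latent class model fit with EM standing in for what a more flexible learner could discover implicitly): given several observables that share a hidden cause, a learner can recover a variable that tracks that hidden confounder without ever observing it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hidden confounder, never shown to the learner.
sun = rng.random(n) < 0.5

def proxy(p_if_sun, p_if_not):
    return np.where(sun, rng.random(n) < p_if_sun, rng.random(n) < p_if_not)

# Three observed proxies, conditionally independent given `sun`.
X = np.column_stack([proxy(0.8, 0.1),    # shorts
                     proxy(0.7, 0.05),   # sunscreen
                     proxy(0.7, 0.1)])   # ice cream
X = X.astype(float)

# EM for a binary latent class model: posit one hidden cause z and fit
# P(z) and P(x_j | z) from the observations alone.
pi = 0.5
theta = rng.uniform(0.3, 0.7, size=(2, 3))   # theta[z, j] = P(x_j = 1 | z)
for _ in range(200):
    # E-step: responsibilities P(z = 1 | x)
    log_p1 = np.log(pi) + X @ np.log(theta[1]) + (1 - X) @ np.log(1 - theta[1])
    log_p0 = np.log(1 - pi) + X @ np.log(theta[0]) + (1 - X) @ np.log(1 - theta[0])
    r = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
    # M-step
    pi = r.mean()
    theta[1] = (r[:, None] * X).sum(0) / r.sum()
    theta[0] = ((1 - r)[:, None] * X).sum(0) / (1 - r).sum()

# The inferred latent class lines up with the true hidden confounder
# (up to label switching).
agreement = max(((r > 0.5) == sun).mean(), ((r < 0.5) == sun).mean())
print(f"agreement between inferred latent and true 'sun': {agreement:.2f}")
```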

Discovering such hidden confounders doesn't give interventional capacity: Mendel discovered genetic inheritance factors, but without observing them, he couldn't intervene on them. Only the discovery of DNA and later the invention of gene editing technology allowed intervention on genetic factors.

One can say that discovering hidden confounders merely extends what should be considered the in-distribution environment. But then, what is OOD generalisation, anyway? And can't we prove that ERM (or any other training method whatsoever) will create models that will sometimes fail, simply because there is Gödel's incompleteness in the universe?

While this model might not make very good predictions, it will correctly predict that getting you to put on shorts is not an effective way of getting you to want ice cream, and thus will be a more reliable guide for decision-making (about whether to wear shorts).

I don't understand the italicised part of this sentence. Why will P(shorts, ice cream) be a reliable guide to decision-making?

(a, ii =>)

What do these symbols in parens before the claims mean?

My current favourite notion of agency, primarily based on Active Inference, which I refined upon reading "Discovering Agents", is the following:

Agency is a property of a physical system from some observer’s subjective perspective. It stems from the observer’s generative model of the world (including the object in question), specifically whether the observer predicts the agent's future trajectory in the state space by assuming that the agent has its own generative model which the agent uses to act. The agent's own generative model also depends on (adapts to, is learned from, etc.) the agent's environment. This last bit comes from "Discovering Agents".

"Having own generative model" is the shakiest part. It probably means that storage, computation, and maintenance (updates, learning) of the model all happen within the agent's boundaries: if not, the agent's boundaries shall be widened, as in the example of "thermostat with its creation process" from "Discovering Agents". The storage and computational substrate of the agent's generative model is not important: it could be neuronal, digital, chemical, etc.

Now, the observer models the generative model inside the agent. Here's where the Vingean veil comes from: if the observer has perfect observability of the agent's internals, it can believe that its model of the agent exactly matches the agent's own generative model, but usually the match will be less than perfect, due to limited observability.

However, even perfect observability doesn't guarantee safety: the generative model might be large and effectively incompressible (the halting problem), so the only way to see what it will do may be to execute it.

Theory of mind is closely related to all of the above, too.