Koen Holtman

Computing scientist and Systems architect. Currently doing self-funded AGI safety research.


Counterfactual Planning


Finite Factored Sets

Some general comments:

Overcoming blindness

You mention above that Pearl's ontology 'has blinded us to the obvious next question'. I am very sympathetic to research programmes that try to overcome such blindness; this is the kind of research I have been doing myself recently. The main type of blindness that I have been trying to combat is blindness to complex types of self-referencing and indirect representation that can be present inside online machine learning agents. Specifically, in my recent work I have added a less blind viewpoint by modifying and extending Pearl's causal graphs, so that you end up with a two-causal-diagram model of agency and machine learning. These extensions may be of interest to you, especially in relation to problems of embeddedness, but the main point I want to make here is a methodological one.

What I found, somewhat to my surprise, is that I did not need to develop the full mathematical equivalent of all of Pearl's machinery in order to shed more light on the problems I wanted to investigate. For example, the idea of d-separation is very fundamental to the type of thing that Pearl does with causal graphs, fundamental to clarifying problems of experimental design and interpretation in medical experiments. But I found that this concept was irrelevant to my aims. Above, you have a table of how concepts like d-separation map to the mathematics developed in your talk. My methodological suggestion here is that you probably do not want to focus on defining mathematical equivalents for all of Pearl's machinery. Instead, it will be a sign of de-blinding progress if you define new stuff that is largely orthogonal.

While I have been looking at blindness to problems of indirection, your part two subtitle suggests you are looking at blindness with respect to the problem of 'time' instead. However, my general feeling is that you are addressing another type of blindness, both in this talk and in 'Cartesian frames'. You are working to shed more light on the process that creates a causal model, be it a Pearlian or semi-Pearlian model, the process that generates the nodes and the arrows/relations between these nodes.

The mechanical generation of correct (or at least performant) causal models from observational data is, I believe, a whole (emerging?) subfield of ML. I have not read much of the literature in this field, but here is one recent paper that may serve as an entry point.

How I can interpret factoring graphically

Part of your approach is to convert Pearl's partly graphical math into a different, non-graphical formalism you are more comfortable with. That being said, I will now construct a graphical analogy to the operation of factoring you define.

You define factoring as taking a set $S$ and creating a set of factors (sets) $B = \{b_1, \ldots, b_n\}$, such that (in my words) every $s \in S$ can be mapped to an equivalent tuple $(x_1, \ldots, x_n)$, where $x_1 \in b_1$, etc.

Graphically, I can depict $S$ as a causal graph with just a single node, a node representing a random variable that takes values in $S$. The factoring would be an $n$-node graph where each node represents a random variable taking values from one of the factors $b_i$. So I can imagine factorization as an operation that splits a single graph node into many nodes, one per factor.
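To make the splitting operation concrete, here is a minimal Python sketch of the bijection requirement as I read it. The function name `is_factorization`, the example set, and the two splitting functions are all hypothetical illustrations, not definitions taken from the talk.

```python
from itertools import product

def is_factorization(elements, factors, to_tuple):
    """Check that to_tuple maps the elements one-to-one onto the full product of the factors."""
    images = [to_tuple(e) for e in elements]
    all_tuples = set(product(*factors))
    no_collisions = len(set(images)) == len(images)
    return no_collisions and set(images) == all_tuples

# Hypothetical example: the set {0, ..., 5}, split into (parity, value mod 3).
elements = range(6)
split = lambda s: (s % 2, s % 3)
print(is_factorization(elements, [{0, 1}, {0, 1, 2}], split))  # True

# A lossy split that keeps only parity: several elements share one tuple,
# so the mapping is not a bijection.
lossy = lambda s: (s % 2,)
print(is_factorization(elements, [{0, 1}], lossy))  # False
```

The second call corresponds to the looser kind of splitting discussed next, where sub-observables are allowed to drop information about the original observable.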

In terms of mainstream practice in experimental design, this splitting operation replaces a single observable with several sub-observables. Where you depart from normal practice is that you require the splitting operation to create a full bijection: this kind of constraint is much more loosely applied in normal practice. It feels to me you are after some kind of no-loss-of-information criterion in defining partitioning as you do -- the criterion you apply seems to be unnecessarily strict however, though it does create a fun mathematical sequence.

In any case, if a single node splits into several nodes, we can wonder how we should picture the arrows between these nodes that need to be drawn in after the split. Seems to me that this is a key question you are trying to answer: how does the split create arrows, or other relations that are almost but not entirely like Pearl's causal arrows? My own visual picture here is that, in the most general case, the split creates a fully connected directed graph: each node has an arrow to every other node. This would be a model representation that is compatible with the theory that all observables represented by the nodes are dependent on each other. Then, we might transform this fully connected graph into a DAG, a DAG that is still compatible with observed statistical relations, by deleting certain arrows, and potentially by adding unobserved nodes with emerging arrows. (Trivial example: drawing an arrow from one node to another is equivalent to stating a theory that the second node is maybe not statistically independent of the first. If I can disprove that theory, I can remove the arrow.)

This transformation process typically allows for many different candidate DAGs to be created which are all compatible with observational data. Pearl also teaches that we may design and run experiments with causal interventions in order to generate more observational data which can eliminate many of these candidate DAGs.
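The arrow-deletion step in the trivial example can be sketched in Python as follows. This is only a toy, under loud assumptions: it works on an exactly known joint distribution rather than on sampled data, it tests only pairwise (marginal) independence, and it does not orient arrows or add unobserved nodes, so it is not a full causal discovery algorithm. All function names are hypothetical.

```python
from itertools import combinations, product

def marginal(joint, idxs):
    """Sum the joint distribution down to the variables at positions idxs."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out

def independent(joint, i, j, tol=1e-9):
    """True if variables i and j are marginally independent under the joint."""
    pi, pj, pij = marginal(joint, [i]), marginal(joint, [j]), marginal(joint, [i, j])
    return all(abs(pij.get((a, b), 0.0) - pi[(a,)] * pj[(b,)]) <= tol
               for (a,) in pi for (b,) in pj)

def prune(joint, n):
    """Start from a fully connected directed graph and delete arrows between independent pairs."""
    edges = {(i, j) for i in range(n) for j in range(n) if i != j}
    for i, j in combinations(range(n), 2):
        if independent(joint, i, j):
            edges -= {(i, j), (j, i)}
    return edges

# Hypothetical world: X0 is a fair coin, X1 copies X0, X2 is an unrelated fair coin.
joint = {(a, a, c): 0.25 for a, c in product([0, 1], repeat=2)}
print(sorted(prune(joint, 3)))  # [(0, 1), (1, 0)]
```

Both arrows between X0 and X1 survive, which illustrates why several candidate DAGs typically remain compatible with purely observational data.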

Finite Factored Sets

My thoughts on naming this finite factored sets: I agree with Paul's observation that

| Factorization seems analogous to describing a world as a set of variables

By calling this 'finite factored sets', you are emphasizing the process of coming up with individual random variables, the variables that end up being the (names of the) nodes in a causal graph. With $s \in S$ representing the entire observable 4D history of a world (like a computation starting from a single game of life board state), a factorisation splits such an $s$ into a tuple of separate, more basic observables $(x_1, \ldots, x_n)$, where each $x_i$ comes from a factor $b_i$. In the normal narrative that explains Pearl causal graphs, this splitting of the world into smaller observables is not emphasized. Also, the splitting does not necessarily need to be a bijection. It may lose descriptive information with respect to $s$.

So I see the naming finite factored sets as a way to draw attention to this splitting step: it draws attention to the fact that if you split things differently, you may end up with very different causal graphs. This leaves open the question, of course, of whether you really want to name your framework in a way that draws attention to this part of the process. You definitely spend a lot of time on creating an equivalent to the arrows between the nodes too.

Formal Inner Alignment, Prospectus

I like your agenda. Some comments....

The benefit of formalizing things

First off, I'm a big fan of formalizing things so that we can better understand them. In the case of AI safety, that better understanding may lead to new proposals for safety mechanisms or failure mode analysis.

In my experience, once you manage to create a formal definition, it seldom captures the exact or full meaning you expected the informal term to have. Formalization usually exposes or clarifies certain ambiguities in natural language. And this is often the key to progress.

The problem with formalizing inner alignment

On this forum and in the broader community, I have seen a certain anti-pattern appear. The community has so far avoided getting too bogged down in discussing and comparing alternative definitions and formalizations of the intuitive term intelligence.

However, it has definitely gotten bogged down when it comes to the terms corrigibility, goal-directedness, and inner alignment failure. I have seen many cases of this happening:

The anti-pattern goes like this:

participant 1: I am now going to describe what I mean with the concept of $X \in \{\text{corrigibility}, \text{goal-directedness}, \text{inner alignment failure}\}$, as a first step to make progress on this problem of $X$.

participants 2-n: Your description does not correspond to my intuitive concept of $X$ at all! Also, your steps 2 and 3 seem to be irrelevant to making progress on my concept of $X$, because of the following reasons.

In this post on corrigibility I have called corrigibility a term with a high linguistic entropy. I think the same applies to the other two terms above.

These high-entropy terms seem to be good at producing long social media discussions, but unfortunately these discussions seldom lead to any conclusions or broadly shared insights. A lot of energy is lost in this way. What we really want, ideally, is useful discussion about the steps 2 and 3 that follow the definitional step.

On the subject of offering formal versions of inner alignment, you write:

A weakness of this as it currently stands is that I purport to offer the formal version of the inner optimization problem, but really, I just gesture at a cloud of possible formal versions.

My recommendation would be to see the above weakness as a feature, not a bug. I'd be interested in reading posts (or papers) where you pick one formal problem out of this cloud and run with it, to develop new proposals for safety mechanisms or failure mode analysis.

Some technical comments on the formal problem you identify

From your section 'the formal problem', I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.

You then consider the question of whether these failure modes could be suppressed by somehow limiting the complexity of the 'inner optimization' process, limited so that it is no longer capable of finding the unwanted 'malign' solutions. I'll give you my personal intuition on that approach here, by way of an illustrative example.

Say we have a shepherd who wants to train a newborn lion as a sheepdog. The shepherd punishes the lion whenever the lion tries to eat a sheep. Now, once the lion is grown, it will either have internalized the goal of not eating sheep but protecting them, or the goal of not getting punished. If the latter, the lion may at one point sneak up while the shepherd is sleeping and eat the shepherd.

It seems to me that the possibility of this treacherous turn happening is encoded from the start into the lion's environment and the ambiguity inherent in its reward signal. For me, the design approach of suppressing the treacherous turn dynamic by designing a lion that will not be able to imagine the solution of eating the shepherd seems like a very difficult one. The more natural route would be to change the environment or reward function.
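The ambiguity in the lion's reward signal can be illustrated with a small Python sketch. It is deliberately simplified: it only models the sheep-eating decision, not the shepherd-eating turn, and the two reward functions and situation lists are hypothetical stand-ins for the two goals the lion might internalize.

```python
# Each situation is (lion_eats_sheep, shepherd_awake). All names are hypothetical.

def reward_protect_sheep(eats_sheep, shepherd_awake):
    """Intended goal: eating a sheep is always bad."""
    return -1 if eats_sheep else 0

def reward_avoid_punishment(eats_sheep, shepherd_awake):
    """Proxy goal: eating a sheep is only bad when the shepherd can punish it."""
    return -1 if (eats_sheep and shepherd_awake) else 0

training = [(True, True), (False, True)]      # the shepherd is always watching
deployment = [(True, False), (False, False)]  # the shepherd is asleep

def agree(situations):
    """True if both candidate goals assign the same reward in every situation."""
    return all(reward_protect_sheep(*s) == reward_avoid_punishment(*s)
               for s in situations)

print(agree(training))    # True
print(agree(deployment))  # False
```

On the training situations the two goals are indistinguishable, so no amount of training in that environment can tell the shepherd which one the lion has internalized.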

That being said, I can interpret Cohen's imitation learner as a solution that removes (or at least attempts to suppress) all creativity from the lion's thinking.

If you want to keep the lion creative, you are looking for a way to robustly resolve the above inherent ambiguity in the lion's reward signal, to resolve it in a particular direction. Dogs are supposed to have a mental architecture which makes this easier, so they can be seen as an existence proof.

Reward hacking

I guess I should re-iterate that, though treacherous turns seem to be the most popular example that comes up when people talk about inner optimizers, I see treacherous turns as just another example of reward hacking, of maximizing the reward signal in a way that was not intended by the original designers.

As 'not intended by the original designers' is a moral or utilitarian judgment, it is difficult to capture it in math, except indirectly. We can do it indirectly by declaring e.g. that a mentoring system is available which shows the intention of the original designers unambiguously by definition.

Draft report on existential risk from power-seeking AI

Re: “there is a whole body of work which shows that evolved systems are often power-seeking” -- anything in particular you have in mind here?

For AI specific work, the work by Alex Turner mentioned elsewhere in this comment section comes to mind, as backing up a much larger body of reasoning-by-analogy work, like Omohundro (2008). But the main thing I had in mind when making that comment, frankly, was the extensive literature on kings and empires. In broader biology, many genomes/organisms (bacteria, plants, etc) will also tend to expand to consume all available resources, if you put them in an environment where they can, e.g. without balancing predators.

Draft report on existential risk from power-seeking AI

I have two comments on section 4:

This section examines why we might expect it to be difficult to create systems of this kind that don’t seek to gain and maintain power in unintended ways.

First, I like your discussion in section 4.3.3. The option of controlling circumstances is too often overlooked I feel.

However, your further analysis of the level of difficulty seems to be based mostly on the assumption that we must, or at least will, treat an AI agent as a black box that is evolved, rather than designed. Section 4.5:

[full alignment] is going to be very difficult, especially if we build them by searching over systems that satisfy external criteria, but which we don’t understand deeply, and whose objectives we don’t directly control.

There is a whole body of work which shows that evolved systems are often power-seeking. But at the same time within the ML and AI safety literature, there is also a second body of work on designing systems which are not power seeking at all, or have limited power seeking incentives, even though they contain a machine-learning subsystem inside them. I feel that you are ignoring the existence and status of this second body of work in your section 4 overview, and that this likely creates a certain negative bias in your estimates later on.

Some examples of designs that explicitly try to avoid or cap power-seeking are counterfactual oracles, and more recently imitation learners like this one, and my power-limiting safety interlock here. All of these have their disadvantages and failure modes, so if you are looking for perfection they would disappoint you, but if you are looking for tractable x-risk management, I feel there is reason for some optimism.

BTW, the first page of chapter 7 of Russell's Human Compatible makes a similar point, flatly declaring that we would be toast if we made the mistake of viewing our task as controlling a black box agent that was given to us.

Another (outer) alignment failure story

This story reminds me of the run-up to the 2007-2008 financial crisis:

But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports.

There is also an echo of 'we know that we do not fully understand these complex financial products referencing other complex financial products, even the quants admit they do not fully understand them, but who cares if we are making that much money'.

Overall, if I replace 'AI' above with 'complex financial product', the story reads about the same. So was this story inspired and constructed by transposing certain historical events, or is it just a coincidence?

Learning and manipulating learning

Meta: This comment has my thoughts about the paper Pitfalls of Learning a Reward Function Online. I figure I should post them here so that others looking for comments on the paper might find them.

I read the paper back in 2020; it has been on my backlog ever since to think more about it and share my comments. Apologies for the delay, etc.

Mathematical innovation

First off, I agree with the general observations in the introduction that there are pitfalls to learning a reward function online, with a human in the loop.

The paper looks at options for removing some of these pitfalls, or at least to make them less dangerous. The research agenda pursued by the paper is one I like a lot, an agenda of mathematical innovation. The paper mathematically defines certain provable safety properties (uninfluencability and unriggability), and also explores how useful these might be.

Similar agendas of mathematical innovation can be found in the work of Everitt et al, for example in Agent Incentives: A Causal Perspective, and in my work, for example in AGI Agent Safety by Iteratively Improving the Utility Function. These also use causal influence diagrams in some way, and try to develop them in a way that is useful for defining and analyzing AGI safety. My personal intuition is that we need more of this type of work; this agenda is important to advancing the field.

The math in the paper

That being said: the bad news is that I believe that the mathematical route explored by Pitfalls of Learning a Reward Function Online is most likely a dead end. Understanding why is of course the interesting bit.

The main issue I will explore is: we have a mathematical property that we label with the natural language word 'uninfluencability'. But does this property actually produce the beneficial 'uninfluencability' effects we are after? Section 4 in the paper also explores this issue, and shows some problems; my main goal here is to add further insights.

My feeling is that 'uninfluencability', the mathematical property as defined, does not produce the effects I am after. To illustrate this, my best example is as follows. Take a reward function that measures the amount of smiling by the human teaching the agent, observed over the entire history. Take a reward function learning process which assumes (in its prior) that the probability of the choice for this reward function at the end of the history cannot be influenced by the actions taken by the agent during the history, so that, for example, the prior assigns this choice the same probability whatever actions the agent takes. This reward function learning process is unriggable. But the agent using this reward function learning process also has a major incentive to manipulate the human teacher into smiling, by injecting them with smile-inducing drugs, or whatever.
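A toy Python sketch of this example follows. It is not the paper's formal definition of unriggability: the action names, the smile counts, and the function `learned_reward_distribution` are all hypothetical, chosen only to show that a belief over reward functions can ignore the agent's actions while the reward function itself still rewards manipulation.

```python
# Hypothetical toy world: number of smiles observed after each agent action.
SMILES_AFTER = {"teach_normally": 3, "inject_smile_drug": 10}

def reward_smile(action):
    """Reward function that counts the teacher's smiles over the history."""
    return SMILES_AFTER[action]

REWARD_FUNCTIONS = {"smile": reward_smile}

def learned_reward_distribution(action):
    """Toy learning process: the belief over reward functions ignores the agent's action."""
    return {"smile": 1.0}

def expected_reward(action):
    belief = learned_reward_distribution(action)
    return sum(p * REWARD_FUNCTIONS[name](action) for name, p in belief.items())

best_action = max(SMILES_AFTER, key=expected_reward)
print(best_action)  # inject_smile_drug
```

The learned belief is identical for both actions, yet the optimizer still picks the manipulative one, because the manipulation acts on what the reward function measures rather than on which reward function is chosen.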

So it seems to me that the choice taken in the paper to achieve the following design goal:

Ideally, we do not want the reward function to be a causal descendant of the policy.

is not taking us on a route that goes anywhere very promising, given the problem statement. The safety predicate of uninfluencability still allows for conditions that insert the mind of the human teacher directly into the path to value of a very powerful optimizer. To make the mathematical property of 'uninfluencability' do what it says on the tin, it seems to me that further constraints need to be added.

Some speculation: to go this route of adding constraints, I think we need a model that separates the mind state of the teacher, or at least some causal dependents of this mind state, more explicitly from the remainder of the agent environment. There are several such increased-separation causal models in Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective and in Counterfactual planning. This then brings us back on the path of using the math of indifference, or lack of causal incentives, to define safety properties.

Secondary remarks

Here are some further secondary remarks.

With the above remarks, I do not mean to imply that the uninfluencability safety property as defined lacks any value: I may still want to have this as a desirable safety property in an agent. But if it were present, this triggers a new concern: if the environment is such that the reward function is clearly influencable, any learning system prior which is incompatible with that assumption may be making some pretty strange assumptions about the environment. These might produce unsafe consequences, or just vast inefficiencies, in the behavior of the agent.

This theme could be explored more, but the paper does not do so, and I have also not done so. (I spent some time trying to come up with clarifying toy examples, but no example I constructed really clarified things for me.)

More general concern: the approach in the paper suffers somewhat from a methodological problem that I have seen more often in the AI and AGI safety literature. At this point in time, there is a tendency to frame every possible AI-related problem as a machine learning problem, and to frame any solution as being the design of an improved machine learning system. To me, this framing obfuscates the solution space. To make this more specific: the paper sets out to define useful constraints on the prior over the agent environment, but does not consider the step of first exploring constraints on the actual agent environment itself. To me, the more natural approach would be to first look for useful constraints on the actual environment, and only then to consider the option of projecting these into the prior as a backup option, when the actual environment happens to lack the constraints.

In my mind, the problem of an agent manipulating its teacher or supervisor to maximize its reward is not a problem of machine learning, but more fundamentally a problem of machine reasoning, or even more fundamentally a problem which is present in any game-theoretical setup where rewards are defined by a level of indirection. I talk more at length about these methodological points in my paper on counterfactual planning.

If I use this level-of-indirection framing to back-project the design in the paper, my first guess would be that 'uninfluencability' might possibly say something about the agent having no incentive to hack its own compute core in order to change the reward function encoded within. But I am not sure if that first guess would pan out.

Disentangling Corrigibility: 2015-2021

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence.

OK, I think I see what inspired your question.

If you want to give the math this kind of kabbalah treatment, you may also look at the math in [EFDH16], which produces agents similar to my definitions (4) (5), and also some variants that have different types of self-reflection. In the later paper here, Everitt et al. develop some diagrammatic models of this type of agent self-awareness, but the models are not full definitions of the agent.

For me, the main question I have about the math developed in the paper is how exactly I can map the model and the constraints (C1-3) back to things I can or should build in physical reality.

There is a thing going on here (when developing agent models, especially when treating AGI/superintelligence and embeddedness) that also often happens in post-Newtonian physics. The equations work, but if we attempt to map these equations to some prior intuitive mental model we have about how reality or decision making must necessarily work, we have to conclude that this attempt raises some strange and troubling questions.

I'm with modern physics here (I used to be an experimental physicist for a while), where the (mainstream) response to this is that 'the math works, your intuitive feelings about how X must necessarily work are wrong, you will get used to it eventually'.

BTW, I offer some additional interpretation of a difficult-to-interpret part of the math in section 10 of my 2020 paper here.

How does your math interact with quantilization?

You could insert quantilization in several ways in the model. The most obvious way is to change the basic definition (4). You might also define a transformation that takes any reward function and returns a quantilized reward function. This gives you a different type of quantilization, but I feel it would be in the same spirit.

In a more general sense, I do not feel that quantilization can produce the kind of corrigibility I am after in the paper. The effects you get on the agent by changing the reward function, by adding a balancing term to it, are not the same effects produced by quantilization.
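For readers unfamiliar with the term, here is a minimal Python sketch of a generic quantilizer: instead of picking the single best action, it samples from the top fraction `q` of a base distribution, ranked by utility. This is the standard idea in general form, not the modified definition (4) from the paper; the function names, the base distribution, and the scores are hypothetical.

```python
import random

def top_quantile(base_prob, utility, q):
    """Return the base-probability mass kept in the top-q fraction, ranked by utility."""
    kept, mass = {}, 0.0
    for action in sorted(base_prob, key=utility, reverse=True):
        remaining = q - mass
        if remaining <= 1e-12:
            break
        kept[action] = min(base_prob[action], remaining)
        mass += kept[action]
    return kept

def quantilize(base_prob, utility, q, rng):
    """Sample one action from the base distribution restricted to its top-q fraction."""
    kept = top_quantile(base_prob, utility, q)
    actions = list(kept)
    return rng.choices(actions, weights=[kept[a] for a in actions], k=1)[0]

# Hypothetical base distribution over four actions, with made-up utilities.
base = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
score = {"a": 4, "b": 3, "c": 2, "d": 1}

print(top_quantile(base, score.get, 0.5))  # {'a': 0.25, 'b': 0.25}
print(quantilize(base, score.get, 0.5, random.Random(0)) in {"a", "b"})  # True
```

Note that the sketch changes how an action is selected given a utility, whereas a balancing term changes the utility itself, which is one way to see why the two produce different effects.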

Disentangling Corrigibility: 2015-2021

My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!

OK, that clarifies your stance. You feeling weird definitely created a weird vibe in the narrative structure of your comment, a vibe that I picked up on.

(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn't totally off-base, and then spent the rest of the time composing a reply.)

You writing it quickly in half an hour also explains a lot about how it reads.

it's returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility.

I guess we have established by now that the paper is not about your version of intuitive-corrigibility.

For my analysis of intuitive-corrigibility, see the contents of the post above. My analysis is that intuitions on corrigibility are highly diverse, and have gotten even more diverse and divergent over time.

You interpret the abstract as follows:

You aren't just saying "I'll prove that this AI design leads to such-and-such formal property", but (lightly rephrasing the above): "This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.

Yes that is what I am saying in the abstract. Your light rephrasing where you add [significantly increases the probability that] indeed expresses the message I intended to convey.

[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes."

The phrasing 'works as intended' in the abstract is supposed to indicate clearly that the layer is designed to produce specific such-and-such formal corrigibility properties only, not some broad idea of 'intuitive corrigibility'.

So I am guessing you did not pick up on that when reading the abstract.

OK, moving away from a discussion about abstracts, initial impressions, feelings and their causes, moving towards a discussion of more technical stuff:

But if the 'off-switch' is only a binary sensory modality (there's a channel that says '0' or '1' at each time step), then how do you have AIXI pick out 'the set of worlds in which humans are pressing the button' versus 'the set of worlds in which a rock fell on the button'?

In the paper I don't try to make the agent's world model distinguish between 'humans are pressing the button' versus 'the set of worlds in which a rock fell on the button'. The 'works as intended' is that any button press for whatever reason is supposed to stop the agent.

So this is a design tradeoff I am making. Specifically, this is driven by a lack of trust that any ML-based 'human versus rock' discrimination system will work reliably enough in every possible future emergency. The design tradeoff is to accept the risk that a stray rock will stop the agent accidentally, potentially causing some level of inconvenience.

Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

This is not going to happen for AGI, since we might not survive testing on-distribution,

You are saying we might not survive AGI. This kind of observation is almost a tautology, and therefore not very interesting.

What interests me is identifying techniques that can increase the chance of survival. Drive down the risk. Admit that zero risk is impossible and move on. There are plenty of risk-reduction techniques out there if you look.

and how would we ensure that the environment "stays on-distribution"? Is that like, pausing the world forever?

Here is a paperclip optimizer example of 'ensuring that the agent environment never goes off-distribution'. Your design goals are to 1) keep your paperclip optimizing AGI agent inside your paperclip factory, and also 2) to make sure that crazy people and/or radically new technologies do not manage to get into the factory.

In other words, keep the environment inside the factory on-distribution, and keep the agent in the factory.

Yes, you will never perfectly achieve these two goals with 100% certainty. Your goal is to drive down the risk of failure on both of them.

For goal 1), there are many techniques that come to mind, and you probably want to use them all: 1a) physical barriers that the agent cannot breach with the resources it has at hand, 1b) active agent oversight, where oversight has a kill switch and is not afraid to use it, 1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory, and 1d) automatic safety interlocks.
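As a rough illustration of why stacking these techniques drives the risk down, here is a small Python calculation. It rests on a strong assumption that is mine and not part of the argument above: that the layers fail independently of each other. The failure probabilities are made-up numbers, and correlated failures would make the real figure worse.

```python
from math import prod

def escape_probability(layer_failure_probs):
    """Chance that every layer fails at once, ASSUMING the layers fail independently."""
    return prod(layer_failure_probs)

# Hypothetical per-layer failure probabilities for measures 1a) to 1d).
layers = {
    "1a physical barriers": 0.1,
    "1b active oversight": 0.2,
    "1c agent design measures": 0.3,
    "1d safety interlocks": 0.1,
}

one_layer = escape_probability([layers["1a physical barriers"]])
all_layers = escape_probability(layers.values())
print(f"{one_layer:.4f}")   # 0.1000
print(f"{all_layers:.4f}")  # 0.0006
```

Under these assumptions the risk never reaches zero, but each added layer multiplies it down.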

I spent some time in my professional life doing safety engineering for physical systems and cyber-physical systems, and this is how safety gets done there. Progress is often expensive, often annoying to certain stakeholders, and not always philosophically very satisfying. But you cannot have everything.

Disentangling Corrigibility: 2015-2021

Could you change the definition of to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure is large and small after the button press.

In general, if you forcefully change the agent's reward function into some new reward function, it will self-preserve from that moment on and try to maintain this new reward function, so it won't self-edit its reward function back into the original form.

There are exceptions to this general rule, for special versions of and special versions of agent environments (see section 7.2), where you can get the agent to self-edit, but on first glance, your example above does not seem to be one.

If you remove the bits from the agent definition then you can get an agent that self-edits a lot, but without changing its fundamental goals. The proofs of 'without changing its fundamental goals' will get even longer and less readable than the current proofs in the paper, so that is why I did the privileging.
