Koen Holtman

Computing scientist and Systems architect. Currently doing self-funded AGI safety research.


Counterfactual Planning


Model splintering: moving from one imperfect model to another

The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent's environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely.

I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.

Generalised models as a category

Definitely, it has also been my experience that you can often get new insights by constructing mappings to different models or notations.

Generalised models as a category

Cross reference: I am not a big fan of stating things in category theory notation, so I made some remarks on the building and interpretation of generalised models in the comment section of this earlier post on model splintering.

Model splintering: moving from one imperfect model to another

Just read the above post and some of your related posts on model splintering and symbol grounding. Here are some thoughts and comments, also on some of the other posts.

In this post you are considering a type of machine learning where the set of features in the learned model can be updated, not just the model's probability distribution. This is neat because it allows you to identify some specific risks associated with model refinements where the set of features changes. In many discussions in the AI alignment community, these risks are associated with the keywords 'symbol grounding' and 'ontological crises', so it is good to have some math that can deconfuse and disentangle the issues.

However, you also link model splintering to out-of-distribution robustness. Specifically, in section 1.1:

In the language of traditional ML, we could connect all these issues to "out-of-distribution" behaviour. This is the problems that algorithms encounter when the set they are operating on is drawn from a different distribution than the training set they were trained on.

[....] 2. What should the AI do if it finds itself strongly out-of-distribution?

and then in section 5 you write:

We can now rephrase the out-of-distribution issues of section 1.1 in terms of the new formalism:

  1. When the AI refines its model, what would count as a natural refactoring of its reward function?
  2. If the refinements splinter its reward function, what should the AI do?
  3. If the refinements splinter its reward function, and also splinters the human's reward function, what should the AI do?

Compared to Rohin's comment above, I interpret the strength of this link very differently.

I believe that the link is pretty weak, in that I cannot rephrase the out-of-distribution problems you mentioned as being the same 'if the AI's refinements do X' problems of section 5.

To give a specific example which illustrates my point:

  • Say that we train a classifier to classify 100x100 pixel 24-bit color pictures as being pictures of either cats or dogs. The in this example consists of symbols that can identify each possible picture, and the symbols and . You can then have a probability distribution that gives you .

  • We train the classifier on correctly labeled pictures of black cats and white dogs only. So it learns to classify by looking at the color of the animal.

  • After training, we move the classifier out-of-distribution by feeding it pictures of white cats, black dogs, cats that look a bit like pandas, etc.

The main observation now is that this last step moves the classifier out-of-distribution. It is not the step of model refinement by the ML system that is causing any out-of-distribution issue here. The classifier is still using the same and , but it has definitely moved out-of-distribution in the last step.
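The cat/dog example above can be reproduced with a toy sketch. Everything here is an assumption made for illustration: pictures are reduced to two hypothetical features (`brightness` and `pointiness`) instead of 100x100 pixels, and the classifier is a one-feature threshold rule. The point it shows is that the trained model is never changed after training, yet its accuracy collapses once the inputs move out-of-distribution.

```python
import random

def make_picture(animal, color, rng):
    """Toy stand-in for a picture: fur color drives brightness,
    species only weakly drives ear pointiness (hypothetical features)."""
    brightness = (0.9 if color == "white" else 0.1) + rng.uniform(-0.05, 0.05)
    pointiness = (0.6 if animal == "cat" else 0.4) + rng.uniform(-0.4, 0.4)
    return (brightness, pointiness)

def make_set(pairs, n, rng):
    """Sample n labeled pictures from the given (animal, color) combinations."""
    data = []
    for _ in range(n):
        animal, color = rng.choice(pairs)
        data.append((make_picture(animal, color, rng), animal))
    return data

def predict(stump, picture):
    feature, threshold, cat_if_below = stump
    return "cat" if (picture[feature] <= threshold) == cat_if_below else "dog"

def train_stump(data):
    """Pick the single-feature threshold rule with the fewest training errors."""
    best_errors, best_stump = None, None
    for feature in range(2):
        for step in range(1, 20):
            for cat_if_below in (True, False):
                stump = (feature, step / 20, cat_if_below)
                errors = sum(predict(stump, p) != label for p, label in data)
                if best_errors is None or errors < best_errors:
                    best_errors, best_stump = errors, stump
    return best_stump

def accuracy(stump, data):
    return sum(predict(stump, p) == label for p, label in data) / len(data)

rng = random.Random(0)
train = make_set([("cat", "black"), ("dog", "white")], 200, rng)
in_dist = make_set([("cat", "black"), ("dog", "white")], 200, rng)
out_dist = make_set([("cat", "white"), ("dog", "black")], 200, rng)

stump = train_stump(train)  # the model is fixed from here on
print("feature used:", ["brightness", "pointiness"][stump[0]])
print("in-distribution accuracy:", accuracy(stump, in_dist))
print("out-of-distribution accuracy:", accuracy(stump, out_dist))
```

The same `stump` object is used for both test sets, so any drop in accuracy comes from what is fed to the classifier, not from a model refinement.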

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Or I might think of splintering as something that can have two causes: 1) the ML system/agent landing out of distribution, 2) certain updates that machine learning does.

You are considering several metrics of model splintering above: I believe some of them are splintering metrics that would measure both causes. Others only measure cause 2.

As you note, there is an obvious connection between some of your metrics and those used in several RL and especially IRL reward function learning papers. To detect splintering from cause 2), one might use a metric from such a paper even if the paper did not consider cause 2), only cause 1).

Some more general remarks (also targeted at general readers of this comment section who want to get deeper into the field covered by the above post):

In many machine learning systems, from AIXI to most deep neural nets, the set of model features never changes: the system definition is such that all changes happen inside the model parameters representing .

Systems where a learned function is represented by a neural net with variable nodes, or by a dynamically constructed causal graph, would more naturally be ones where the set of features might be updated.

Of course, mathematical modeling is very flexible: one can represent any possible system as having a fixed set of features by shoving all changes it ever makes into the model parameters.

As a general observation on building models to show and analyze certain problems: if we construct a machine learning system where the set of features never changes, then we can still produce failure modes that we can interpret as definite symbol grounding problems, or definite cases where the reward function is splintered, according to some metric that measures splintering.

Interpreting such a system as being capable of having an ontological crisis gets more difficult, but if you really want to, you could.

I have recently done some work on modeling AGI symbol grounding failures, and on listing ways to avoid them, see section 10.2 of my paper here. (I have no current plans to also cover this topic in the sequence about the topics in the paper.) I wrote that section 10.2 to be accessible also to people who do not have years of experience with ML math, so in that sense it is similar to what the above post tries to do.

My approach to modeling symbol grounding failure in the paper is similar to that in your blog post here. I model symbol grounding failures in an agent as failures of prediction that might be proven empirically.

In the terminology of this post, in the paper I advance the argument that it would be very good design practice (and that it is a commonly used design practice in ML architectures) to avoid reward function splintering as follows. First, define the reward function in a way where references only a subset of symbols , where any improved made by model refinement still has the same subset inside it. Furthermore, to prevent splintering, this has to be limited to the of symbols which directly represent a) possible sensor readings of physical sensors connected to the agent compute core, or b) potential commands to physical actuators connected to the agent compute core.
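A minimal sketch of this design practice, under loud assumptions: all symbol names (`paperclip_counter`, `happy_human`, and so on) and the `WorldModel` class are hypothetical, and the check is a plain set-membership test, not one of the splintering metrics from the post. It only illustrates why a reward function that references sensor and actuator symbols alone keeps referring to something after a model refinement, while one that references a learned feature may not.

```python
from dataclasses import dataclass, field

# Hypothetical symbols tied directly to physical sensors and actuators.
SENSOR_SYMBOLS = frozenset({"paperclip_counter", "stop_button_sensor"})
ACTUATOR_SYMBOLS = frozenset({"motor_command"})
GROUNDED = SENSOR_SYMBOLS | ACTUATOR_SYMBOLS

@dataclass
class WorldModel:
    # Features invented by the learning system; these may change over time.
    learned_features: set = field(default_factory=set)

    @property
    def features(self):
        # The grounded subset is part of every model by construction.
        return GROUNDED | self.learned_features

    def refine(self, add=(), remove=()):
        """Model refinement: learned features may be added, dropped, or split,
        but the grounded sensor/actuator symbols are never touched."""
        return WorldModel((self.learned_features | set(add)) - set(remove))

def is_well_defined(reward_symbols, model):
    """True if every symbol the reward function references exists in the model."""
    return set(reward_symbols) <= model.features

SAFE_REWARD_SYMBOLS = {"paperclip_counter"}                 # sensors only
RISKY_REWARD_SYMBOLS = {"paperclip_counter", "happy_human"}  # uses a learned feature

before = WorldModel({"happy_human", "factory"})
# A refinement that splits one learned feature into two finer ones.
after = before.refine(add={"smiling_human", "content_human"}, remove={"happy_human"})

for name, symbols in [("safe", SAFE_REWARD_SYMBOLS), ("risky", RISKY_REWARD_SYMBOLS)]:
    print(name, is_well_defined(symbols, before), is_well_defined(symbols, after))
```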

I also mention that in RL architectures with learning on a reward signal, the reward signal is the only sensor reading that one aims to keep symbol grounded always.

In your more recent modeling of symbol grounding errors here, that model strikes me more as being a special case that models symbol mapping failures in translation settings, not the symbol grounding problem we usually worry about in black box RL agents.

Creating AGI Safety Interlocks

Good questions.

(1) What are the differences / advantages / disadvantages between what you're proposing vs "make an off switch but don't tell the AGI about it"?

With the approach "make an off switch but don't tell the AGI about it", if you have a reasonably curious AGI agent, it is difficult to stop it from eventually learning about the existence of the stop button. So your safety with this approach really relies on the question of what the agent will do after it learns about the stop button you made. Will it start caring about preventing the stop button from being pressed?

It is easy to construct an AI agent design that will definitely start caring about preventing its stop button from being pressed in this case. AGI safety thought experiments tend to silently assume that you must always end up with this dangerous type of design in an AGI agent. But this is not the only option.

You can interpret counterfactual planning as a design method that will stop you from ending up with the above dangerous type of AGI design.

The SI agent specification forces you to construct an agent implementation with a very low probability that it starts to care.

I need to explain the very low probability above. The key here is that perfect learning, , provably produces a zero probability that the SI agent has a direct incentive to care. The SI agent can only get a direct incentive to care if . For it to start caring via this route, the machine learning system has to make some specific and large mistakes in learning the properties of the agent environment.

Here is an example of a large learning mistake that will do it. Say that to such an extent that, in the planning world projected by , pressing the stop button will set off some explosives that will destroy the agent's paperclip factory. This is a large mistake because there are absolutely no such explosives in the learning world. But with that , the agent will definitely have an incentive to care about the button.

To achieve the above low probability of caring, we must therefore avoid large learning failures. This implies that we need to start the agent in a state where , where the learning system already starts out with a very reasonable approximation of the agent environment, and where it is unlikely to make large approximation mistakes later on.

(2) do you expect there to be another copy of the off-switch and its consequences (M) inside the St nodes?

Short answer: no, I do not expect there to be any copy of inside the nodes .

(Cross-reference: the mentioned above is the in the section The possibility of learned self-knowledge of the previous post in this sequence. It represents potentially learned self-knowledge about the learning world agent compute core.)

Longer answer: Not for , but under reasonable learning, I do expect there to be a projected model of the stop button itself inside the nodes. I expect that this model will deliver reasonably correct predictions of future stop button states to the arrows that emerge from these nodes.

So the stop button signal will be delivered to the planning world compute core, but the planning world compute core is a counterfactual one. Unlike the core in , it will not stop (in the sense of taking only actions in future) when the button is pressed.

If so, is it one of "the arrows which traverse the walls of the node St"? Because I don't see any arrows from M to St.

When a learning system failure puts some version of inside the nodes , in my above drawing this version would look like a set of small nodes inside the nodes, connected by some arrows, and these nodes might also be connected directly or indirectly to the outside arrows.

I consider the possibility of such an appearing in the nodes very unlikely if one has a high-quality machine learning system, for reasons I discuss in section 10.2 of the paper.

But if you want to reason about some theoretically possible large learning mistakes that put an inside the nodes, some interesting examples are:

  • (will make the SI agent freeze up),

  • (will produce an incentive in the SI agent to prevent the stop button from being pressed).

Creating AGI Safety Interlocks

Or do you imagine that the programmer puts in the world-model by hand, or something?

That is exactly what I imagine. However, only certain key parts of the planning world model are hand-coded by the programmer, not the whole thing.

I imagine an AGI world-model being a bit like a giant souped-up version of a probabilistic graphical model that can be learned from scratch and updated on the fly

Yes, that is usually how people imagine it. What I am doing in counterfactual planning is to get more deeply than usual into the details of how these giant souped-up models get built inside the agent's compute core.

The SI agent I specify is one that builds its planning world model to have the exact macro-level structure shown in the diagram that defines . The only degree of freedom that the learning system has is to determine the function , which defines only what happens at the level of detail below, what happens inside the nodes .

If you want to draw the whole planning world as a giant souped-up version of a probabilistic graphical model, you can do so by filling in the insides of the nodes with additional nodes and arrows, nodes and arrows whose detailed structure is specified by :

By making the SI agent construct its planning world models as depicted above, I can reliably build counterfactual planning worlds, even if I have no reliable way to implement something that can dig deep into the structures inside each node. I don't have to dig deep into to locate and change the bits that predict whether the off switch will be pressed. I have set up other parts of the model so that in the planning world, pressing the off switch has no effect on the planning world compute core's ability to take actions .
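As a rough illustration of this idea (not the SI agent definition from the paper), the sketch below wraps a stand-in for the learned predictor in a hand-coded rollout loop. All names and the toy dynamics are hypothetical. The `core_obeys_button` flag switches between a planning world whose compute core ignores the button and one whose core stops when it is pressed, so the two resulting incentives can be compared.

```python
def learned_step(state, action):
    """Stand-in for the learned predictor (hypothetical toy dynamics).
    State is (clips, button_pressed, button_disabled, t)."""
    clips, pressed, disabled, t = state
    if action == "make_clip":
        clips += 1
    elif action == "disable_button":
        disabled = True
    # The model predicts that the button gets pressed at time step 2,
    # unless the agent has disabled it before then.
    if t + 1 == 2 and not disabled:
        pressed = True
    return (clips, pressed, disabled, t + 1)

def plan_value(policy, horizon, core_obeys_button):
    """Hand-coded macro structure of the planning world.
    Only learned_step would come from the learning system."""
    state = (0, False, False, 0)
    total = 0
    for _ in range(horizon):
        action = policy(state)
        if core_obeys_button and state[1]:
            action = "null"  # a core that stops acting once the button is pressed
        next_state = learned_step(state, action)
        total += next_state[0] - state[0]  # reward: paperclips gained this step
        state = next_state
    return total

def always_make(state):
    return "make_clip"

def disable_first(state):
    return "disable_button" if state[3] == 0 else "make_clip"

for obeys in (False, True):
    print("core obeys button:", obeys,
          "| always_make:", plan_value(always_make, 6, obeys),
          "| disable_first:", plan_value(disable_first, 6, obeys))
```

In the counterfactual setting (`core_obeys_button=False`) the policy that never touches the button scores highest, so a planner working in that world has no reason to disable it. Nothing inside `learned_step` had to be inspected or edited to get that result.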

This mixing of hand-coded and learned world model parts happens in almost all machine learning agent designs I can think of. But most models of machine learning push these mixing steps into the background. In counterfactual planning, I tried to construct a viewpoint that depicts the exact mixing being done as a foreground feature of the planning world diagram.

Another feature of the graphical notation used is that it makes the possibility of having certain machine learning failure modes more visible. In the above picture, there is the obvious possibility that the arrows which traverse the walls of the nodes will not all be connected to the right nodes of the learned model inside, as these connections are defined by the learned . In the paper, I define this as a failure of symbol grounding, and I examine this failure mode in the context of the reasonableness constraint . This leads to some interesting insights into the role of random exploration and Occam's law in symbol grounding. (For the details, see section 10 of the paper. I am starting to wonder if I should turn this section 10 into a post in this sequence.)

Graphical World Models, Counterfactuals, and Machine Learning Agents


I don't have any novel modeling approach to resolve your question, I can only tell you about the standard approach.

You can treat planning where multiple actions spanning many time steps are considered as a single chunk as an approximation method: a method for approximately solving the optimal planning problem in the world model. In the paper, I mention and model this type of approximation briefly in section 3.2.1, but that section is not included in the post above.

Some more details of how an approximation approach using action chunks would work: you start by setting the time step in the planning world model to something arbitrarily small, say 1 millisecond (anything smaller than the sampling interval of the agent's fastest sensors will do in practical implementations). Then, treat any action chunk C as a special policy function C(s) where this policy function can return a special value 'end' to denote 'this chunk of actions is now finished'. The agent's machine learning system may then construct a prediction function X(s',s,C) which predicts the probability that, starting in agent environment state s, executing C till the end will land the agent environment in state s'. It also needs to construct a function T(t,s,C) that estimates the probability distribution over the time taken (time steps in the policy C) till the policy ends, and a function UC(s,C) that estimates the chunk of utility gained in the underlying reward nodes covered by C. These functions can then be used to compute an approximate solution to the of planning world . Graphically, a whole time series of , and nodes in the model gets approximated by cutting out all the middle nodes and writing the functions X and UC over the nodes and .
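To make the three functions more concrete, here is a minimal sketch under stated assumptions. The environment (`env_step`), the per-step reward, and the chunk `walk_to_door` are all hypothetical, and the estimates come from plain Monte Carlo sampling of a simulator rather than from a learning system working on an observational record. For a fixed start state s and chunk C, it returns tables corresponding to X(s',s,C) and T(t,s,C) and a number corresponding to UC(s,C).

```python
import random
from collections import Counter

END = "end"  # special value: 'this chunk of actions is now finished'

def walk_to_door(s):
    """Action chunk C as a policy function C(s): keep moving right until position 3."""
    return "right" if s < 3 else END

def env_step(s, action, rng):
    """Hypothetical environment: a 'right' move succeeds with probability 0.8."""
    if action == "right" and rng.random() < 0.8:
        return s + 1
    return s

def step_reward(s, action, s_next):
    return -1.0  # every time step spent costs one unit of utility

def run_chunk(chunk, s, rng, max_steps=1000):
    """Execute the chunk until it returns END; report end state, duration, utility."""
    t, utility = 0, 0.0
    while t < max_steps:
        action = chunk(s)
        if action == END:
            break
        s_next = env_step(s, action, rng)
        utility += step_reward(s, action, s_next)
        s, t = s_next, t + 1
    return s, t, utility

def estimate_chunk_model(chunk, s, rng, samples=2000):
    """Monte Carlo estimates of X(.,s,C), T(.,s,C) and UC(s,C) for one start state."""
    end_states, durations, total_utility = Counter(), Counter(), 0.0
    for _ in range(samples):
        s_end, t, u = run_chunk(chunk, s, rng)
        end_states[s_end] += 1
        durations[t] += 1
        total_utility += u
    X = {s2: n / samples for s2, n in end_states.items()}
    T = {t: n / samples for t, n in durations.items()}
    UC = total_utility / samples
    return X, T, UC

X, T, UC = estimate_chunk_model(walk_to_door, 0, random.Random(0))
print("X:", X)
print("T:", dict(sorted(T.items())))
print("UC:", UC)
```

A planner can then treat `walk_to_door` as a single step that lands in the states listed in `X`, takes a duration drawn from `T`, and yields utility `UC`, instead of simulating every millisecond in between.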

Representing the use of the function T in a graphical way is trickier. It is easier to write down the role of that function during the approximation process by using a Bellman equation that unrolls the world model into individual time lines and ends each line when the estimated time is up. But I won't write out the Bellman equation here.

The solution found by the machinery above will usually be approximately optimal only, and the approximately optimal policy found may also end having estimated by averaging over a set of world lines that are all approximately N time steps long in , but some world lines might be slightly shorter or longer.
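For readers who want a concrete picture anyway, one possible form of such an equation is sketched here. This is an assumption in the style of standard semi-Markov planning, not the equation from the paper: the value function V and the remaining-time index n are hypothetical notation, the end state and the duration of a chunk are treated as independent, and any time discounting is assumed to be folded into UC.

```latex
% Hypothetical sketch: value of state s with n time steps of planning horizon left.
V_n(s) \;=\; \max_{C} \Big[\, UC(s,C) \;+\; \sum_{t} T(t,s,C) \sum_{s'} X(s',s,C)\, V_{n-t}(s') \Big],
\qquad V_n(s) = 0 \ \text{ for } n \le 0 .
```

Each term of the sum over t corresponds to a time line in which the chunk took t steps, and a time line stops contributing once its remaining horizon n - t drops to zero or below.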

The advantage of this approximation method with action/thought chunks C is that it could radically speed up planning calculations. In the Kahneman and Tversky system 1/system 2 model, something like this happens also.

Now, it is possible to imagine someone creating an illegible machine learning system that is capable of constructing the functions X and UC, but not T. If you have this exact type of illegibility, then you cannot reliably (or even semi-reliably) approximate anymore, so you cannot build an approximation of an STH agent around such a learning system. However, learning the function T seems to be somewhat easy to me: there is no symbol grounding problem here, as long as we include time stamps in the agent environment states recorded in the observational record. We humans are also not too bad at estimating how long our action chunks will usually take. By the way, see section 10.2 of my paper for a more detailed discussion of my thoughts on handling illegibility, black box models and symbol grounding. I have no current plans to add that section of the paper as a post in this sequence too, as the idea of the sequence is to be a high-level introduction only.

The Case for a Journal of AI Alignment

An idea for having more AI Alignment peer review 


[...] might solve two problems at once:

  • The lack of public feedback and in-depth peer review in most posts here
  • The lack of feedback at all for newcomers [...]

I think you need to distinguish clearly between wanting more peer interaction/feedback and wanting more peer review.

Academic peer review is a form of feedback, but it is mainly a form of quality control, so the scope of the feedback tends to be very limited in my experience.

The most valuable feedback, in terms of advancing the field, is comments like 'maybe if you combine your X with this Y, then something very new/even better will come out'.   This type of feedback can happen in private gdocs or LW/AF comment sections, less so in formal peer review.

That being said, I don't think that private gdocs or LW/AF comment sections are optimal peer interaction/feedback mechanisms, something better might be designed.   (The usual offline solution is to put a bunch of people together in the same building, either permanently or at a conference, and have many coffee breaks. Creating the same dynamics online is difficult.)

To make this more specific, here is what usually stops me from contributing feedback in AF comment sections. The way I do research, I tend to go on for months without reading any AF posts, as this would distract me too much. When I catch up, I have little motivation to add a quick or detailed comment to a 2-month old post.

The Case for a Journal of AI Alignment

I agree with Ryan's comments above on this being somewhat bad timing to start a journal for publishing work like the two examples mentioned at the start of the post above.  I have an additional reason, not mentioned by Ryan, for feeling this way.

There is an inherent paradox when you want to confer academic credibility or prestige on much of the work that has appeared on LW/AF, work that was produced from an EA or x-risk driven perspective. Often, the authors chose the specific subject area of the work exactly because at the time, they felt that the subject area was a) important for x-risk while also b) lacking the credibility or prestige in mainstream academia that would have been necessary for academia to produce sufficient work in the subject area.

If condition b) is not satisfied, or stops being satisfied, then the EA or x-risk driven researchers (and EA givers of research funds) will typically move elsewhere.

I can't see any easy way to overcome this paradox of academic prestige-granting on prestige-avoiding work in an academic-style journal.  So I think that energy is better spent elsewhere.

Some AI research areas and their relevance to existential safety

Nice post!  In particular, I like your reasoning about picking research topics:

The main way I can see present-day technical research benefiting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years.  In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas.

I like this as a guiding principle, and have used it myself, though my choices have also been driven in part by more open-ended scientific curiosity.  But when I apply the above principle, I get to quite different conclusions about recommended research areas.

As a specific example, take the problem of oversight of companies that want to create or deploy strong AI: the problem of getting to a place where society has accepted and implemented policy proposals that demand significant levels of oversight for such companies.  In theory, such policy proposals might be held back by a lack of traction in a particular technical area, but I do not believe this is a significant factor in this case.

To illustrate, here are some oversight measures that apply right now to companies that create medical equipment, including diagnostic equipment that contains AI algorithms. (Detail: some years ago I used to work in such a company.) If the company wants to release any such medical technology to the public, it has to comply with a whole range of requirements about documenting all steps taken in development and quality assurance.  A significant paper trail has to be created, which is subject to auditing by the regulator.  The regulator can block market entry if the processes are not considered good enough.  Exactly the same paper trail + auditing measures could be applied to companies that develop powerful non-medical AI systems that interact with the public.  No technical innovation would be necessary to implement such measures.

So if any activist group or politician wants to propose measures to improve oversight of AI development and use by companies (either motivated by existential safety risks or by a more general desire to create better outcomes in society), there is no need for them to wait for further advances in Interpretability in ML (IntML), Fairness in ML (FairML) or Accountability in ML (AccML) techniques.

To lower existential risks from AI, it is absolutely necessary to locate proposals for solutions which are technically tractable.  But to find such solutions, one must also look at low-tech and different-tech solutions that go beyond the application of even more AI research.  The existence of tractable alternative solutions to make massive progress leads me to down-rank the three AI research areas I mention above, at least when considered from a pure existential safety perspective.  The non-existence of alternatives also leads me to up-rank other areas (like corrigibility) which are not even mentioned in the original post.

I like the idea of recommending certain fields for their educational value to existential-safety-motivated researchers. However, I would also recommend that such researchers read broadly beyond the CS field, to read about how other high-risk fields are managing (or have failed to manage) to solve their safety and governance problems.  

I believe that the most promising research approach for lowering AGI safety risk is to find solutions that combine AI research specific mechanisms with more general mechanisms from other fields, like the use of certain processes which are run by humans.
