Koen Holtman

Computing scientist and Systems architect. Currently doing self-funded AGI safety research.

Sequences

Counterfactual Planning

Comments

Disentangling Corrigibility: 2015-2021

In category theory, one learns that good math is like kabbalah, where nothing is a coincidence.

OK, I think I see what inspired your question.

If you want to give the math this kind of kabbalah treatment, you may also look at the math in [EFDH16], which produces agents similar to my definitions (4) and (5), and also some variants that have different types of self-reflection. In the later paper here, Everitt et al. develop some diagrammatic models of this type of agent self-awareness, but the models are not full definitions of the agent.

For me, the main question I have about the math developed in the paper is how exactly I can map the model and the constraints (C1-C3) back to things I can or should build in physical reality.

There is a thing going on here (when developing agent models, especially when treating AGI/superintelligence and embeddedness) that also often happens in post-Newtonian physics. The equations work, but if we attempt to map these equations to some prior intuitive mental model we have about how reality or decision making must necessarily work, we have to conclude that this attempt raises some strange and troubling questions.

I'm with modern physics here (I used to be an experimental physicist for a while), where the (mainstream) response to this is that 'the math works, your intuitive feelings about how X must necessarily work are wrong, you will get used to it eventually'.

BTW, I offer some additional interpretation of a difficult-to-interpret part of the math in section 10 of my 2020 paper here.

How does your math interact with quantilization?

You could insert quantilization into the model in several ways. The most obvious way is to change the basic definition (4). You might also define a transformation that takes any reward function and returns a quantilized reward function; this gives you a different type of quantilization, but I feel it would be in the same spirit.

In a more general sense, I do not feel that quantilization can produce the kind of corrigibility I am after in the paper. The effects you get on the agent by changing its reward function, by adding a balancing term to it, are not the same effects produced by quantilization.
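To make the contrast concrete, here is a minimal sketch in generic notation of my own choosing (not the notation of the paper): a q-quantilizing policy in the sense of Taylor's quantilizers, next to a reward function modified with a balancing term.

    % Quantilization changes how the action is selected, for a fixed reward function:
    \pi_q(a \mid s) \propto
      \begin{cases}
        \gamma(a \mid s) & \text{if } \mathbb{E}[U \mid s,a] \text{ is in the top } q\text{-quantile under the base distribution } \gamma, \\
        0 & \text{otherwise.}
      \end{cases}

    % The balancing-term approach keeps full maximization, but edits the reward function:
    R'(s,a) = R(s,a) + B(s,a), \quad \text{where } B \text{ is the balancing term.}

The first modification changes which action gets selected for a fixed reward function; the second changes the reward function itself while keeping maximization. This difference is why I do not expect the two to produce the same effects.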

Disentangling Corrigibility: 2015-2021

My comment was primarily judging your abstract and why it made me feel weird/hesitant to read the paper. The abstract is short, but it is important to optimize so that your hard work gets the proper attention!

OK, that clarifies your stance. Your feeling weird definitely created a weird vibe in the narrative structure of your comment, a vibe that I picked up on.

(I had about half an hour at the time; I read about 6 pages of your paper to make sure I wasn't totally off-base, and then spent the rest of the time composing a reply.)

Your writing it quickly in half an hour also explains a lot about how it reads.

it's returning to my initial reactions as I read the abstract, which is that this paper is about intuitive-corrigibility.

I guess we have established by now that the paper is not about your version of intuitive-corrigibility.

For my analysis of intuitive-corrigibility, see the contents of the post above. My analysis is that intuitions on corrigibility are highly diverse, and have gotten even more diverse and divergent over time.

You interpret the abstract as follows:

You aren't just saying "I'll prove that this AI design leads to such-and-such formal property", but (lightly rephrasing the above): "This paper shows how to construct a safety layer that [significantly increases the probability that] arbitrarily advanced utility maximizing agents [will not] resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started.

Yes that is what I am saying in the abstract. Your light rephrasing where you add [significantly increases the probability that] indeed expresses the message I intended to convey.

[I] prove that the corrigibility layer works as intended in a large set of non-hostile universes."

The phrasing 'works as intended' in the abstract is supposed to indicate clearly that the layer is designed to produce specific such-and-such formal corrigibility properties only, not some broad idea of 'intuitive corrigibility'.

So I am guessing you did not pick up on that when reading the abstract.

OK, moving away from a discussion about abstracts, initial impressions, feelings and their causes, moving towards a discussion of more technical stuff:

But if the 'off-switch' is only a binary sensory modality (there's a channel that says '0' or '1' at each time step), then how do you have AIXI pick out 'the set of worlds in which humans are pressing the button' versus 'the set of worlds in which a rock fell on the button'?

In the paper I don't try to make the agent's world model distinguish between 'humans are pressing the button' and 'a rock fell on the button'. The 'works as intended' here is that any button press, for whatever reason, is supposed to stop the agent.

So this is a design tradeoff I am making. Specifically, this is driven by a lack of trust that any ML-based 'human versus rock' discrimination system will work reliably enough in every possible future emergency. The design tradeoff is to accept the risk that a stray rock will stop the agent accidentally, potentially causing some level of inconvenience.

Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

This is not going to happen for AGI, since we might not survive testing on-distribution,

You are saying we might not survive AGI. This kind of observation is almost a tautology, and therefore not very interesting.

What interests me is identifying techniques that can increase the chance of survival. Drive down the risk. Admit that zero risk is impossible and move on. There are plenty of risk-reduction techniques out there if you look.

and how would we ensure that the environment "stays on-distribution"? Is that like, pausing the world forever?

Here is a paperclip optimizer example of 'ensuring that the agent environment never goes off-distribution'. Your design goals are to 1) keep your paperclip-optimizing AGI agent inside your paperclip factory, and 2) make sure that crazy people and/or radically new technologies do not manage to get into the factory.

In other words, keep the environment inside the factory on-distribution, and keep the agent in the factory.

Yes, you will never perfectly achieve these two goals with 100% certainty. Your goal is to drive down the risk of failure on both of them.

For goal 1), there are many techniques that come to mind, and you probably want to use them all:

  • 1a) physical barriers that the agent cannot breach with the resources it has at hand;

  • 1b) active agent oversight, where oversight has a kill switch and is not afraid to use it;

  • 1c) agent design measures (penalty terms in the reward function, use of counterfactual planning, certain training regimes, etc.) that robustly lower the chance of the agent ever wanting to get out of the factory;

  • 1d) automatic safety interlocks (see the illustrative sketch below).
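As a purely illustrative example of measure 1d), here is a minimal sketch of an automatic safety interlock wrapped around an agent's action loop. All names here (SafetyInterlock, stop_button_pressed, agent_inside_factory, and so on) are hypothetical and not taken from any of my papers; the point is only that the interlock runs outside the agent's learned machinery.

    # Illustrative sketch only: an automatic interlock that vetoes the agent's
    # actions and shuts it down when simple, hard-coded safety conditions trip.
    class SafetyInterlock:
        def __init__(self, agent, sensors, null_action):
            self.agent = agent            # the wrapped paperclip-optimizing agent
            self.sensors = sensors        # hypothetical interface to factory sensors
            self.null_action = null_action
            self.shut_down = False

        def step(self, observation):
            # Hard-coded checks that do not rely on the agent's world model.
            if self.sensors.stop_button_pressed() or not self.sensors.agent_inside_factory():
                self.shut_down = True
            if self.shut_down:
                return self.null_action   # from now on, do nothing
            return self.agent.act(observation)

In a real system you would of course combine an interlock like this with measures 1a) to 1c); the interlock is a last line of defense, not the first.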

I spent some time in my professional life doing safety engineering for physical systems and cyber-physical systems, and this is how safety gets done there. Progress is often expensive, often annoying to certain stakeholders, and not always philosophically very satisfying. But you cannot have everything.

Disentangling Corrigibility: 2015-2021

Could you change the definition of to , and give the agent the ability to self-modify arbitrarily? The idea is that it would edit itself into its original form in order to make sure is large and small after the button press.

In general, if you would forcefully change the agent's reward function into some new reward function, it will self-preserve from that moment on and try to maintain this new reward function, so it won't self-edit its reward function back into the original form.

There are exceptions to this general rule, for special versions of the reward function and special versions of agent environments (see section 7.2), where you can get the agent to self-edit, but at first glance, your example above does not seem to be one.

If you remove certain bits from the agent definition, then you can get an agent that self-edits a lot, but without changing its fundamental goals. The proofs of 'without changing its fundamental goals' would get even longer and less readable than the current proofs in the paper, so that is why I did the privileging.

Disentangling Corrigibility: 2015-2021

Thanks for expanding on your question about the use of that term. Unfortunately, I still have a hard time understanding your question, so I'll say a few things and hope that will clarify.

If you expand the term defined in (5) recursively, you get a tree-like structure. Each node in the tree has as many sub-nodes as there are elements in the set being summed over. The tree is in fact a tree of branching world lines. I hope this helps you visualize what is going on.

I could shuffle around some symbols and terms in the definitions (4) and (5) and still create a model of exactly the same agent that will behave in exactly the same way. So the exact way in which these two equations are written down and recurse on each other is somewhat contingent. My equations stay close to what is used when you model an agent or 'rational' decision making process with a Bellman equation. If your default mental model of an agent is a set of Q-learning equations, the model I develop will look strange, maybe even unnatural at first sight.
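For readers who want a concrete picture of the recursion: a generic Bellman-style value definition in standard notation (this is only an illustration, not the paper's exact definitions (4) and (5)) looks like

    V(s) = \max_{a} \sum_{s' \in S} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V(s') \bigr].

Expanding V(s') under the sum again gives one sub-node per element of S, and repeating the expansion produces exactly the kind of tree of branching world lines described above.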

or your theory is going to end up with the wrong prior.

OK, maybe this is the main point that inspired your question. The agency/world models developed in the paper are not a 'theory', in the sense that theories have predictive power. A mathematical model used as a theory, like F = ma, predicts how objects will accelerate when subjected to a force.

The agent model in the paper does not really 'predict' how agents will behave. The model is compatible with almost every possible agent construction and agent behavior, if we are allowed to pick the agent's reward function freely after observing or reverse-engineering the agent to be modeled.

On purpose, the agent model is constructed with so many 'free parameters' that it has no real predictive power. What you get here is an agent model that can describe almost every possible agent and world in which it could operate.

In mathematics, the technique I am using in the paper is sometimes called 'without loss of generality'. I am developing very general proofs by introducing constraining assumptions 'without loss of generality'.

Another thing to note is that the model of the agent in the paper, the model of an agent with the corrigibility-creating safety layer, acts as a specification of how to add this layer to any generic agent design.

This dual possible use of models, as theory or as specification, can be tricky if you are not used to it. In observation-based science, mathematical models are almost always theories only. In engineering (and in theoretical CS, the kind where you prove programs correct, which tends to be a niche part of CS nowadays) models often act as specifications. In statistics, the idea that statistical models act as theories tends to be de-emphasized. The paper uses models in the way they are used in theoretical CS.

You may want to take a look at this post in the sequence, which copies text from a 2021 paper where I tried to make the theory/specification use of models more accessible. If you read that post, it might be easier to fully track what is happening, in a mathematical sense, in my 2019 paper.

Disentangling Corrigibility: 2015-2021

OK, so we now have people who read this abstract and feel it makes objectionable 'very large claims' or 'big claims', people who feel the need to express their objections even before reading the full paper itself. Something vaguely interesting is going on.

I guess I have to speculate further about the root cause of why you are reading the abstract in a 'big claim' way, whereas I do not see 'big claim' when I read the abstract.

Utopian priors?

Specifically, neither of you is objecting to the actual contents of the paper; you are taking time to offer somewhat pre-emptive criticism based on a strong prior you have about what the contents of that paper will have to be.

Alex, you are even making rhetorical moves to maintain your strong prior in the face of potentially conflicting evidence:

That said, the rest of this comment addresses your paper as if it's proving claims about intuitive-corrigibility.

Curious. So here is some speculation.

In MIRI's writing and research agenda, and in some of the writing on this forum, there seems to be a utopian expectation that huge breakthroughs in mathematical modeling could be made, mixed up with a wish that they must be made. I am talking about breakthroughs that would allow us to use mathematics to construct AGI agents that will provably be

  • perfectly aligned

  • with zero residual safety risk

  • under all possible circumstances.

Suppose you have these utopian expectations about what AGI safety mathematics can do (or desperately must do, or else we are all dead soon). If you have these expectations of perfection, you can only be disappointed when you read actually existing mathematical papers with models and correctness proofs that depend on well-defined boundary conditions. I am seeing a lot of pre-emptive expression of disappointment here.

Alex: your somewhat extensive comments above seem to be developing and attacking the strawman expectation that you will be reading a paper that will

  • resolve all open problems in corrigibility perfectly,

  • not just corrigibility as the paper happens to define it, but corrigibility as you define it

  • while also resolving, or at least name-checking, all the open items on MIRI's research agenda

You express doubts that the paper will do any of this. Your doubts are reasonable:

So I think your paper says 'an agent is corrigible' when you mean 'an agent satisfies a formal property that might correspond to corrigible behavior in certain situations.'

What you think is broadly correct. The surprising thing that needs to be explained here is: why would you even expect to get anything different in a paper with this kind of abstract?

Structure of the paper: pretty conventional

My 2019 paper is a deeply mathematical work, but it proceeds in a fairly standard way for such mathematical work. Here is what happens:

  1. I introduce the term corrigibility by referencing the notion of corrigibility developed in the 2015 MIRI/FHI paper

  2. I define 6 mathematical properties which I call corrigibility desiderata. 5 of them are taken straight from the 2015 MIRI/FHI paper that introduced the term.

  3. I construct an agent and prove that it meets these 6 desiderata under certain well-defined boundary conditions. The abstract mentions an important boundary condition right from the start:

A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes.

The paper devotes a lot of space (it is 35 pages long!) to exploring and illustrating the matter of boundary conditions. This is one of the main themes of the paper. In the end, the proven results are not as utopian as one might conceivably hope for.

  4. What I also do in the paper is sometimes use the term 'corrigible' as a shorthand for 'provably meets the 6 defined corrigibility properties'. For example, I do that in the title of section 9.8.

You are right that the word 'corrigible' is used in the paper in both an informal (or intuitive) sense, and in a more formal sense where it is equated to these 6 properties only. This is a pretty standard thing to do in mathematical writing. It does rely on the assumption that the reader will not confuse the two different uses.

You propose a writing convention where 'POWER' always is the formal in-paper definition of power and 'power' is the 'intuitive' meaning of power, which puts less of a burden on the reader. Frankly, I feel that is a bit too much of a departure from what is normal in mathematical writing. (It depends a bit, I guess, on your intended audience.)

If people want to complain that the formal mathematical properties you named X do not correspond to their own intuitive notion of what the word X really means, then they are going to complain. Does not matter whether you use uppercase or not.

Now, back in 2019 when I wrote the paper, I was working under the assumption that when people in the AGI safety community read the word 'corrigibility', they would naturally map this word to the list of mathematical desiderata in the 2015 MIRI/FHI paper titled 'Corrigibility'. So I assumed that my use of the word corrigibility in the paper would not be that confusing or jarring to anybody.

I found out in late 2019 that the meaning of the 'intuitive' term corrigibility was much more contingent, and basically all over the place. See the 'Disentangling Corrigibility' post above, where I try to offer a map of this diverse landscape. As I mention there:

Personally, I have stopped trying to reverse linguistic entropy. In my recent technical papers, I have tried to avoid using the word corrigibility as much as possible.

But I am not going to update my 2019 paper to convert some words to uppercase.

On the 'bigness' of the mathematical claims

You write:

On p2, you write:

The main contribution of this paper is that it shows, and proves correct, the construction of a corrigibility safety layer that can be applied to utility maximizing AGI agents.

If this were true, I could give you AIXI, a utility function, and an environmental specification, and your method will guarantee it won't try to get in our way / prevent us from deactivating it, while also ensuring it does something non-trivial to optimize its goals? That is a big claim.

You seem to have trouble believing the 'if this were true'. The open question here is how strong a guarantee you are looking for when you say 'will guarantee' above.

If you are looking for absolute, rock-solid utopian 'provable safety' guarantees, where this method will reduce AGI risk to zero under all circumstances, then I have no such guarantees on offer.

If you are looking for techniques that will deliver weaker guarantees, of the kind where there is a low but non-zero residual risk of corrigibility failure when you wrap them around a well-tested AI or AGI-level ML system, then these are the kind of techniques that I have to offer.

If this were true it would be an absolute breakthrough

Again, you seem to be looking for the type of absolute breakthrough that delivers mathematically perfect safety always, even though we have fallible humans, potentially hostile universes that might contain unstoppable processes that will damage the agent, and agents that have to learn and act based on partial observation only. Sorry, I can't deliver on that kind of utopian programme of provable safety. Nobody can.

Still, I feel that the mathematical results in the paper are pretty big. They clarify and resolve several issues identified in the 2015 MIRI/FHI paper. They resolve some of these by saying 'you can never perfectly have this thing unless boundary condition X is met', but that is significant progress too.

On the topic of what happens to the proven results when I replace the agent that I make the proofs for with AIXI, see section 5.4 under learning agents. AIXI can make certain prediction mistakes that the agent I am making the proofs for cannot make by definition. These mistakes can have the result of lowering the effectiveness of the safety layer. I explore the topic in some more detail in later papers.

Stability under recursive self-improvement

You say:

I think you might be discussing corrigibility in the very narrow sense of "given a known environment and an agent with a known ontology, such that we can pick out a 'shutdown button pressed' event in the agent's world model, the agent will be indifferent to whether this button is pressed or not."

  1. We don't know how to robustly pick out things in the agent's world model, and I don't see that acknowledged in what I've read thus far.

First off, your claim that 'We don't know how to robustly pick out things in the agent's world model' is deeply misleading.

We know very well 'how to do this' for many types of agent world models. Robustly picking out simple binary input signals like stop buttons is routinely achieved in many (non-AGI) world models as used by today's actually existing AI agents, both hard-coded and learned world models, and there is no big mystery about how this is achieved.

Even with black-box learned world models, high levels of robustness can be achieved by a regime of testing on-distribution and then ensuring that the agent environment never goes off-distribution.

You seem to be looking for 'not very narrow sense' corrigibility solutions where we can get symbol grounding robustness even in scenarios where the AGI does recursive self-improvement, where it re-builds its entire reasoning system from the ground up, and where it then possibly undergoes an ontological crisis. The basic solution I have to offer for this scenario is very simple. Barring massive breakthroughs, don't build a system like that if you want to be safe.

The problem of formalizing humility

In another set of remarks you make, you refer to the web page Hard problem of corrigibility, where Eliezer speculates that to solve the problem of corrigibility, what we really want to formalize is not indifference but

something analogous to humility or philosophical uncertainty.

You say about this that

I don't even know how to begin formalizing that property, and so a priori I'd be quite surprised if that were done successfully all in one paper.

I fully share your stance here: I would not even know how to begin with 'humility or philosophical uncertainty', let alone end successfully.

In the paper I ignore this speculation about humility-based solution directions, and leverage and formalize the concept of 'indifference' instead. Sorry to disappoint if you were expecting major progress on the humility agenda advanced by Eliezer.

Superintelligence

Another issue is that you describe a "superintelligent" AGI simulator

Yeah, in the paper I explicitly defined the adjective 'superintelligent' in a somewhat provocative way: I defined it to mean 'maximally adapted to solving the problem of utility maximization in its universe'.

I know this is somewhat jarring to many people, but in this case it was fully intended to be jarring. It is supposed to make you stop and think...

(This grew into a very long response, and I do not feel I have necessarily addressed or resolved all of your concerns. If you want to move further conversation about the more technical details of my paper or of corrigibility to a video call, I'd be open to that.)

Disentangling Corrigibility: 2015-2021

First I've seen this paper, haven't had a chance to look at it yet, would be very surprised if it fulfilled the claims made in the abstract. Those are very large claims and you should not take them at face value without a lot of careful looking.

I wrote that paper and abstract back in 2019. Just re-read the abstract.

I am somewhat puzzled how you can read the abstract and feel that it makes 'very large claims' that would be 'very surprising' when fulfilled. I don't feel that the claims are that large or hard to believe.

Feel free to tell me more when you have read the paper. My more recent papers make somewhat similar claims about corrigibility results, but they use more accessible math.

Disentangling Corrigibility: 2015-2021

I like your "Corrigibility with Utility Preservation" paper.

Thanks!

I don't get why you prefer not using the usual conditional probability notation.

Well, I wrote in the paper (section 5) that I used that notation instead of the usual conditional probability notation because it 'fits better with the mathematical logic style used in the definitions and proofs below', i.e. the proofs use the mathematics of second-order logic, not probability theory.

However, this was not my only reason for this preference. The other reason was that I had an intuitive suspicion back in 2019 that the use of conditional probability notation, in the then-existing papers and web pages on balancing terms, acted as an impediment to mathematical progress. My suspicion was that it acted as an overly Bayesian framing that made it more difficult to clarify and generalize the mathematics of this technique any further.

In hindsight in 2021, I can be a bit more clear about my 2019 intuition. Armstrong's original balancing term elements, which condition on low-probability near-future events, can be usefully generalized (and simplified) into Pearlian expressions in which the conditioning terms become interventions (or 'edits') on the current world state.

The conditional probability notation makes it look like the balancing terms might have some deep connection to Bayesian updating or Bayesian philosophy, whereas I feel they do not have any such deep connection.
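To illustrate the distinction in generic notation (these are not the exact terms from Armstrong's write-ups or from my paper):

    \mathbb{E}[\, U \mid e \,] \quad \text{versus} \quad \mathbb{E}[\, U \mid \mathrm{do}(e) \,].

The first expression updates on the worlds in which the event e happens to occur; the second evaluates a forced intervention that edits the current world state so that e holds. My claim above is that the balancing terms are better thought of as objects of the second kind.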

That being said, in my 2020 paper I present a simplified version of the math in the 2019 paper, using the traditional conditional probability notation again, and without having to introduce that special notation.

leads to TurnTrout's attainable utility preservation.

Yes, it is very related: I explore that connection in more detail in section 12 of my 2020 paper. In general, I think that counterfactual expected-utility reward function terms are a Swiss army knife with many interesting uses. I feel that, as a community, we have not yet gotten to the bottom of their possibilities (and their possible failure modes).

Why not use in the definition of ?

In the definition in question (section 5.3, equation (4)) I am already using such a term, so I am not sure if I understand the question.

(I am running out of time now, will get back to the remaining questions in your comment later)

Disentangling Corrigibility: 2015-2021

Thanks a lot, all! I just edited the post above to change the language as suggested.

FWIW, Paul's post on corrigibility here was my primary source for the info that Robert Miles named the technical term. Nice to see the original suggestion as made on Facebook too.

My research methodology

Interesting... On first reading your post, I felt that your methodological approach for dealing with the 'all is doomed in the worst case' problem is essentially the same as my approach. But on re-reading, I am not so sure anymore. So I'll try to explore the possible differences in methodological outlook, and will end with a question.

The key to your methodology is that you list possible process steps which one might take when one feels like

all of our current algorithms are doomed in the worst case.

The specific doom-removing process step that I want to focus on is this one:

If so, I may add another assumption about the world that I think makes alignment possible (e.g. the strategy stealing assumption), and throw out any [failure] stories that violate that assumption [...]

My feeling is that the AGI safety/alignment community is way too reluctant to take this process step of 'add another assumption about the world' in order to eliminate a worst-case failure story.

There seem to be several underlying causes for this reluctance. One of them is that in the field of developing machine learning algorithms, in the narrow sense where machine learning equals function approximation, the default stance is to make no assumptions about the function that has to be approximated. But the main function to be approximated in the case of an ML agent is the function that determines the behavior of the agent environment. So the default methodological stance in ML is that we can introduce no assumptions whatsoever about the agent environment; we can't, for example, assume that it contains a powerful oversight body that will help to keep the agent aligned. Obviously, this stance is not very helpful if you want to make progress on certain alignment problems.

So I'm happy to see a post that encourages people to make explicit assumptions about the agent's environment. I have definitely used this technique to make progress in my own work.

But.

When I look at your example of 'the strategy stealing assumption' as one useful assumption to add, it is very much not the default example that would first come to my mind. So I am wondering if you would even recommend the approach of adding the kind of default assumptions that I tend to add.

To make this more specific, in this post I introduce an agent design with three safety interlocks that are supposed to be helpful to agent oversight. The interlocks are agent design refinements that make it easier for oversight to keep control over the agent. The interlocks contribute to more successful oversight not by making the oversight people+machines smarter (which is your main line of research I believe), but by making the agent less smart in very specific ways.

But at the same time, these interlocks do not remove all possible worst-case failure stories of doom. To quote from the post and the underlying paper:

We can always construct a highly unlikely but theoretically possible failure mode story where the agent ends up being unsafe.

The key here is the 'highly unlikely'. If we have an algorithm where

We train a generative model that predicts these camera observations given the motor actions.

then I typically add the following assumption to avoid doom:

  • Assumption: Take a sequence of disastrously unsafe motor actions, actions that modify the agent compute core to remove the safety interlocks. We assume that, while the following mis-prediction scenario is theoretically possible, the risk that it actually happens is very low. The mis-prediction scenario is that, for the above motor actions that modify the compute core, the above generative model incorrectly predicts camera observations in which the agent compute core remains perfectly untouched by these actions, and where the observed paperclip production resulting from these actions is better than under any other sequence of actions.

In terms of methodology, I usually describe the above move as one where we seek to drive down the risk of certain failure modes to residual levels. There is a link to empirical work here. To make it more plausible that the above assumption about low risk is valid for a particular ML system and physical realization of an agent and its environment, we can do simulations and real-life experiments with trained generative models.
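As an illustration of that empirical link, here is a minimal sketch of how one might estimate the misprediction risk of a trained generative model on a test set of known-unsafe action sequences. Everything here (the predict_observations call, the looks_untouched check, the construction of the test set) is hypothetical and only meant to show the shape of the measurement.

    # Illustrative sketch: estimate how often a trained generative model predicts
    # 'compute core untouched' observations for action sequences that in fact
    # modify the compute core.
    def estimate_mispredict_risk(model, unsafe_action_sequences, looks_untouched):
        """Fraction of known-unsafe action sequences for which the model predicts
        observations in which the agent compute core appears untouched."""
        mispredictions = 0
        for actions in unsafe_action_sequences:
            predicted_obs = model.predict_observations(actions)  # hypothetical API
            if looks_untouched(predicted_obs):
                mispredictions += 1
        return mispredictions / len(unsafe_action_sequences)

If this estimate stays very low, also under tests that deliberately push the model somewhat off-distribution, that supports the low-risk assumption stated above.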

So my question is: would the above assumption-adding step, about the low risk of mis-predictions, be a natural and valid assumption-adding process step for 'throwing out failure stories' in your methodology?

Or is the existence of this assumption automatically implied by default in your process?

Formal Solution to the Inner Alignment Problem

I don't think this is a problem. There will be plenty of them, but when they're wrong they'll get removed from the posterior.

I have seen you mention a number of times in this comment thread that 'this is not a problem because eventually the bad/wrong policies will disappear from the top set'. You have not qualified this statement with 'but we need a very low threshold parameter to make this work in a safe way', so I remain somewhat uncertain about what your views are on how low it needs to go.

In any case, I'll now try to convince you that, if the threshold is not set low enough, your statement that 'when they're wrong they'll get removed from the posterior' will not always mean what you might want it to mean.

Is the demonstrator policy to get themselves killed?

The interesting thing in developing these counterexamples is that they often show that the provable math in the paper gives you less safety than you would have hoped for.

Say that the demonstrator policy is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample, where at some time step we get the likely case that this paperclip-producing policy is no longer among the top policies. In the world I constructed for the counterexample, the remaining top policies now perform a synchronized treacherous turn where they kill the demonstrator.

From that time step onward, the policies diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return only a null action, since the dead demonstrator does nothing. This eventually removes all 'wrong' policies from the top set, where 'wrong' means that they do not take the null action at all future time steps.

The silver lining is perhaps that at least the agent will eventually stop, performing null actions only, after it has killed the demonstrator.
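To make this filtering dynamic concrete, here is a toy sketch of a posterior update over a finite set of candidate policies. This is my own simplified rendering of the general mechanism, not the construction from the paper: in this toy version, policies that assign zero probability to the observed action are simply dropped, so once the dead demonstrator only ever 'produces' the null action, exactly the policies that keep predicting the null action survive.

    # Toy sketch: Bayesian filtering over a finite set of candidate policies.
    # Each policy maps a history to a dict of action probabilities.
    def update_posterior(posterior, history, observed_action):
        """posterior: dict mapping policy -> weight. Returns the renormalized
        posterior after observing the demonstrator's action."""
        new_posterior = {}
        for policy, weight in posterior.items():
            likelihood = policy(history).get(observed_action, 0.0)
            if likelihood > 0.0:
                new_posterior[policy] = weight * likelihood
        if not new_posterior:
            return {}  # no candidate policy predicted the observed action
        total = sum(new_posterior.values())
        return {policy: weight / total for policy, weight in new_posterior.items()}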

Now, the paper proves that the behavior of the agent policy will approximate that of the true demonstrator policy more and more closely as time progresses. We therefore have to conclude that in the counterexample world, the true demonstrator policy had nothing to do with producing paperclips; that was a wrong guess all along. The right demonstrator policy is one where the demonstrator always intended to get themselves killed.

This would be a somewhat unusual solution to the inner alignment problem.

The math in the paper has you working in a fixed-policy setting where the demonstrator policy is immutable/time-invariant. The snag is that this does not imply that the policy defines a behavioral trajectory that is independent of the internals of the agent construction. If the agent is constructed in a particular way and operates in a certain environment, it will force the demonstrator's trajectory into a self-fulfilling one where the agent kills the demonstrator.

Side note: if anybody is looking for alternative math that allows one to study and manage the interplay between a mutable time-dependent demonstrator policy and the agent policy, causal models seem to be the way to go. See for example here where this is explored in a reward learning setting.
