## AI ALIGNMENT FORUM

Koen Holtman

Computing scientist and Systems architect

# Posts

Some AI research areas and their relevance to existential safety

Nice post!  In particular, I like your reasoning about picking research topics:

The main way I can see present-day technical research benefiting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years.  In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas.

I like this as a guiding principle, and have used it myself, though my choices have also been driven in part by more open-ended scientific curiosity.  But when I apply the above principle, I get to quite different conclusions about recommended research areas.

As a specific example, take the problem of oversight of companies that want to create or deploy strong AI: the problem of getting to a place where society has accepted and implemented policy proposals that demand significant levels of oversight for such companies.  In theory, such policy proposals might be held back by a lack of traction in a particular technical area, but I do not believe this is a significant factor in this case.

To illustrate, here are some oversight measures that apply right now to companies that create medical equipment, including diagnostic equipment that contains AI algorithms. (Detail: some years ago I used to work in such a company.) If the company wants to release any such medical technology to the public, it has to comply with a whole range of requirements about documenting all steps taken in development and quality assurance.  A significant paper trail has to be created, which is subject to auditing by the regulator.  The regulator can block market entry if the processes are not considered good enough.  Exactly the same paper trail + auditing measures could be applied to companies that develop powerful non-medical AI systems that interact with the public.  No technical innovation would be necessary to implement such measures.

So if any activist group or politician wants to propose measures to improve oversight of AI development and use by companies (either motivated by existential safety risks or by a more general desire to create better outcomes in society), there is no need for them to wait for further advances in Interpretability in ML (IntML), Fairness in ML (FairML) or Accountability in ML (AccML) techniques.

To lower existential risks from AI, it is absolutely necessary to locate proposals for solutions which are technically tractable.  But to find such solutions, one must also look at low-tech and different-tech solutions that go beyond the application of even more AI research.  The existence of tractable alternative solutions to make massive progress leads me to down-rank the three AI research areas I mention above, at least when considered from a pure existential safety perspective.  The non-existence of alternatives also leads me to up-rank other areas (like corrigibility) which are not even mentioned in the original post.

I like the idea of recommending certain fields for their educational value to existential-safety-motivated researchers. However, I would also recommend that such researchers read broadly beyond the CS field, to read about how other high-risk fields are managing (or have failed to manage) to solve their safety and governance problems.

I believe that the most promising research approach for lowering AGI safety risk is to find solutions that combine AI research specific mechanisms with more general mechanisms from other fields, like the use of certain processes which are run by humans.

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

In that comment I focus on corrigibility related work that has appeared as scientific papers and/or arxiv preprints.

Tradeoff between desirable properties for baseline choices in impact measures

By "semantic models that are rich enough", do you mean that the AI might need a semantic model for the power of other agents in the environment?

Actually, in my remarks above I am less concerned about how rich a model the AI may need. My main intuition is that we ourselves may need a semantic model that describes the comparable power of several players, if our goal is to understand motivations towards power more deeply and generally.

To give a specific example from my own recent work: in working out more details about corrigibility and indifference, I ended up defining a safety property 2 (S2 in the paper) that is about control. Control is a form of power: if I control an agent's future reward function, I have power over the agent, and indirect power over the resources it controls. To define safety property 2 mathematically, I had to make model extensions that I did not need to make to define or implement the reward function of the agent itself. So by analogy, if you want to understand and manage power seeking in an n-player setting, you may end up needing to define model extensions and metrics that are not present inside the reward functions or reasoning systems of each player. You may need them to measure, study, or define the nature of the solution.

The interesting paper you mention gives a kind-of example of such a metric, when it defines an equality metric for its battery collecting toy world, an equality metric that is not (explicitly represented) inside the agent's own semantic model. For me, an important research challenge is to generalise such toy-world specific safety/low-impact metrics into metrics that can apply to all toy (and non-toy) world models.

Yet I do not see this generalisation step being done often, and I am still trying to find out why not. Partly I think I do not see it often because it is mathematically difficult. But I do not think that is the whole story. So that is one reason I have been asking opinions about semantic detail.

In one way, the interesting paper you mention goes in a direction that is directly counter to the one I feel is the most promising one. The paper explicitly frames its solution as a proposed modification of a specific deep Q-learning machine learning algorithm, not as an extension to the reward function that is being supplied to this machine learning algorithm. By implication, this means they add more semantic detail inside the machine learning code, while keeping it out of the reward function. My preference is to extend the reward function if at all possible, because this produces solutions that will generalise better over current and future ML algorithms.
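To make the distinction concrete, here is a minimal sketch of what "extending the reward function" could look like: the impact penalty is folded into the reward function itself, so whichever learning algorithm consumes that reward function stays unchanged. The toy state layout, the `inequality` metric, and all function names are assumptions made up for this illustration; they are not the equality metric or the algorithm from the paper under discussion.

```python
from typing import Callable, Tuple

# Hypothetical toy state: (batteries held by the agent, batteries held by another party).
State = Tuple[int, int]
RewardFn = Callable[[State, str, State], float]


def base_reward(s: State, a: str, s_next: State) -> float:
    """Task reward: number of batteries the agent gained in this step."""
    return float(s_next[0] - s[0])


def inequality(s: State) -> int:
    """Illustrative impact-related metric: gap between the two battery counts."""
    return abs(s[0] - s[1])


def impact(s: State, s_next: State) -> float:
    """Penalize only increases in inequality caused by the transition."""
    return max(0.0, float(inequality(s_next) - inequality(s)))


def with_impact_penalty(
    base: RewardFn,
    impact_fn: Callable[[State, State], float],
    weight: float,
) -> RewardFn:
    """Return a new reward function; the learning algorithm itself is untouched."""

    def extended(s: State, a: str, s_next: State) -> float:
        return base(s, a, s_next) - weight * impact_fn(s, s_next)

    return extended


# The extended reward function can be handed to any RL algorithm as-is.
extended_reward = with_impact_penalty(base_reward, impact, weight=0.5)
```

Because the penalty lives in `extended_reward` rather than in the update rule, the same function could in principle be supplied to a Q-learner, a policy-gradient method, or a planner.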

Tradeoff between desirable properties for baseline choices in impact measures

Thanks for clarifying your view! I agree that for point 1 above, less semantic structure should be needed.

Reading some of the links above again, I still feel that we might be having different views on how much semantic structure is needed. But this also depends on what you count as semantic structure.

To clarify where I am coming from, I agree with the thesis of your paper Optimal Farsighted Agents Tend to Seek Power. I am not in the camp which, to quote the abstract of the paper, 'voices scepticism' about emergent power seeking incentives.

But to me, the main mechanism that turns power seeking incentives into catastrophic power-seeking is when at least two power-seeking entities with less than 100% aligned goals start to interact with each other in the same environment. So I am looking for semantic models that are rich enough to capture at least 2 players being present in the environment.

I have the feeling that you believe that moving to the 2-or-more-players level of semantic modelling is of lesser importance, is in fact a distraction, that we may be able to solve things cleanly enough if we just make every agent not seek power too much. Or maybe you are just prioritizing a deeper dive in that particular direction initially?

Tradeoff between desirable properties for baseline choices in impact measures

Thanks for the clarification, I think our intuitions about how far you could take these techniques may be more similar than was apparent from the earlier comments.

You bring up the distinction between semantic structure that is learned via unsupervised learning, and semantic structure that comes from 'explicit human input'. We may be using the term 'semantic structure' in somewhat different ways when it comes to the question of how much semantic structure you are actually creating in certain setups.

If you set up things to create an impact metric via unsupervised learning, you still need to encode some kind of impact metric on the world state by hand, to go into the agent's reward function, e.g. you may encode 'bad impact' as the observable signal 'the owner of the agent presses the do-not-like feedback button'. For me, that setup uses a form of indirection to create an impact metric that is incredibly rich in semantic structure. It is incredibly rich because it indirectly incorporates the impact-related semantic structure knowledge that is in the owner's brain. You might say instead that the metric does not have a rich semantic structure at all, because it is just a bit from a button press. For me, an impact metric that is defined as 'not too different from the world state that already exists' would also encode a huge amount of semantic structure, in case the world we are talking about is not a toy world but the real world.

Tradeoff between desirable properties for baseline choices in impact measures

Reading the above, I am reminded of a similar exchange about the need for semantic structure between Alex Turner and me here, so I'd like to get to the bottom of this. Can you clarify your broader intuitions about the need or non-need for semantic structure? (Same question goes to Alex.)

Frankly, I expected you would have replied to Stuart's comment with a statement like the following: 'using semantic structure in impact measures is a valid approach, and it may be needed to encode certain values, but in this research we are looking at how far we can get by avoiding any semantic structure'. But I do not see that.

Instead, you seem to imply that leveraging semantic structure is never needed when further scaling impact measures. It looks like you feel that we can solve the alignment problem by looking exclusively at 'model-free' impact measures.

To make this more specific, take the following example. Suppose a mobile AGI agent has a choice between driving over one human, driving over P pigeons, or driving over C cats. Now, humans have very particular ideas about how they value the lives of humans, pigeons, and cats, and would expect that those ideas are reflected reasonably well in how the agent computes its impact measure. You seem to be saying that we can capture all this detail by just making the right tradeoffs between model-free terms, by just tuning some constants in terms that calculate 'loss of options by driving over X'.

Is this really what you are saying?

I have done some work myself on loss-of-options impact measures (see e.g. section 12 of my recent paper here). My intuition about how far you can scale these 'model-free' techniques to produce human-morality-aligned safety properties in complex environments seems to be in complete disagreement with your comments and those made by Alex.

New paper: AGI Agent Safety by Iteratively Improving the Utility Function

Thanks!

I'm not sure if I fully understood the section on machine learning. Is the main idea that you just apply the indifference correction at every timestep, so that the agent always acts as if it believes that use of the terminal does nothing?

Yes, the learning agent also applies the indifference-creating balancing term at each time step. I am not sure if there is a single main idea that summarizes the learning agent design -- if there had been a single main idea then the section might have been shorter. In creating the learning agent design I combined several ideas and techniques, and tweaked their interactions until I had something that provably satisfies the safety properties.

What about the issue that "the terminal does nothing" is actually a fact that has impacts on the world, which might produce a signal in the training data?

As a general rule, inside the training data gathered during previous time steps, it will be very visible that the signals coming from the input terminal, and any changes in them, will have an effect on the agent's actions.

This is not a problem, but to illustrate why not I will first describe an alternative learning agent design where it would be a problem. Consider a model-free advanced Q-learning type agent, which uses the decision making policy of 'do more of what earlier versions of myself did when they got high reward signals'. If such an agent has the container reward function defined in the paper, then if the training record implies the existence of attractive wireheading options, these might well be used. If the Q-learner has the container reward function, then the policy process may end up with an emergent drive to revert any updates made via the input terminal, so that the agent gets back to a set of world states which are more familiar territory. The agent might also want to block updates, for the same reason. But the learning agent in the paper does not use this Q-learning type of decision making policy.
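As a rough illustration of the 'do more of what earlier versions of myself did when they got high reward signals' policy, here is a minimal tabular Q-learning sketch. This is the standard textbook update, shown only for contrast; it is not the learning agent from the paper. The state names, the action names (including `revert_update`), and the reward value are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(
    q: QTable,
    s: Hashable,
    a: Hashable,
    reward: float,
    s_next: Hashable,
    actions: Sequence[Hashable],
    alpha: float = 0.5,
    gamma: float = 0.9,
) -> None:
    """Standard Q-learning update: move Q(s, a) towards the observed return."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])


def greedy_action(q: QTable, s: Hashable, actions: Sequence[Hashable]) -> Hashable:
    """Pick whatever action was associated with the highest value so far."""
    return max(actions, key=lambda b: q[(s, b)])


# Hypothetical training record: an earlier time step in which undoing a
# terminal update happened to be followed by a high reward signal.
ACTIONS = ["work", "revert_update"]
q: QTable = defaultdict(float)
q_update(q, "s0", "revert_update", reward=1.0, s_next="s1", actions=ACTIONS)
```

After this single update, `greedy_action(q, "s0", ACTIONS)` prefers `revert_update`: the policy simply repeats whatever the training record associates with high reward, without any counterfactual reasoning about the terminal.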

The agent in the paper takes actions using a reasoning process that is different from 'do what earlier versions of myself did when...'. Before I try to describe it, first a disclaimer. Natural language analogies to human reasoning are a blunt tool for describing what happens in the learning agent: this agent has too many moving and partially self-referential parts inside to capture them all in a single long sentence. That being said, the learning agent's planning process is like 'do what a hypothetical agent would do, in the world you have learned about, under the counterfactual assumption that the payload reward function of that hypothetical agent will never change, no matter what happens at the input terminal you also learned about'.

In section 11 of the paper, I describe the above planning process as creating a form of bureaucratic blindness. By design, the process simply ignores some of the information in the training record: this information is simply not relevant to maximizing the utility that needs to be maximized.

The analogy is that if you tell a robot that not moving is safe behavior, and get it to predict what happens in safe situations, it will include a lot of "the humans get confused why the robot isn't moving and try to fix it" in its predictions. If the terminal actually does nothing, humans who just used the terminal will see that, and will try to fix the robot, as it were. This creates an incentive to avoid situations where the terminal is used, even if it's predicted that it does nothing.

I think what you are describing above is the residual manipulation incentive in the second toy world of section 6 of the paper. This problem also exists for optimal-policy agents that have 'nothing left to learn', so it is an emergent effect that is unrelated to machine learning.

An overview of 11 proposals for building safe advanced AI

Thanks for the post! Frankly this is a sub-field of alignment that I have not been following closely, so it is very useful to have a high-level comparative overview.

I have a question about your thoughts on what 'myopia verification' means in practice.

Do you see 'myopia' as a single well-defined mathematical property that might be mechanically verified by an algorithm that tests the agent? Or is it a more general bucket term that means 'bad in a particular way', where a human might conclude, based on some gut feeling when seeing the output of a transparency tool, that the agent might not be sufficiently myopic?

What informs this question is that I can't really tell when I re-read your Towards a mechanistic understanding of corrigibility and the comments there. So I am wondering about your latest thinking.

Specification gaming: the flip side of AI ingenuity

In the TAISU unconference the original poster asked for some feedback:

I recently wrote a blog post with some others from the DM safety team on specification gaming. We were aiming for a framing of the problem that makes sense to reinforcement learning researchers as well as AI safety researchers. Haven't received much feedback on it since it came out, so it would be great to hear whether people here found it useful / interesting.

My thoughts: I feel that engaging/reaching out to the wider community of RL researchers is an open problem, in terms of scaling work on AGI safety. So great to see a blog post that also tries to frame this particular problem for an RL researcher audience.

As a member of the AGI safety researcher audience, I echo the comments of johnswentworth: well-written, great graphics, but mostly stuff that was already obvious. I do like the picture 'spectrum of unexpected solutions' a lot, this is an interesting way of framing the issues. So, can I read this post as a call to action for AGI safety researchers? Yes, because it identifies two open problem areas, 'reward design' and 'avoidance of reward tampering', with links.

Can I read the post as a call to action for RL researchers? Short answer: no.

If I try to read the post from the standpoint of an RL researcher, what I notice most is the implication that work on 'RL algorithm design', on the right in the 'aligned RL agent design' illustrations, has an arrow pointing to 'specification gaming is valid'. If I were an RL algorithm designer, I would read this as saying there is nothing I could contribute, if I stay in my own area of RL algorithm design expertise, to the goal of 'aligned RL agent design'.

So, is this the intended message that the blog post authors want to send to the RL researcher community? A non-call-to-action? Not sure. So this leaves me puzzled.

In the TAISU discussion we concluded that there is indeed one call to action for RL algorithm designers: the message that, if they are ever making plans to deploy an RL-based system to the real world, it is a good idea to first talk to some AI/AGI safety people about specification gaming risks.

Subagents and impact measures, full and fully illustrated

Nice! There are a lot of cases being considered here, but my main takeaway is that these impact measures have surprising loopholes, once the agent becomes powerful enough to construct sub-agents.

Mathematically, my main takeaway is that, for the impact measure $\text{PENALTY}(s,a) = \sum_i |Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing)|$ from Conservative Agency, if the agent wants to achieve the sub-goal $R_i$ while avoiding the penalty triggered by the corresponding term, it can build a sub-agent that is slightly worse at achieving $R_i$ than it would be itself, and set it loose.
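For readers who want to see the shape of this penalty, here is a minimal sketch. It assumes the form used in Conservative Agency: a sum, over a set of auxiliary reward functions, of the absolute difference between the Q-value of the chosen action and the Q-value of a no-op action in the same state. The Q-tables, the state and action names, and the numbers are made up for illustration.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

QFn = Callable[[Hashable, Hashable], float]
NOOP = "noop"  # assumed name for the do-nothing action


def aup_penalty(q_fns: Sequence[QFn], s: Hashable, a: Hashable, noop: Hashable = NOOP) -> float:
    """Sum over auxiliary Q-functions of |Q(s, a) - Q(s, noop)|."""
    return sum(abs(q(s, a) - q(s, noop)) for q in q_fns)


def table_q(table: Dict[Tuple[Hashable, Hashable], float]) -> QFn:
    """Wrap a dict-based Q-table as a callable Q-function."""
    return lambda s, a: table[(s, a)]


# Hypothetical attainable-utility values for two auxiliary reward functions.
q_fns = [
    table_q({("s0", "noop"): 1.0, ("s0", "grab"): 3.0}),
    table_q({("s0", "noop"): 2.0, ("s0", "grab"): 2.5}),
]

penalty_grab = aup_penalty(q_fns, "s0", "grab")  # |3.0 - 1.0| + |2.5 - 2.0|
penalty_noop = aup_penalty(q_fns, "s0", "noop")  # the no-op is never penalized
```

Note that every term is computed from the agent's own Q-values only, which is the property the sub-agent discussion here turns on.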

Now for some more speculative thoughts. I think the main source of the loophole above is the part , so what happens if we just delete that part? Then we get an agent with an incentive to stop any human present in the environment from becoming too good at achieving the goal , which would be bad. More informally, it looks like the penalty term has a loophole because it does not distinguish between humans and sub-agents.

Alice and Bob have a son Carl. Carl walks around and breaks a vase. Who is responsible?

Obviously, this depends on many factors, including Carl's age. To manage the real world, we weave a quite complex web to determine accountability.

In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.