Abstract

The question of how the increasing intelligence of AIs influences central problems in AI safety remains neglected. We use the framework of reinforcement learning to discuss what continuous increases in the intelligence of AI systems imply for central problems in AI safety. We first argue that predicting the actions of an AI gets increasingly difficult for increasingly intelligent systems and is impossible for sufficiently intelligent AIs. We then briefly argue that solving the alignment problem requires predicting features of the AI agent such as its actions or goals. Based on this, we give a general argument that the alignment problem becomes more difficult for increasingly intelligent AIs. In brief, this is because the increased capabilities of such systems require increasingly precise predictions of their goals or actions to ensure sufficient safety and alignment. We then consider specific problems of different approaches to the alignment problem. Finally, we conclude by discussing what this increasing difficulty means for the chances of catastrophic risks due to a potential failure to solve the alignment problem.

 

1. Introduction

Several papers have argued that sufficiently advanced artificial intelligence will be unpredictable and unexplainable [Yampolskiy, 2020a, 2020b, 2022]. Within AI safety, there further seems to be broad agreement that AI alignment, that is, broadly speaking, making AIs try to do what we want, becomes more difficult as AI systems get more intelligent. However, the arguments for unpredictability have been brief and rather informal, and they have not been integrated with current machine learning concepts. Further, the argument that alignment becomes more difficult with increasing intelligence has mostly been informally motivated by various potential failure modes such as instrumental convergence, misgeneralization, intelligence explosion, or deceptive alignment rather than by an overarching theoretical account. Possibly for that reason, both arguments have found less agreement than they might otherwise have.

Additionally, although many people claim to be more optimistic about a world in which AI capabilities are developed gradually, so that we may address the ensuing safety challenges in a continuous manner, most arguments mentioned above do not incorporate this continuous development but proceed from notions like superintelligence, referring to agents that are (already) substantially smarter than humans in every domain. This is especially important as leading AI labs’ approach to AI safety substantially builds on the idea that we can address the safety challenges ensuing from the development of increasingly intelligent AI systems by doing the necessary AI safety engineering just in time, by centrally using future AI systems for that task [Leike, 2022a, 2022b; Leike et al., 2022][1]. To evaluate this approach with any rigor, we must understand how AI safety problems scale with increasingly intelligent AI systems.

What has been lacking is a more general theoretical account that explains why AI safety and alignment should generally become more difficult with the increasing intelligence of AI systems. We here aim to make this case. We use the conceptual basis of reinforcement learning (RL) to describe increases in intelligence more precisely and, on that basis, give a more formal argument for the unpredictability of the actions of advanced AI that builds on continuous increases in the intelligence of AI systems. We also show how the issues of predictability and alignment are deeply related, since solving the alignment problem requires predicting the parts of the agent, such as its behavior or goals, that we want to align. We then discuss the AI alignment problem. We argue that this problem becomes more difficult for increasingly intelligent AIs, as their increased capabilities necessitate more accurate predictions about their goals, and we highlight additional difficulties for specific approaches like interpretability.

 

2. Intelligence and world models

Let’s begin by characterizing core terms for the subsequent argument that alignment becomes more difficult with increasing intelligence. First, drawing from various traditions in psychology and cognitive science, we see intelligence as the general ability to act adaptively in many contexts toward plausible goals [Coelho Mollo, 2022; Legg and Hutter, 2007]. Note that this characterizes intelligence as a behavioral property and is neutral as to how it is achieved. 
Importantly, we believe intelligent agents need to have a good model of their environment. Modeling or predicting the environment in which the agent acts is necessary for adaptive action and thereby intelligence. This is because, without a model of the environment, adaptive action for any significant amount of time is too unlikely. Intelligence then requires a good model, and intelligent agents have a good model of their environment.

Further, since we see sufficient quality of the agent's model of the environment as a necessary condition for adaptive action (intelligence), more intelligence generally implies an improved model. The quality of agents' models of their environment scales with, or correlates with, their intelligence. In other words, increasingly intelligent agents have increasingly better models of their environment. We unpack this below, primarily using the formalisms of reinforcement learning (RL). Abstractly, we characterize intelligent agents as follows:

  • It takes observations $o_t$ and actions $a_t$ (at time point $t$).
  • It receives or generates a reward $r_t$ and tries to maximize the cumulative reward, commonly called return and denoted with $G_t$, where $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ with discount factor $\gamma \in [0, 1]$.
  • As the return is not directly accessible, the agent in state $s_t$ uses an expected return, called value $v(s_t)$, where $v_\pi(s_t) = \mathbb{E}_\pi[G_t \mid S_t = s_t]$. It maximizes $v$ in order to maximize $G$.
  • This value $v$ depends both on the actions and on the dynamics of its environment. Simple agents in simple environments only need a mapping from states to actions, called a policy $\pi(a \mid s)$.
  • However, for the agent to maximize $v$ and thereby behave adaptively in a complex environment, it needs to predict or model its environment and have a policy $\pi$ that maximizes $v$ for the respective states of the environment. Thus, it needs a good model of its environment and of how its actions in the environment lead to maximal values in $v$ (a good value function estimate); a minimal code sketch of this loop follows below.
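To make this characterization concrete, here is a minimal, illustrative sketch in Python of the agent loop just described. The class and method names (`Agent`, `policy`, `update_value`) are our own and not from any particular library; the random policy and the simple temporal-difference value update are stand-ins, not a proposal for how intelligent agents are actually built.

```python
import random

class Agent:
    """Illustrative sketch of the agent loop described above (names are our own)."""

    def __init__(self, actions, gamma=0.99, alpha=0.1):
        self.actions = actions   # available actions a_t
        self.gamma = gamma       # discount factor used in the return G_t
        self.alpha = alpha       # learning rate for the value estimate
        self.value = {}          # v(s): estimated expected return from state s

    def policy(self, state):
        # A policy maps states to actions; a competent agent would choose
        # actions that maximize v, but we act randomly here as a placeholder.
        return random.choice(self.actions)

    def update_value(self, state, reward, next_state):
        # TD(0) update: move v(s) toward r + gamma * v(s'),
        # a bootstrapped estimate of the return G_t.
        v_next = self.value.get(next_state, 0.0)
        v = self.value.get(state, 0.0)
        self.value[state] = v + self.alpha * (reward + self.gamma * v_next - v)
```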

Now, with this last point, we again arrive at the point made less formally above: adaptive action of an agent, and thereby intelligence, necessitates a good model in a complex environment. While in RL a model is not strictly necessary, agents with a predictive model of their environment have a better representation of their uncertainty, which helps them learn and explore their environment effectively to maximize their future reward. In this way, the model, as in active inference, enables the agent to maximize both its epistemic value (information gain) and its pragmatic value, thereby finding an efficient solution to the explore-exploit tradeoff and exploring in an efficient, directed fashion. Further note that model-based RL agents do not have to rely as much on sometimes slow interaction with the world, which has a variety of advantages including data efficiency, computational efficiency, speed, and reduced costs. In the end, intelligent agents need a general way to capture environmental dynamics to behave adaptively in complex, dynamic environments, and this is done effectively by the model.

In the limit, for easy problems, the agent's value function approaches the true value function and its model approaches a function capturing the environmental dynamics. We may also consider the reward to be just another observation over which the agent has preferences, instead of a special feature of the environment or agent [Sajid et al., 2021].

Expressed as subproblems in RL, the agent thus gets better at prediction (of the future for a given policy) and control (i.e., finding and choosing the policy with maximal expected reward). In such a simple environment, the agent will learn the dynamics of that world and how its actions create (maximal) rewards, which it implements in its policy. The optimal policy of course depends on modeling the environment, since the ideal policy is defined over the values $v_\pi(s)$ of a policy $\pi$ given a state $s$, and the agent would not know $v_\pi(s)$ ahead of time without predicting it.

The agent's value function approaches the accurate value function for competent RL agents, since an accurate value function is known in RL to be necessary for optimal behavior and is commonly captured in the Bellman equation:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr] \qquad (1)$$
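For a known, finite MDP, equation (1) can be turned directly into the familiar value-iteration procedure. The sketch below assumes a transition table `P[s][a]` given as (probability, next_state, reward) triples; this format is an illustrative assumption rather than a standard API.

```python
def value_iteration(P, gamma=0.99, tol=1e-8):
    """Tabular value iteration; P[s][a] is a list of (prob, next_state, reward) triples."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup from equation (1):
            # v*(s) = max_a sum_{s', r} p(s', r | s, a) [r + gamma * v*(s')]
            best = max(
                sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

Note that this sketch presupposes access to the true dynamics $p(s', r \mid s, a)$; the point of the surrounding argument is precisely that an intelligent agent has to learn an approximation of these dynamics itself.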

As is well-known in RL, learning a good model is fundamentally hard [OpenAI, 2018]. However, what distinguishes intelligent agents is that even in complex environments, they are able to behave adaptively towards their goals, i.e., maximize their value. We therefore know that by maximizing the reward, such agents are both getting closer to the actual value function and capturing the environmental dynamics in their model better.

In most (if not all) complex environments, there will be pressures beyond the accuracy of the model. Importantly, such environments give rise to pressure towards a compressed or simple model as well as pressure toward speed. We expect a pressure towards compressed models and agent states, where the agent's state-update function is $s_{t+1} = u(s_t, a_t, o_{t+1})$ such that agent states are generally a good compression of the agent's full history $h_t$, where $h_t = (o_1, a_1, r_1, \ldots, a_{t-1}, o_t, r_t)$. This pressure comes especially from lower computational cost and data efficiency, while the pressure towards speed should emerge directly from reward maximization. Though conceptually distinct, these two optimization targets might converge in practice. If we think of complexity in conjunction with accuracy, another central consideration is that simple models tend to generalize better and are easier to update. This tendency towards both accuracy and simplicity is, for example, elegantly captured in active inference and the free energy principle (which is a biologically plausible approximation of exact Bayesian inference) as a tradeoff between the complexity and accuracy of one's model [Parr et al., 2022].
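As a small illustration of the pressure toward compressed agent states, the recurrent update below folds each new action and observation into a fixed-size state vector instead of storing the growing history. The function names and the use of random placeholder weights are assumptions made purely for illustration.

```python
import numpy as np

def init_weights(state_dim=32, action_dim=4, obs_dim=16, seed=0):
    # Placeholder weights; in a real agent these would be learned under
    # pressure for predictive accuracy, low complexity, and speed.
    rng = np.random.default_rng(seed)
    return (rng.normal(size=(state_dim, state_dim)) * 0.1,
            rng.normal(size=(state_dim, action_dim)) * 0.1,
            rng.normal(size=(state_dim, obs_dim)) * 0.1)

def update_state(state, action_vec, obs_vec, W_s, W_a, W_o):
    # s_{t+1} = u(s_t, a_t, o_{t+1}): a fixed-size compression of the
    # history h_t, rather than the history itself, whose length grows with t.
    return np.tanh(W_s @ state + W_a @ action_vec + W_o @ obs_vec)
```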

Now, the model is a way for the system to predict the dynamics of its environment as well as the actions that help it maximize its value. In RL lingo, we may think of the model as approximating an MDP. In other words, a model predicts the agent's environmental dynamics and the value that different actions would likely achieve in the environment. To zoom in on the pressure towards simple models, this generative model will be characterized by a certain accuracy and computational complexity. If there is pressure to maximize accuracy (to maximize value) and additionally pressure to minimize computational complexity, then the ideal model $m_*$ would have the Kolmogorov complexity of the laws underlying the environmental dynamics and the value function: $K(m_*) = K(p(s', r \mid s, a), v_*)$. Further, given the pressures on model accuracy and computational complexity, an agent that maximizes its value via its intelligence would have a model that approaches this accuracy and Kolmogorov complexity, even if it is never reached.
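Kolmogorov complexity is uncomputable, so any agent (or designer) subject to these pressures can only use a computable stand-in. The sketch below scores candidate models by prediction error plus a description-length penalty, an MDL-style proxy we introduce purely for illustration; the function names and the tradeoff parameter are our own.

```python
def model_score(prediction_error, description_length_bits, beta=0.01):
    # Lower is better: the first term is the accuracy pressure, the second
    # a complexity pressure, with description length standing in for the
    # uncomputable Kolmogorov complexity K(m). beta sets the tradeoff.
    return prediction_error + beta * description_length_bits

def pick_model(candidates):
    # candidates: iterable of (model, prediction_error, description_length_bits)
    return min(candidates, key=lambda c: model_score(c[1], c[2]))[0]
```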

 

3. Predictability and Alignment

We can use the framework we have set up to think about predictability and the alignment problem for AI agents. There are two central aspects of agents one might want to predict. Firstly, and perhaps most obviously, one might want to predict the actions of such an AI system; secondly, one might want to predict the internally represented goal or value function of the agent. The first approach would then aim to use the prediction of an AI's behavior to keep it under control, e.g. by putting it in an environment where it would not be harmful, or by changing it or turning it off once it tries to behave in harmful ways. These examples and the well-known associated problems show that this approach is most closely related to classical approaches to solving the AI control problem. One could, however, also try to use predicted actions as a basis to evaluate whether the AI's behavior (and thereby presumably its value function) is compatible with human values, to ensure that we only build agents for which that is true in the first place.
 

3.1 The increasing difficulty of prediction

Of course, predicting an AI's behavior seems quite far-fetched given our current understanding of AI systems. It has also been argued that this is strictly impossible for an advanced AI [Yampolskiy, 2020b]. Using the framework employed here, there seems to be a good argument for that conclusion. To see this, we have to think about how the challenge develops for more intelligent agents. In the RL framework employed here, to predict an agent's behavior, one needs to account for the current state, its model of the environment, its (resulting) value function, and the resulting policy which specifies the agent's actions. Consequently, we at least need to be able to determine its state, model, and value function to predict the policy and behavior that are downstream of these. We argued before that the agent's model is central to its adaptive behavior and thereby its intelligence. Thus, more intelligent agents have a better model of their environment, which we would need to predict. Based on this alone, we can see that the task becomes increasingly difficult for increasingly intelligent AI systems. More intelligent agents have a better model of their environment, so that the task of predicting the actions of a more intelligent AI agent has to involve capturing this increasingly better model. For this reason, predicting AI behavior gets increasingly difficult as we develop smarter systems. This is mirrored in the common surprise regarding the capabilities of more intelligent AIs that have been deployed, and in our inability to predict which capabilities AI systems will develop (e.g. jobs in creative industries seem much more at risk than most anticipated). Given the argument we just made, we should expect to become increasingly worse at predicting the capabilities of future AI systems.

Finally, this should also make it clear that the task of predicting AI behavior is not only increasingly hard but impossible for an agent that is sufficiently more intelligent than us. If the higher intelligence of an agent A means having an increasingly better model, then an agent B that is less intelligent will by definition have a worse model, which won't allow it to predict the behavior of A. We can also turn this into a proof by contradiction. Suppose that an agent B which is less intelligent than an agent A is nevertheless capable of predicting the behavior of A. Since the lower intelligence of B entails that it has a worse model than A, the less accurate model of B would have to accurately predict the more accurate model of A. But that is a contradiction, so our initial assumption must have been wrong. Notice, especially for this last version, the similarity to the proof of the same statement by [Yampolskiy, 2020b]. The difference lies in the use of the agents' models, which we argued get better with increasing intelligence.

To anticipate one answer: it could be that the two agents' models also differ in their computational complexity besides their accuracy, so that B could leverage its simpler and/or faster algorithm in modeling A. Here, we refer back to our previous discussion of this point. We mentioned that we expect a pressure towards speed and towards low computational complexity of the agent's model in most environments. Further, more intelligent agents are the ones that have optimized their policies further for their complex environment on the basis of a better model. All else being equal, we therefore expect them to also have had more exposure to these pressures, so that inefficiencies are less pronounced in the more intelligent agent than in the less intelligent one. Thus, while it is conceivable that a less intelligent agent could leverage its lower computational complexity and/or higher speed, this seems implausible in practice. Independently, it also seems implausible in practice since artificial neural networks are already much faster than us.
 

3.2 The increasing difficulty of alignment

To approach AI safety in a perhaps more realistic way then, we would have to give up on predicting the model of the agent and its behavior. Instead, following the second approach mentioned above, we could aim to find a solution to the alignment problem by predicting the value function. Here the aim is to predict the internally represented goal or value function $v$ of the AI system to ensure that it is compatible with human values, and to only build such agents for which that is true. First, some remarks: 1) We need to predict the internally represented goal or value function since we do not know whether the learned goal or value function of the AI system in question is sufficiently similar to the intended goal or value function [Hubinger et al., 2019]. This so-called inner alignment problem is well-recognized and indeed a core part of the worry here, since for current AI systems, we cannot engineer goals or the right value function into them (even if we knew them). 2) It is important to steer clear in this context of circularly defined goals or utility functions as, e.g., used in active inference, where they are generalizations of the actual behavior (and defined by a system's dynamical attractor). These are evidently not helpful here. 3) It is worth emphasizing that in predicting the goal or value function and thereby giving up on predicting the agent's behavior, we are also giving up control. Whatever the agent does is instead centrally determined by its value function, and we consequently have to be extremely confident in the AI having the right one.

Now, for this alignment approach where we need to predict the AI's goal/value function, we either, first, need to train the RL agent in such a way that we can predict its value function within sufficiently small confidence intervals, or, second, we need to have some way of assessing the agent's value function via (mechanistic) interpretability of the AI's internals. Third and finally, we might again infer goals and/or dangerous capabilities from its actual behavior, but in an environment that is more secure, e.g. via red team evaluations. Ideally, we would of course want to use all of these approaches in parallel, so that mistakes in the setup of the agent giving it its value function could be detected by the interpretability approach and/or sufficiently secure evaluations before catastrophic outcomes.

It is important to note that all of these alignment approaches are trying to predict (different parts of) the agent, namely its actions or its goal/value function. This is because, in order to determine if an AI is aligned with our goals, one needs to know what it is working toward, which in our view boils down to either predicting its behavior (which we discussed above) or predicting its goal/value function.[2][3] As a consequence, solving the alignment problem requires predicting the different parts of the agent that we want to align to our goals. In this sense, a solution to the alignment problem depends on solving at least one of these prediction problems. Note that in our understanding, a solution to the problem naturally excludes the possibility that alignment happens by chance. A solution requires a well-grounded expectation that the AI's behavior or goal matches the intended one and that there is no further deviation between the intended objective and what we really want.

Now, how do the alignment problem and the three approaches just mentioned scale with increasing intelligence? We think that most approaches will get more difficult with the increasing intelligence of the AI system in question, although part of the answer here is more uncertain than the previous one.

One central consideration for the increasing difficulty of alignment applies both to the first approach, which tries to produce the right AI during training, and to the second approach, which uses interpretability to try to predict the AI's value function. Increased intelligence leads to a wider behavioral repertoire in which the policy employed by the agent gives it a higher value. This follows again from the above considerations, including our definition of intelligence. We started by defining intelligence as an ability to act adaptively across contexts toward plausible goals. The agent's goal will be something that achieves a high value for the agent, since the agent's training procedure maximizes the expected return (and thereby value) and may allow policies that accept low initial rewards if they yield higher rewards later, or policies that are hard to find in the search space, e.g. because they only work (well) for very precise actions. Although the agent arrived at a goal/value function that produces high value in the training environment, this function might not exactly (or only) maximize the agent's value. Thus, its actual policy $\pi$ will be a proxy of the optimal policy $\pi_*$ and, importantly, its value function $v$ will also only be a proxy of the optimal value function $v_*$ expressed in the Bellman equation in (1).

This is again the inner alignment problem. Alternatively, we can imagine that the agent indeed ends up maximizing exactly the actual value function expressed in the Bellman equation, but that this value function is not (exactly) what humans in fact wanted, since the agent's reward might not incorporate everything that humans value. Thus, there might be deviations from whatever goal we want to align our AI to, either due to, first, the inner alignment problem, expressed here as a deviation from the optimal value function, or, second, the outer alignment problem, expressed here as an incomplete specification of the agent's reward. Now, if this is the case, then we can see that increasing intelligence should generally make our problem more difficult. The reason is that for a fixed proxy objective O′ specified over a proper subset $J \subset \{1, \ldots, L\}$ of the attributes of the world, where $\{1, \ldots, L\}$ are the attributes that support the real objective O, maximizing O′ will under plausible assumptions in the limit lead to a net negative (and indeed unboundedly negative) result in the utility of O [Zhuang and Hadfield-Menell, 2021].

How does this lead to the problem getting more difficult with the increasing intelligence of AI? The problem is that an agent that is more capable at maximizing an objective will more likely enter this regime of overoptimization of O′, where the real utility for O falls off into the net negative area. The agent is more capable at optimizing for its objective since that is what we mean by increasing intelligence in an agent that is trained to maximize its value, while it may optimize O′ instead of the intended objective O for the two aforementioned reasons. Due to this overoptimization, negative outcomes become more likely for an increasingly intelligent agent that is increasingly better at optimizing for its objective. The prediction of the agent's goal therefore needs to be increasingly accurate for sufficient alignment of agents that are increasingly intelligent. Indeed, approaches that avoid final objectives for AI systems altogether, e.g. by allowing corrigibility or enforcing uncertainty over the AI's objective as in cooperative inverse reinforcement learning (CIRL), have been proposed because of this problem [Hadfield-Menell et al., 2016; Zhuang and Hadfield-Menell, 2021]. While evaluating these approaches is beyond the scope of this article, it is worth noting that they do not get us out of the problem setup we described before. Corrigibility turns the problem into one where multiple iterations of predicting and choosing the right goal function are possible. CIRL forces us to set up a training scheme in which we also have to be able to predict that the AI eventually converges on the right objective function after reducing its uncertainty over potential objective functions.
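A toy numerical sketch of this effect, loosely modeled on the setup in [Zhuang and Hadfield-Menell, 2021] but with our own simplified allocation rule and a log utility chosen for illustration: a more capable optimizer shifts more of a shared budget onto the attributes the proxy O′ references, and the true utility of O eventually falls without bound.

```python
import numpy as np

def true_utility(x):
    # True objective O: increasing in every attribute and unboundedly
    # negative as any attribute is driven toward zero.
    return float(np.sum(np.log(x)))

def optimize_proxy(n_attrs, proxy_idx, budget, capability):
    # The proxy O' only "sees" the attributes in proxy_idx. A more capable
    # optimizer (capability closer to 1) moves more of the shared budget
    # onto those attributes, starving the ones only O cares about.
    mask = np.zeros(n_attrs, dtype=bool)
    mask[list(proxy_idx)] = True
    x = np.full(n_attrs, budget / n_attrs)   # even initial allocation
    taken = capability * x[~mask].sum()      # resources pulled from O-only attributes
    x[~mask] *= (1.0 - capability)
    x[mask] += taken / mask.sum()
    return x

# The proxy score keeps rising with capability, but the true utility of O
# eventually falls without bound as the unreferenced attributes collapse.
for capability in [0.5, 0.9, 0.99, 0.999999]:
    x = optimize_proxy(n_attrs=4, proxy_idx=[0, 1], budget=4.0, capability=capability)
    print(f"capability={capability}: true utility of O = {true_utility(x):.2f}")
```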

Let us turn to the first alignment approach by itself, which tries to predict the AI's goal/value function that results from its training. Here, we currently do not have a way to train AIs (or, potentially, RL agents) such that we could tell whether the goal or value function of the AI will land within a certain confidence interval, nor can we tell what a sufficiently small confidence interval would be for safety. We cannot assume that the agent simply learns the intended reward function, since it is unclear whether it converges on the exact reward function. This specific facet of the inner alignment problem is discussed under the heading of goal misgeneralization [Langosco et al., 2021; Shah et al., 2022]. It is further unclear whether this approach would get more difficult with increasingly intelligent systems, since we do not currently know how such an approach would work.

There is, however, a central reason to expect that predicting the AI's goal/value function will become more difficult for increasingly intelligent agents. The central reason is that the value function depends on the agent's model. In general, the RL agents we are interested in here use models for planning, which means improving their values and policies. More specifically, the model is used to calculate expected returns (i.e., values) for different policies and environmental dynamics.
We said above that the (optimal) policy is chosen by maximizing the value: $\pi_* = \arg\max_\pi v_\pi(s)$.
The agent has to predict the value $v_\pi(s)$ of different policies $\pi$, where

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, S_t = s\right] \qquad (2)$$

It compares these different policies and picks (the) one with the highest value $v_\pi(s)$. So here, it is the expectation $\mathbb{E}_\pi$ in this equation that is centrally influenced by the model for the respective different policies. We can then denote the optimal value $v_*(s)$ with the Bellman equation as above in (1). In brief, the value function depends on expected future states and rewards, which the agent determines with its model. Now, as we noted above, the model will approach the environmental dynamics and help estimate the value that different actions would likely achieve in the environment. As such, as we noted in section 3.1, it is increasingly difficult to capture for increasingly intelligent agents. This, however, leads us to a situation reminiscent of the argument above on predictability. Predicting the value function should become more difficult for increasingly intelligent agents, since it is harder to predict their model, which influences the value function. This does not mean that good approximations are not possible. However, for the case of both precise predictions of the value function and sufficiently intelligent AI, we seem to be drawn to the same conclusion as above, namely that this is impossible. We see this as the central reason for the increasing difficulty of this approach, based on the RL lens employed here.
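To see why the value estimate inherits the difficulty of the model, consider the following sketch of model-based policy comparison; the `model(state, action)` sampler interface and the function names are assumptions made for illustration, not a standard API.

```python
def estimate_value(policy, model, state, gamma=0.99, horizon=50, n_rollouts=200):
    # Monte Carlo estimate of v_pi(s): roll the agent's learned model forward
    # under the policy and average the discounted returns, as in equation (2).
    # model(state, action) is assumed to return a sampled (next_state, reward).
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = model(s, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_rollouts

def best_policy(policies, model, state):
    # Control: pick the policy with the highest estimated value in `state`.
    # Any inaccuracy in `model` propagates directly into this comparison,
    # and so into any external attempt to predict the agent's value function.
    return max(policies, key=lambda pi: estimate_value(pi, model, state))
```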
In addition, this approach may become increasingly difficult for increasingly intelligent AIs because the training of current foundation models has become more complex by including several consecutive training stages. If this development persists or accelerates for increasingly intelligent AIs, it is plausible that this also makes the task of predicting the value function more difficult.

Regarding interpretability as a way to predict the value function of the AI agent, we first and foremost do not currently know how to decipher an AI's internal processes in such a way as to tell what goal or value function it might have (if such a thing exists in a non-trivial sense). The field is extremely young, and most research is on systems, such as GPT-2, that are substantially smaller than the current cutting edge. Consequently, this approach relies on substantial advances in interpretability. It is also possible that interpretability will get more difficult with the increasing intelligence of AI agents. An intuition for this is that it seems difficult to compress such a central property of increasingly intelligent agents that has already been optimized for low computational complexity. We are, however, not confident in this, since, as mentioned before, we do not know how to decipher the goal or value function of an AI system.

Third, red team evaluations are an increasingly important approach to AI safety and alignment. They are, however, not a solution to the actual safety problem but instead an attempt to point out instances of the problem once it has already emerged. As such, the approach becomes more important with increases in intelligence, since it is in these cases that evaluations should find dangerous capabilities or otherwise alarming results. However, it is unclear whether safety evaluations will work in later stages of AI development. This is because the AI may be smart enough to bypass them, and the evaluations themselves may not keep pace with this development. People have different intuitions as to whether this point is reached quickly, and depending on this, opinions on the value of evaluations vary rather strongly. Additionally, even if the approach works and reveals dangerous capabilities, it might not put us in a substantially better position, since we would still have to find a solution to the safety/alignment problem and avoid it in the future.

 

4. Concluding remarks

We motivated our analysis of the increasing difficulty of AI safety on the following basis: current AI safety approaches build on the idea that, with the continuous development of both AI capabilities and AI safety techniques, we may address ensuing AI safety problems by using (current and) future AI systems for the task of AI safety engineering, solving the problems just in time. With regard to the alignment problem, the idea is then to keep AI systems sufficiently aligned at each step in their development towards increasing intelligence [Leike et al., 2022]. To evaluate this approach, we need to know how the AI safety problems scale for increasingly intelligent AIs.

In answering this question in a general way, we used a psychologically informed understanding of intelligence and the conceptual tools of RL to argue that 1) predicting the actions of an AI gets increasingly difficult for an increasingly intelligent AI system and is impossible for an AI that is sufficiently intelligent, and 2) AI alignment also gets increasingly difficult for increasingly intelligent systems, where this argument built on the understanding that aligning a given AI requires predicting the AI's behavior or its goal/value function.

Now, how should these two conclusions influence the probability that humanity will suffer an existential catastrophe due to loss of control over an AGI system, in short, p(doom|AGI by 2070)? First, since any approach to AI safety has to overcome the increasing difficulty of the AI safety problems for increasingly intelligent AI, one should generally update toward a higher p(doom|AGI by 2070).[4]
Second, we think that in particular, it should make us more skeptical of the roughly sketched just-in-time approaches that aim to use future AI to solve the AI safety problems. The reason is that while we know that the central AI safety problems, including alignment, get more difficult for increasingly intelligent AI, we do not know how such safety approaches scale with increasing intelligence since they have barely gotten started. To boil it down: We simply do not know what scales quicker, the difficulty of the alignment problem for increasingly intelligent AI, or progress on AI safety with help from current and future AI. We thus have to give a substantial probability that they do not scale well enough. As a result, (depending on how optimistic one was about such approaches) one has to update towards these approaches not working (due to them not scaling sufficiently[5]). Since this is one of the central approaches to AI safety currently, one needs to raise one's probability for doom accordingly (unless one expects alignment or control by default). 
Since this update to a higher probability for doom does not condition on a time frame, the central argument still holds if the development of AGI takes until 2070, so that one also has to increase one's p(doom|AGI by 2070). Naturally, the difference on longer time scales is that if safety approaches do not scale well enough, it is more likely that we find a different solution. As we mentioned before, these approaches do however also have to overcome the increasing difficulty of AI safety problems for increasingly intelligent AI, so that one still has to update to a higher p(doom|AGI by 2070) based on this. Longer timelines do however increase our chances of finding an approach that scales sufficiently.

Third, we considered reasons why solving the alignment problem under specific approaches might be very difficult. For several approaches, we run into the problem that (good) predictions of behavior or the value function (which we need for solving the alignment problem) have to incorporate the increasingly better model of the increasingly intelligent AI. This is bad news for our hope that safety approaches scale sufficiently and we currently do not know a good or approximate way around this. Concluding the obvious then, this should also increase one's p(doom|AGI by 2070) in proportion to how strong this case is for the different alignment approaches and how much weight one puts on the different approaches.

To emphasize the challenge we face, which becomes clear when considering the just-in-time approach to AI safety using future AI systems: in order to seriously evaluate an approach to AI safety, we need to know whether the safety efforts we apply keep the goals and behaviors of future AI systems (indefinitely) within tight enough bounds. Evaluating this requires us to be able to predict the goals and behaviors of future AI systems within sufficiently small bounds, in addition to a more precise quantitative account of how the safety problems scale with the increasing intelligence of AI systems. Only on the basis of a comparison of both could we then seriously evaluate approaches that assume AI safety engineering to be possible just in time with future AI before substantial risks arise.[6] Without this, we are flying blind. Currently, we unfortunately do not have the necessary basis to trust, be optimistic about, or be (positively) excited about our AI safety approaches.

 

 

References

Coelho Mollo, D. (2022). Intelligent behaviour. Erkenntnis. https://doi.org/10.1007/s10670-022-00552-8

Hadfield-Menell, D., Dragan, A., Abbeel, P., & Russell, S. (2016). Cooperative inverse reinforcement learning. https://arxiv.org/pdf/1606.03137

Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. https://arxiv.org/pdf/1906.01820

Krakovna, V., & Shah, R. (2023). [Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy. https://www.lesswrong.com/posts/a9SPcZ6GXAg9cNKdi/linkpost-some-high-level-thoughts-on-the-deepmind-alignment

Langosco, L., Koch, J., Sharkey, L., Pfau, J., Orseau, L., & Krueger, D. (2021). Goal misgeneralization in deep reinforcement learning. https://arxiv.org/pdf/2105.14111

Legg, S., & Hutter, M. (2007). Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4), 391–444. https://doi.org/10.1007/s11023-007-9079-x

Leike, J. (2022a). A minimal viable product for alignment: Bootstrapping a solution to the alignment problem. https://aligned.substack.com/p/alignment-mvp

Leike, J. (2022b). Why I'm optimistic about our alignment approach: Some arguments in favor and responses to common objections. https://aligned.substack.com/p/alignment-optimism

Leike, J., Schulman, J., & Wu, J. (2022). Our approach to alignment research. https://openai.com/blog/our-approach-to-alignment-research

OpenAI. (2018). Part 2: Kinds of RL algorithms — Spinning Up documentation. https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Parr, T., Pezzulo, G., & Friston, K. J. (2022). Active inference: The free energy principle in mind, brain, and behavior. The MIT Press. https://directory.doabooks.org/handle/20.500.12854/84566

Sajid, N., Ball, P. J., Parr, T., & Friston, K. J. (2021). Active inference: Demystified and compared. Neural Computation, 33(3), 674–712. https://doi.org/10.1162/neco_a_01357

Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., & Kenton, Z. (2022). Goal misgeneralization: Why correct specifications aren't enough for correct goals. https://arxiv.org/pdf/2210.01790

Yampolskiy, R. V. (2020a). Unexplainability and incomprehensibility of AI. Journal of Artificial Intelligence and Consciousness, 7(2), 277–291. https://doi.org/10.1142/S2705078520500150

Yampolskiy, R. V. (2020b). Unpredictability of AI: On the impossibility of accurately predicting all actions of a smarter agent. Journal of Artificial Intelligence and Consciousness, 7(1), 109–118. https://doi.org/10.1142/S2705078520500034

Yampolskiy, R. V. (2022). On the controllability of artificial intelligence: An analysis of limitations. Journal of Cyber Security and Mobility. https://doi.org/10.13052/jcsm2245-1439.1132

Zhuang, S., & Hadfield-Menell, D. (2021). Consequences of misaligned AI. NeurIPS. https://arxiv.org/pdf/2102.03896
 

  1. ^

    Besides OpenAI, it is also a part of Google DeepMind's approach [Krakovna and Shah, 2023]. However, we generally refer to OpenAI's approach here since reliance on future AI is most prominent there.

  2. ^

    This is a conceptual truth. To determine whether an agent is aligned with oneself, one needs to have a reasonably good prediction of its behavior, its goal/ value function, or something of similar significance since that is what we mean by alignment.

  3. ^

    In the RL framework, one might also think of predicting the model or the policy of the agent. However, predicting the agent’s model does not seem helpful by itself to determine if the agent is aligned with one’s goals and faces the limit and problem discussed above when trying to predict the AI's behavior. Predicting the policy also faces this problem. Indeed, the argument above relied on the policy to argue for the unpredictability of the behavior of a sufficiently intelligent AI.

  4. ^

    Briefly, we here assume that sufficient misalignment leads to overoptimization and in the limit to unbounded negative utility as mentioned above. We assume this to be synonymous with an existential catastrophe due to loss of control over an AGI. Also note that the two conclusions mentioned and this argument for a higher p(doom|AGI by 2070) are more general than our initial motivation, which however makes the case especially poignant.

  5. ^

    Note that we do not discuss further reasons for skepticism of this approach here.

  6. ^

    And only on that basis could we give a specific p(doom|AGI by 2070). We here abstain from this since we are unable to quantitatively express how our safety approaches as well as the safety problems scale for increasingly intelligent AI.


Comments

Could this translate to agents having difficulty predicting other agents' values and reactions, leading to a lesser likelihood of multiple agent systems acting as one?

If you have multiple AI agents in mind here, then yes, though it's not the focus. Otherwise, also yes, though depending on what you mean by "multiple agent systems acting as one", one could also see this as being fulfilled by some agents dominating other agents so that they (have to) act as one. I'd put it rather as: the difficulty of predicting other agents' actions and goals leads to the risks from AI and the difficulty of the alignment and control problem.