Vanessa Kosoy

Vanessa Kosoy's Comments

Clarifying "AI Alignment"

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Why doesn't this also apply to subjective regret bounds?

In order to get a subjective regret bound you need to consider an appropriate prior. The way I expect it to work is, the prior guarantees that some actions are safe in the short-term: for example, doing nothing to the environment and asking only sufficiently quantilized queries from the user (see this for one toy model of how "safe in the short-term" can be formalized). Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.
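As a toy illustration of what a quantilized query does (my own sketch; the linked toy model is more involved), a q-quantilizer samples from the top-q slice of a fixed base distribution over actions instead of taking an argmax, which bounds how much optimization pressure any single query can exert:

```python
import random

def quantilize(actions, base_weights, utility, q=0.1, rng=random):
    """Sample an action from the top-q fraction (by estimated utility)
    of the base distribution, instead of argmaxing over utility."""
    total = sum(base_weights)
    # Rank actions from best to worst according to the utility estimate.
    ranked = sorted(zip(actions, base_weights), key=lambda aw: -utility(aw[0]))
    # Keep the smallest top prefix holding at least q of the base measure.
    kept, mass = [], 0.0
    for a, w in ranked:
        kept.append((a, w))
        mass += w / total
        if mass >= q:
            break
    # Within the kept slice, sample in proportion to the base weights.
    return rng.choices([a for a, _ in kept], weights=[w for _, w in kept], k=1)[0]
```

With q = 1 this reduces to sampling from the base distribution, and as q shrinks it approaches argmax, so q directly trades optimization power against safety.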

Now, you can say "with the right prior intent-alignment also works". To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work. Indeed, we can imagine that the ontology on which the prior is defined includes a "true reward" symbol s.t., by definition, the semantics is whatever the user truly wants. An agent that maximizes expected true reward then can be said to be intent-aligned. If it's doing something bad from the user's perspective, then it is just an "innocent" mistake. But, unless we bake some specific assumptions about the true reward into the prior, such an agent can be anything at all.

Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.

This is related to what I call the distinction between "weak" and "strong feasibility". Weak feasibility means algorithms that are polynomial time in the number of states and actions, or the number of hypotheses. Strong feasibility is supposed to be something like, polynomial time in the description length of the hypothesis.

It is true that currently we only have strong feasibility results for relatively simple hypothesis spaces (such as support vector machines). But this seems to me just a symptom of advances in heuristics outpacing the theory. I don't see any reason in principle that significantly limits the strong feasibility results we can expect. Indeed, we already have some advances in providing a theoretical basis for deep learning.

However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability. Instead, I prefer studying safety on the weak feasibility level until we understand everything important on that level, and only then trying to extend it to strong feasibility. This creates somewhat of a conundrum where apparently the one thing that can convince you (and other people?) is the thing I don't think should be done soon.

I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models.

Can you explain what you mean here? I agree that just saying "subjective regret bound" is not enough, we need to understand all the assumptions the prior should satisfy, reflecting considerations such as, what kind of queries can or cannot manipulate the user. Hence the use of quantilization and debate in Dialogic RL, for example.

Vanessa Kosoy's Shortform

There is some similarity, but there are also major differences. They don't even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic for tweaking Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but the two will still be very different conditions.
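For readers comparing the two, this is roughly the kind of heuristic being referred to (a simplified sketch of the AUP idea, not TurnTrout's exact formulation): the primary reward is penalized by how much an action shifts a set of auxiliary Q-values relative to doing nothing.

```python
def aup_reward(primary, aux_qs, state, action, noop, scale=1.0):
    """AUP-style reward: primary reward minus a penalty for shifting
    auxiliary attainable-utility estimates relative to the no-op action."""
    penalty = sum(abs(q(state, action) - q(state, noop)) for q in aux_qs)
    return primary(state, action) - penalty / scale

# Hypothetical toy usage: one auxiliary Q-function that spikes for "press".
primary = lambda s, a: 1.0
q_aux = lambda s, a: 2.0 if a == "press" else 0.0
r_press = aup_reward(primary, [q_aux], "s0", "press", "noop")  # 1.0 - 2.0
r_wait = aup_reward(primary, [q_aux], "s0", "noop", "noop")    # 1.0 - 0.0
```

A dangerousness bound, by contrast, is a property one would prove about whatever policy results, not a modification of the reward signal; that is the type-signature mismatch described above.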

The reason I pointed out the relation to corrigibility is not because I think that's the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that "if you run this AI, this won't make things worse than not running the AI", no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.

From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can't (empirically and in the worst-case) prove a negative.

Clarifying "AI Alignment"

This opens the possibility of agents that make "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user.

Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.

The acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent-alignment is insufficient and you need "corrigibility" as another condition (also it is not so clear to me what this condition means).

If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Now, I do believe that if you set up the prior correctly then it won't happen, thanks to a mechanism like: Alpha knows that in case of dangerous uncertainty it is safe to fall back on some "neutral" course of action plus query the user (in specific, safe, ways). But this exactly shows that intent-alignment is not enough and you need further assumptions.

Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).

I guess you wouldn't count universality. Overall I agree.

Besides the fact ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I'm missing something.

I'm relatively pessimistic about mathematical formalization.

I am curious whether you can specify, as concretely as possible, what type of mathematical result would you have to see in order to significantly update away from this opinion.

I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.

No, I make no such assumption. A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user's subjective perspective. It is neither needed nor possible to prove that the AI can never enter a trap. For example, the AI is immune to acausal attacks to the extent that the user believes that the AI is not inside Beta's simulation. On the other hand, if the user believes that the simulation hypothesis needs to be taken into account, then the scenario amounts to legitimate acausal bargaining (which has its own complications to do with decision/game theory, but that's mostly a separate concern).
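To make the claim concrete, here is a schematic rendering of what a subjective regret bound asserts (my own notation, not a formal definition from any particular paper). Writing $\zeta$ for the user's subjective prior over environments and $V^{\pi}_{\zeta}$ for the $\zeta$-expected value of policy $\pi$, the requirement is that

$$\mathrm{SR}(\pi) := \sup_{\pi'} V^{\pi'}_{\zeta} - V^{\pi}_{\zeta}$$

is small. Nothing is asserted about environments to which $\zeta$ assigns negligible probability, which is why no assumption of the form "the user or the agent already knows about the traps" is needed.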

Clarifying "AI Alignment"

In this essay Paul Christiano proposes a definition of "AI alignment" which is more narrow than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be, helping the user), rather than what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.

Rohin Shah's comment on the essay (which I believe is endorsed by Paul) reframes it as a particular way to decompose the AI safety problem. An often used decomposition is "definition-optimization": first we define what it means for an AI to be safe, then we understand how to implement a safe AI. In contrast, Paul's definition of alignment decomposes the AI safety problem as "motivation-competence": first we learn how to design AIs with good motivations, then we learn how to make them competent. Both Paul and Rohin argue that the "motivation" is the urgent part of the problem, the part on which technical AI safety research should focus.

In contrast, I will argue that the "motivation-competence" decomposition is not as useful as Paul and Rohin believe, and the "definition-optimization" decomposition is more useful.

The thesis behind the "motivation-competence" decomposition implicitly assumes a linear, one-dimensional scale of competence. Agents with good motivations and subhuman competence might make silly mistakes but are not catastrophically dangerous (since they are subhuman). Agents with good motivations and superhuman competence will only make mistakes that are "forgivable" in the sense that our own mistakes would be as bad or worse. Ergo (the thesis concludes), good motivations are sufficient to solve AI safety.

However, in reality competence is multi-dimensional. AI systems can have subhuman skills in some domains and superhuman skills in other domains, as AI history has shown time and time again. This opens the possibility of agents that make "well intentioned" mistakes that take the form of sophisticated plans that are catastrophic for the user. Moreover, there might be limits to the agent's knowledge about certain questions (such as the user's preferences) that are inherent in the agent's epistemology (more on this below). Given such limits, the agent's competence becomes systematically lopsided. Furthermore, the elimination of such limits is a large part of the "definition" part in the "definition-optimization" framing that the thesis rejects.

As a consequence of the multi-dimensional nature of competence, the difference between "well intentioned mistake" and "malicious sabotage" is much less clear than naively assumed, and I'm not convinced there is a natural way to remove the ambiguity. For example, consider a superhuman AI Alpha subject to an acausal attack. In this scenario, some agent Beta in the "multiverse" (= prior) convinces Alpha that Alpha exists in a simulation controlled by Beta. The simulation is set up to look like the real Earth for a while, making it a plausible hypothesis. Then a "treacherous turn" moment arrives, in which the simulation diverges from Earth in a way calculated to make Alpha take irreversible actions that are beneficial for Beta and disastrous for the user.

In the above scenario, is Alpha "motivation-aligned"? We could argue it is not, because it is running the malicious agent Beta. But we could also argue it is motivation-aligned, and it just makes the innocent mistake of falling for Beta's trick. Perhaps it is possible to clarify the concept of "motivation" such that in this case Alpha's motivations are considered bad. But such a concept would depend in complicated ways on the agent's internals. I think that this is a difficult and unnatural approach, compared to "definition-optimization", where the focus is not on the internals but on what the agent actually does (more on this later).

The possibility of acausal attacks is a symptom of the fact that, environments with irreversible transitions are usually not learnable (this is the problem of traps in reinforcement learning, that I discussed for example here and here), i.e. it is impossible to guarantee convergence to optimal expected utility without further assumptions. When we add preference learning to the mix, the problem gets worse because now even if there are no irreversible transitions, it is not clear the agent will converge to optimal utility. Indeed, depending on the value learning protocol, there might be uncertainties about the user's preferences that the agent can never resolve (this is an example of what I meant by "inherent limits" before). For example, this happens in CIRL (even if the user is perfectly rational, this happens because the user and the AI have different action sets).

These difficulties with the "motivation-competence" framing are much more natural to handle in the "definition-optimization" framing. Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK). Specifically, the mathematical criteria of alignment I proposed are the "dynamic subjective regret bound" and the "dangerousness bound". The former is a criterion which simultaneously guarantees motivation-alignment and competence (as evidence that this criterion can be satisfied, I have the Dialogic Reinforcement Learning proposal). The latter is a criterion that doesn't guarantee competence in general, but specifically guarantees avoiding catastrophic mistakes. This makes it closer to motivation-alignment compared to subjective regret, but different in important ways: it refers to the actual things the agent does, and the ways in which those things might have catastrophic consequences.

In summary, I am skeptical that "motivation" and "competence" can be cleanly separated in a way that is useful for AI safety, whereas "definition" and "optimization" can be so separated: for example, the dynamic subjective regret bound is a "definition", whereas dialogic RL and putative more concrete implementations thereof are "optimizations". My specific proposals might have fatal flaws that weren't discovered yet, but I believe that the general principle of "definition-optimization" is sound, while "motivation-competence" is not.

Realism about rationality

It seems almost tautologically true that you can't accurately predict what an agent will do without actually running the agent. Because, any algorithm that accurately predicts an agent can itself be regarded as an instance of the same agent.

What I expect the abstract theory of intelligence to do is something like producing a categorization of agents in terms of qualitative properties. Whether that's closer to "momentum" or "fitness", I'm not sure the question is even meaningful.

I think the closest analogy is: abstract theory of intelligence is to AI engineering as complexity theory is to algorithm design. Knowing the complexity class of a problem doesn't tell you the best practical way to solve it, but it does give you important hints. (For example, if the problem has exponential time complexity then you can only expect to solve it either for small inputs or in some special cases, and average-case complexity tells you whether these cases need to be very special or not. If the problem is in NC then you know that it's possible to gain a lot from parallelization. If the problem is in NP then at least you can test candidate solutions, et cetera.)

And also, abstract theory of alignment should be to AI safety as complexity theory is to cryptography. Once again, many practical considerations are not covered by the abstract theory, but the abstract theory does tell you what kind of guarantees you can expect and when. (For example, in cryptography we can (sort of) know that a certain protocol has theoretical guarantees, but there is engineering work finding a practical implementation and ensuring that the assumptions of the theory hold in the real system.)

Realism about rationality

I think that ricraz claims that it's impossible to create a mathematical theory of rationality or intelligence, and that this is a crux, not so? On the other hand, the "momentum vs. fitness" comparison doesn't make sense to me. Specifically, a concept doesn't have to be crisply well-defined in order to use it in mathematical models. Even momentum, which is truly one of the "crisper" concepts in science, is no longer well-defined when spacetime is not asymptotically flat (which it isn't). Much less so are concepts such as "atom", "fitness" or "demand". Nevertheless, physicists, biologists and economists continue to successfully construct and apply mathematical models grounded in such fuzzy concepts. Although, in some sense I also endorse the "strawman" that rationality is more like momentum than like fitness (at least some aspects of rationality).

Realism about rationality

In this essay, ricraz argues that we shouldn't expect a clean mathematical theory of rationality and intelligence to exist. I have debated em about this, and I continue to endorse more or less everything I said in that debate. Here I want to restate some of my (critical) position by building it from the ground up, instead of responding to ricraz point by point.

When should we expect a domain to be "clean" or "messy"? Let's look at everything we know about science. The "cleanest" domains are mathematics and fundamental physics. There, we have crisply defined concepts and elegant, parsimonious theories. We can then "move up the ladder" from fundamental to emergent phenomena, going through high energy physics, molecular physics, condensed matter physics, biology, geophysics / astrophysics, psychology, sociology, economics... On each level more "mess" appears. Why? Occam's razor tells us that we should prioritize simple theories over complex theories. But, we shouldn't expect a theory to be more simple than the specification of the domain. The general theory of planets should be simpler than a detailed description of planet Earth, the general theory of atomic matter should be simpler than the theory of planets, the general theory of everything should be simpler than the theory of atomic matter. That's because when we're "moving up the ladder", we are actually zooming in on particular phenomena, and the information we need to specify "where to zoom in" translates into the description complexity of the theory.

What does it mean in practice about understanding messy domains? The way science solves this problem is by building a tower of knowledge. In this tower, each floor benefits from the interactions both with the floor above it and the floor beneath it. Without understanding macroscopic physics we wouldn't figure out atomic physics, and without figuring out atomic physics we wouldn't figure out high energy physics. This is knowledge "flowing down". But knowledge also "flows up": knowledge of high energy physics allows understanding particular phenomena in atomic physics, knowledge of atomic physics allows predicting the properties of materials and chemical reactions. (Admittedly, some floors in the tower we have now are rather ramshackle, but I think that ultimately the "tower method" succeeds everywhere, as much as success is possible at all).

How does mathematics come in here? Importantly, mathematics is not used only on the lower floors of the tower, but on all floors. The way "messiness" manifests is, the mathematical models for the higher floors are either less quantitatively accurate (but still contain qualitative inputs) or have a lot of parameters that need to be determined either empirically, or using the models of the lower floors (which is one way how knowledge flows up), or some combination of both. Nevertheless, scientists continue to successfully build and apply mathematical models even in "messy" fields like biology and economics.

So, what does it all mean for rationality and intelligence? On what floor does it sit? In fact, the subject of rationality and intelligence is not a single floor, but its own tower (maybe we should imagine science as a castle with many towers connected by bridges).

The foundation of this tower should be the general abstract theory of rationality. This theory is even more fundamental than fundamental physics, since it describes the principles from which all other knowledge is derived, including fundamental physics. We can regard it as a "theory of everything": it predicts everything by making those predictions that a rational agent should make. Solomonoff's theory and AIXI are part of this foundation, but not all of it. Considerations like computational resource constraints should also enter the picture: complexity theory teaches us that they are also fundamental, they don't require "zooming in" a lot.

But, computational resource constraints are only entirely natural when they are not tied to a particular model of computation. This only covers constraints such as "polynomial time" but not constraints such as quadratic time, and even less so concrete wall-clock time. Therefore, once we introduce a particular model of computation (such as a RAM machine), we need to build another floor in the tower, one that will necessarily be "messier". Considering even more detailed properties of the hardware we have, the input/output channels we have, the goal system, the physical environment and the software tools we employ will correspond to adding more and more floors.

Once we agree that it should be possible to create a clean mathematical theory of rationality and intelligence, we can still debate whether it's useful. If we consider the problem of creating aligned AGI from an engineering perspective, it might seem for a moment that we don't really need the bottom layers. After all, when designing an airplane you don't need high energy physics. Well, high energy physics might help indirectly: perhaps it allowed predicting some exotic condensed matter phenomenon which we used to make a better power source, or better materials from which to build the aircraft. But often we can make do without those.

Such an approach might be fine, except that we also need to remember the risks. Now, safety is part of most engineering, and is definitely a part of airplane design. What level of the tower does it require? It depends on the kind of risks you face. If you're afraid the aircraft will not handle the stress and break apart, then you need mechanics and aerodynamics. If you're afraid the fuel will combust and explode, you better know chemistry. If you're afraid a lightning will strike the aircraft, you need knowledge of meteorology and electromagnetism, possibly plasma physics as well. The relevant domain of knowledge, and the relevant floor in the tower is a function of the nature of the risk.

What level of the tower do we need to understand AI risk? What is the source of AI risk? It is not in any detailed peculiarities of the world we inhabit. It is not in the details of the hardware used by the AI. It is not even related to a particular model of computation. AI risk is the result of Goodhart's curse, an extremely general property of optimization systems and intelligent agents. Therefore, addressing AI risk requires understanding the general abstract theory of rationality and intelligence. The upper floors will be needed as well, since the technology itself requires the upper floors (and since we're aligning with humans, who are messy). But, without the lower floors the aircraft will crash.

Vanessa Kosoy's Shortform

Some thoughts about embedded agency.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification effectively destroys an agent and replaces it by a different agent[1]. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then Clippy 2.0 can be considered a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology of the external world. Formally, such an ontology can be an incomplete[2] Markov chain, the reward function being a function of the state. Examples:

  • The Markov chain is a representation of known physics (or some sector of known physics). The reward corresponds to the total mass of diamond in the world. To make this example work, we only need enough physics to be able to define diamonds. For example, we can make do with quantum electrodynamics + classical gravity and have the Knightian uncertainty account for all nuclear and high-energy phenomena.

  • The Markov chain is a representation of people and social interactions. The reward corresponds to concepts like "happiness" or "friendship" et cetera. Everything that falls outside the domain of human interactions is accounted for by Knightian uncertainty.

  • The Markov chain is Botworld with some of the rules left unspecified. The reward is the total number of a particular type of item.
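The "incomplete Markov chain" in the examples above can be given a minimal computational rendering (types and names are my own choices): Knightian uncertainty is modeled by keeping a *set* of candidate transition distributions per state, as in a credal set, and planning against the worst case over that set.

```python
from typing import Callable, Dict, List

State = str
Dist = Dict[State, float]  # distribution over successor states

class IncompleteChain:
    """A Markov chain with Knightian uncertainty: each state carries a set
    of candidate transition distributions rather than a single one."""

    def __init__(self, credal: Dict[State, List[Dist]],
                 reward: Callable[[State], float]):
        self.credal = credal
        self.reward = reward

    def worst_case_value(self, state: State) -> float:
        """One-step worst case: the infimum over the credal set of the
        expected reward of the successor state."""
        return min(
            sum(p * self.reward(s2) for s2, p in dist.items())
            for dist in self.credal[state]
        )

# Toy instance: from "s" either surely reach "a", or reach "a"/"b" 50/50.
chain = IncompleteChain(
    credal={"s": [{"a": 1.0}, {"a": 0.5, "b": 0.5}]},
    reward=lambda st: 1.0 if st == "a" else 0.0,
)
```

A "refinement" in the sense used below simply shrinks each credal set, removing part of the uncertainty.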

Now we need to somehow connect the agent to the ontology. Essentially we need a way of drawing Cartesian boundaries inside the (a priori non-Cartesian) world. We can accomplish this by specifying a function that assigns an observation and projected action to every state out of some subset of states. Entering this subset corresponds to agent creation, and leaving it corresponds to agent destruction. For example, we can take the ontology to be Botworld + marked robot and the observations and actions to be the observations and actions of that robot. If we don't want the marking of a particular robot to be part of the ontology, we can use a more complicated definition of Cartesian boundary that specifies a set of agents at each state plus the data needed to track these agents across time (in this case, the observation and action depend to some extent on the history and not only the current state). I will leave out the details for now.

Finally, we need to define the prior. To do this, we start by choosing some prior over refinements of the ontology. By "refinement", I mean removing part of the Knightian uncertainty, i.e. considering incomplete hypotheses which are subsets of the "ontological belief". For example, if the ontology is underspecified Botworld, the hypotheses will specify some of what was left underspecified. Given such an "objective" prior and a Cartesian boundary, we can construct a "subjective" prior for the corresponding agent. We transform each hypothesis by postulating that taking an action that differs from the projected action leads to a "Nirvana" state. Alternatively, we can allow for stochastic action selection and use the gambler construction.
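The transformation in the second-to-last sentence can be sketched as follows (a toy version with deterministic dynamics; the names are mine):

```python
NIRVANA = "nirvana"  # absorbing state, conventionally assigned maximal reward

def nirvana_transform(step, projected_action):
    """Turn an 'objective' hypothesis (step: state, action -> next state)
    into the agent's 'subjective' hypothesis: any action differing from the
    action the Cartesian boundary projects for the agent leads to Nirvana,
    so such branches never count against the agent in expectation."""
    def subjective_step(state, action):
        if state == NIRVANA:
            return NIRVANA  # Nirvana is absorbing
        if action != projected_action(state):
            return NIRVANA  # deviation from the projected action
        return step(state, action)
    return subjective_step
```

Because Nirvana carries maximal reward, hypotheses in which the agent's "body" fails to enact the agent's chosen action impose no cost, which is the point of the construction.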

Does this framework guarantee effective planning for death? A positive answer would correspond to some kind of learnability result (regret bound). To get learnability, we will first need the reward to be either directly or indirectly observable. By "indirectly observable" I mean something like with semi-instrumental reward functions, but accounting for agent mortality. I am not ready to formulate the precise condition atm. Second, we need to consider an asymptotic in which the agent is long lived (in addition to time discount being long-term), otherwise it won't have enough time to learn. Third (this is the trickiest part), we need the Cartesian boundary to flow with the asymptotic as well, making the agent "unspecial". For example, consider Botworld with some kind of simplicity prior. If I am a robot born at cell zero and time zero, then my death is an event of low description complexity. It is impossible to be confident about what happens after such a simple event, since there will always be competing hypotheses with different predictions and a probability that is only lower by a bounded factor. On the other hand, if I am a robot born at cell 2439495 at time 9653302, then it would be surprising if the outcome of my death were qualitatively different from the outcome of the death of any other robot I observed. Finding a natural, rigorous and general way to formalize this condition is a very interesting problem. Of course, even without learnability we can strive for Bayes-optimality or some approximation thereof. But it is still important to prove learnability under certain conditions, to test that this framework truly models rational reasoning about death.

Additionally, there is an intriguing connection between some of these ideas and UDT, if we consider TRL agents. Specifically, a TRL agent can have a reward function that is defined in terms of computations, exactly like UDT is often conceived. For example, we can consider an agent whose reward is defined in terms of a simulation of Botworld, or in terms of taking expected value over a simplicity prior over many versions of Botworld. Such an agent would be searching for copies of itself inside the computations it cares about, which may also be regarded as a form of "embeddedness". It seems like this can be naturally considered a special case of the previous construction, if we allow the "ontological belief" to include beliefs pertaining to computations.


  1. Unless it's some kind of modification that we treat explicitly in our model of the agent, for example a TRL agent reprogramming its own envelope. ↩︎

  2. "Incomplete" in the sense of Knightian uncertainty, like in quasi-Bayesian RL. ↩︎

2019 AI Alignment Literature Review and Charity Comparison

Thank you for writing this impressive review!

Some comments on MIRI's non-disclosure policy.

First, some disclosure :) My research is funded by MIRI. On the other hand, all of my opinions are my own and do not represent MIRI or anyone else associated with MIRI.

The non-disclosure policy has no direct effect on me, but naturally, both before and after it was promulgated, I used my own judgement to decide what should or should not be made public. The vast majority of my work I do make public (subject only to the cost of time and effort to write and explain it), because if I think something would increase risk rather than reduce it[1], then I don't pursue this line of inquiry in the first place. Things I don't make public are mostly early stage ideas that I don't develop.

I think it is fair enough to judge AI alignment orgs only by the public output they produce. However, it doesn't at all follow that a non-disclosure policy leads to immediate disqualification, as you seem to imply. You can judge an org by its public output whether or not all of its output is public. This is somewhat similar to the observation that management overhead is a bad metric. Yes, some of your money goes into something that doesn't immediately and directly translate into benefit. All else equal, you want that not to happen. But all else is not equal, and can never be equal.


  1. This is completely tangential, but I think we need more public discussion on how do we decide whether making something public is beneficial vs. detrimental. ↩︎

Clarifying Power-Seeking and Instrumental Convergence

One idea for how this formalism might be improved: Consider a random directed graph, sampled from some "reasonable" distribution (in some sense that needs to be defined). We can then define "powerful" vertices as vertices from which there are paths to most other vertices. Claim: with high probability over graphs, powerful vertices are connected "robustly" to most vertices. By "robustly" I mean that small changes in the graph don't disrupt the connection. This is because, if your vertex is connected to everything, then disconnecting some edges should still leave plenty of room for rerouting through other vertices. We can then interpret this as saying: gaining power is more robust to inaccuracies of the model, or changes in circumstances, than pursuing more "direct" paths to objectives.
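The claim is easy to probe numerically. A toy experiment (my sketch; "powerful" is operationalized as maximal reachability in a sparse random digraph, and "small change" as deleting a few random edges):

```python
import random
from collections import deque

def reachable(n, edges, src):
    """Vertices reachable from src via BFS over the directed edges."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def robustness_experiment(n=50, p=0.08, drops=5, seed=0):
    """Reachable-set size of the most 'powerful' vertex, before and after
    deleting a few random edges from an Erdos-Renyi digraph G(n, p)."""
    rng = random.Random(seed)
    edges = [(u, v) for u in range(n) for v in range(n)
             if u != v and rng.random() < p]
    # The "powerful" vertex reaches the most other vertices.
    power = max(range(n), key=lambda v: len(reachable(n, edges, v)))
    before = len(reachable(n, edges, power))
    perturbed = list(edges)
    for _ in range(drops):
        perturbed.pop(rng.randrange(len(perturbed)))
    after = len(reachable(n, perturbed, power))
    return before, after
```

In the supercritical regime (expected out-degree np ≈ 4 here) one expects `after` to stay close to `before`, matching the rerouting intuition; a formal version of the claim would need a precise class of "reasonable" graph distributions.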
