Vanessa Kosoy

AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

The Reasonable Effectiveness of Mathematics or: AI vs sandwiches

In this post I speculated on the reasons why mathematics is so useful so often, and I still stand behind it. The context, though, is the ongoing debate in the AI alignment community between the proponents of heuristic approaches and empirical research[1] ("prosaic alignment") and the proponents of building foundational theory and mathematical analysis (as exemplified in MIRI's "agent foundations" and my own "learning-theoretic" research agendas).

Previous volleys in this debate include Ngo's "realism about rationality" (on the anti-theory side), the pro-theory replies (including my own) and Yudkowsky's "the rocket alignment problem" (on the pro-theory side).

Unfortunately, it doesn't seem like any of the key participants budged much on their position, AFAICT. If progress on this is possible, then it probably requires both sides working harder to make their cruxes explicit.

  1. To be clear, I'm in favor of empirical research, I just think that we need theory to guide it and interpret the results. ↩︎

Clarifying inner alignment terminology

This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.

In the following, I'll try going over some of the definitions and explicating my understanding/confusion regarding each. The definitions I omitted either explicitly refer to these or have analogous structure.

(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.

This one is more or less clear. Even though it's not a formal definition, it doesn't have to be: after all, this is precisely the problem we are trying to solve.

Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.

The "behavioral objective" is defined in a linked page as:

The behavioral objective is what an optimizer appears to be optimizing for. Formally, the behavioral objective is the objective recovered from perfect inverse reinforcement learning.

This is already thorny territory, since it's far from clear what is "perfect inverse reinforcement learning". Intuitively, an "intent aligned" agent is supposed to be one whose behavior demonstrates an aligned objective, but it can still make mistakes with catastrophic consequences. The example I imagine is: an AI researcher who is unwittingly building transformative unaligned AI.
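The underdetermination is easy to see even in a one-shot toy setting. Below, "perfect IRL" is idealized as enumerating every reward assignment under which the observed behavior is optimal; the function name, the action set and the reward grid are all illustrative, not anything from the post under review:

```python
from itertools import product

def perfect_irl(actions, observed_action, reward_values=(-1, 0, 1)):
    """Toy 'perfect IRL' for a one-shot setting: return every assignment of
    rewards to actions under which the observed action is optimal."""
    consistent = []
    for rewards in product(reward_values, repeat=len(actions)):
        table = dict(zip(actions, rewards))
        if table[observed_action] == max(table.values()):
            consistent.append(table)
    return consistent

# Even in this trivial setting the recovered "behavioral objective" is far
# from unique: many reward tables rationalize the same observed choice.
fits = perfect_irl(["a", "b"], "a")
```

Of the nine candidate reward tables, six rationalize choosing "a", which is the sense in which "the objective recovered from perfect inverse reinforcement learning" needs further specification before it picks out a unique objective.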

Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.

This is confusing because it's unclear what counts as "well" and what are the underlying assumptions. The no-free-lunch theorems imply that an agent cannot perform too well off-distribution, unless you're still constraining the distribution somehow. I'm guessing that either this agent is doing online learning or it's detecting off-distribution and failing gracefully in some sense, or maybe some combination of both.
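One toy reading of "detecting off-distribution and failing gracefully": the agent tracks how well its model predicts recent observations and defers to a safe default when the average surprisal exceeds a budget. The function, the bit budget, and the "defer" action are all my own illustrative choices, not anything defined in the post:

```python
import math

def act_with_ood_guard(model_prob, observations, policy, max_bits=2.0):
    """Follow the policy in-distribution; defer (fail gracefully) when the
    model's average surprisal on recent observations exceeds a bit budget."""
    surprisal = -sum(math.log2(model_prob(o)) for o in observations) / len(observations)
    if surprisal > max_bits:
        return "defer"  # looks off-distribution: fall back to a safe default
    return policy(observations)
```

A model assigning probability 0.5 to familiar observations stays under a 2-bit budget (1 bit per observation), while observations it assigns probability 0.01 cost about 6.6 bits each and trigger the guard.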

Notably, the post asserts the implication intent alignment + capability robustness => impact alignment. Now, let's go back to the example of the misguided AI researcher. In what sense are they not "capability robust"? I don't know.

Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.

The "mesa-objective" is defined in the linked page as:

A mesa-objective is the objective of a mesa-optimizer.

So it seems like we could replace "mesa-objective" with just "objective". This is confusing, because in other places the author felt the need to use "behavioral objective", but here he is referring to some other notion of objective, and it's not clear what the difference is.

  1. I guess that different people have different difficulties. I often hear that my own articles are difficult to understand because of the dense mathematics. But for me, it is the absence of mathematics which is difficult! ↩︎

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI.

The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn't do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers.

The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn't have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior.

(What follows is no longer a "review" per se, so much as a summary of my own thoughts on the topic.)

Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables.

Fix a set of actions A and a set of observations O. We start with an ontological model which is a crisp infra-POMDP. That is, there is a set of states S₀, an initial state s₀ ∈ S₀, a transition infra-kernel T₀ : S₀ × A → □(S₀ × O) and a reward function r₀ : S₀ → ℝ. Here, □X stands for the space of closed convex sets of probability distributions on X. In other words, this is a POMDP with an underspecified transition kernel.

We then build a prior which consists of refinements of the ontological model. That is, each hypothesis in the prior is an infra-POMDP with state space S, initial state s ∈ S, transition infra-kernel T : S × A → □(S × O) and an interpretation mapping i : S → S₀ which is a morphism of infra-POMDPs (i.e. i(s) = s₀ and the obvious diagram of transition infra-kernels commutes). The reward function on S is just the composition r₀ ∘ i. Notice that while the ontological model must be an infra-POMDP to get a non-degenerate learning agent (moreover, it can be desirable to make it non-dogmatic about observables in some formal sense), the hypotheses in the prior can also be ordinary (Bayesian) POMDPs.

Given such a prior plus a time discount function, we can consider the corresponding infra-Bayesian agent (or even just a Bayesian agent if we choose all hypotheses to be Bayesian). Such an agent optimizes rewards which depend on latent variables, even though it does not know the correct world-model in advance. It does fit the world to the immutable ontological model (which is necessary to make sense of the latent variables to which the reward function refers), but the ontological model has enough freedom to accommodate many possible worlds.
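The structure above can be sketched as data. This is only a finite stand-in for the real definitions: closed convex sets of distributions are idealized as tuples of distributions, and all names are my own illustrative choices:

```python
from dataclasses import dataclass
from typing import Callable

# A "crisp infradistribution" over X is idealized here as a tuple of
# distributions (a finite stand-in for a closed convex set of them).
Dist = dict        # outcome -> probability
InfraDist = tuple  # tuple of Dist

@dataclass
class InfraPOMDP:
    states: frozenset
    initial: object  # an element of `states`
    # transition: (state, action) -> convex set of distributions over
    # (next state, observation) pairs
    transition: Callable[[object, object], InfraDist]

@dataclass
class OntologicalModel(InfraPOMDP):
    reward: Callable[[object], float]  # r0, defined on ontological states

@dataclass
class Hypothesis(InfraPOMDP):
    interp: Callable[[object], object]  # i : S -> S0, the interpretation map

def respects_initial(h: Hypothesis, m: OntologicalModel) -> bool:
    """Part of the morphism condition: i sends s to s0."""
    return h.interp(h.initial) == m.initial

def induced_reward(h: Hypothesis, m: OntologicalModel):
    """Reward on the hypothesis' states: the composition r0 ∘ i."""
    return lambda s: m.reward(h.interp(s))
```

The point of the sketch is that every hypothesis carries its own richer state space, but rewards are only ever evaluated through the interpretation map back into the fixed ontological model.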

The next question is then how we would transfer such a utility function from the user to the AI. Here, as noted by Demski, we want the AI to use not just the user's utility function but also the user's prior, because we want running such an AI to be rational from the subjective perspective of the user. This creates a puzzle: if the AI is using the same prior, and the user behaves nearly-optimally for their own prior (since otherwise how would we even infer the utility function and prior), how can the AI outperform the user?

The answer, I think, is via the AI having different action/observation channels from the user. At first glance this might seem unsatisfactory: we expect the AI to be "smarter", not just to have better peripherals. However, using Turing RL we can represent the former as a special case of the latter. Specifically, part of the additional peripherals is access to a programmable computer, which effectively gives the AI a richer hypothesis space than the user.

The formalism I outlined here leaves many questions, for example what kind of learning guarantees to expect in the face of possible ambiguities between observationally indistinguishable hypotheses[1]. Nevertheless, I think it creates a convenient framework for studying the question raised in the post. A potential different approach is using infra-Bayesian physicalism, which also describes agents with utility functions that depend on latent variables. However, it is unclear whether it's reasonable to apply the latter to humans.

  1. See also my article "RL with imperceptible rewards" ↩︎

Inaccessible information

This post defines and discusses an informal notion of "inaccessible information" in AI.

AIs are expected to acquire all sorts of knowledge about the world in the course of their training, including knowledge only tangentially related to their training objective. The author proposes to classify this knowledge into "accessible" and "inaccessible" information. In my own words, information inside an AI is "accessible" when there is a straightforward way to set up a training protocol that will incentivize the AI to reliably and accurately communicate this information to the user. Otherwise, it is "inaccessible". This distinction is meaningful because, by default, the inner representation of all information is opaque (e.g. weights in an ANN) and notoriously hard to make sense of by human operators.

The primary importance of this concept is in the analysis of competitiveness between aligned and unaligned AIs. This is because it might be that aligned plans are inaccessible (since it's hard to reliably specify whether a plan is aligned) whereas certain unaligned plans are accessible (e.g. because it's comparatively easy to specify whether a plan produces many paperclips). The author doesn't mention this, but I think that there is also another reason, namely that unaligned subagents effectively have access to information that is inaccessible to us.

More concretely, approaches such as IDA and debate rely on leveraging certain accessible information: for debate it is "what would convince a human judge", and for IDA-of-imitation it is "what would a human come up with if they think about this problem for such and such time". But this accessible information is only a proxy for what we care about ("how to achieve our goals"). Even assuming this proxy doesn't produce goodharting, we are still left with a performance penalty for the indirection. That is, a paperclip maximizer reasons directly about "how to maximize paperclips", leveraging all information it has, whereas an IDA-of-imitation only reasons about "how to achieve human goals" via the information it has about "what would a human come up with".

The author seems to believe that finding a method to "unlock" this inaccessible information will solve the competitiveness problem. On the other hand I am more pessimistic. I consider it likely that there is an inherent tradeoff between safety and performance, and therefore any such method would either expose another attack vector or introduce another performance penalty.

The author himself says that "MIRI’s approach to this problem could be described as despair + hope you can find some other way to produce powerful AI". I think that my approach is despair(ish) + a different hope. Namely, we need to ensure a sufficient period during which (i) aligned superhuman AIs are deployed and (ii) no unaligned transformative AIs are deployed, and leverage it to set up a defense system. That said, I think the concept of "inaccessible information" is interesting and thinking about it might well produce important progress in alignment.

The Solomonoff Prior is Malign

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.

I think it's just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis for which no other explanation exists.

my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space

I don't know what it means "not to have control over the hypothesis space". The programmers write specific code. This code works well for some hypotheses and not for others. Ergo, you control the hypothesis space.

This gets back to things like whether we can get good hypotheses without a learning agent that's searching for good hypotheses, and whether we can get good updates without a learning agent that's searching for good metacognitive update heuristics, etc., where I'm thinking "no" and you "yes"

I'm not really thinking "yes"? My TRL framework (of which physicalism is a special case) is specifically supposed to model metacognition / self-improvement.

At the same time, I'm maybe more optimistic than you about "Just don't do weird reconceptualizations of your whole ontology based on anthropic reasoning" being a viable plan, implemented through the motivation system.

I can imagine using something like antitraining here, but it's not trivial.

You yourself presumably haven't spent much time pondering metacosmology. If you did spend that time, would you actually come to believe the acausal attackers' story?

First, the problem with acausal attack is that it is point-of-view-dependent. If you're the Holy One, the simulation hypothesis seems convincing, if you're a milkmaid then it seems less convincing (why would the attackers target a milkmaid?) and if it is convincing then it might point to a different class of simulation hypotheses. So, if the user and the AI can both be attacked, it doesn't imply they would converge to the same beliefs. On the other hand, in physicalism I suspect there is some agreement theorem that guarantees converging to the same beliefs (although I haven't proved that).

Second... This is something that still hasn't crystallized in my mind, so I might be confused, but: I think that cartesian agents actually can learn to be physicalists. The way it works is, you get a cartesian hypothesis which is in itself a physicalist agent whose utility function is something like maximizing its own likelihood-as-a-cartesian-hypothesis. Notably, this carries a performance penalty (as Paul noticed), since this subagent has to be computationally simpler than you.

Maybe, this is how humans do physicalist reasoning (such as, reasoning about the actual laws of physics). Because of the inefficiency, we probably keep this domain specific and use more "direct" models for domains that don't require physicalism. And, the cost of this construction might also explain why it took us so long as a civilization to start doing science properly. Perhaps, we struggled against physicalist epistemology as we tried to keep the Earth in the center of the universe and rebelled against the theory of evolution and materialist theories of the mind.

Now, if AI learns physicalism like this, does it help against acausal attacks? On the one hand, yes. On the other hand, it might be out of the frying pan and into the fire. Instead of (more precisely, in addition to) a malign simulation hypothesis, you get a different hypothesis which is also an unaligned agent. While two physicalists with identical utility functions should agree (I think), two "internal physicalists" inside different cartesian agents have different utility functions and AFAIK can produce egregious misalignment (although I haven't worked out detailed examples).

The Solomonoff Prior is Malign

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

Why? Maybe you're thinking of UDT? In which case, it's sort of true but IBP is precisely a formalization of UDT + extra nuance regarding the input of the utility function.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent.

Well, IBP is explained here. I'm not sure what kind of non-IBP agent you're imagining.

The Solomonoff Prior is Malign

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members.

Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don't stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it's still counterintuitive to stab them for their own good, but so is e.g. cutting people up with scalpels or injecting them with substances derived from pathogens, and we do that to people for their own good. People also do counterintuitive things literally because they believe gods would send them to hell or heaven.

In AGI design, I think we would want a stronger guarantee than that. And I think it would maybe look like a system that detects these kinds of conflicted motivations, and just not act on them.

This is pretty similar to the idea of confidence thresholds. The problem is, if every tiny conflict causes the AI to pause then it will always pause. Whereas if you leave some margin, the malign hypotheses will win, because, from a cartesian perspective, they are astronomically more likely (they explain so many bits that the true hypothesis leaves unexplained).
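A toy numerical version of this dilemma, with every number and name invented for illustration (a 2^-K prior over two hypotheses, one of which is 30 bits simpler from the cartesian perspective):

```python
def posterior(desc_bits):
    """Posterior over hypotheses ∝ 2^-K for description length K in bits,
    pretending all hypotheses fit the observations equally well."""
    w = {h: 2.0 ** -k for h, k in desc_bits.items()}
    z = sum(w.values())
    return {h: x / z for h, x in w.items()}

def act_or_pause(post, recommends, margin):
    """Crude confidence-threshold rule: act only if hypotheses holding all
    but `margin` of the posterior mass recommend the same action."""
    mass = {}
    for h, p in post.items():
        a = recommends[h]
        mass[a] = mass.get(a, 0.0) + p
    best = max(mass, key=mass.get)
    return best if mass[best] >= 1.0 - margin else "pause"

# Made-up description lengths: the malign hypothesis is 30 bits simpler,
# so it holds all but ~2^-30 of the posterior mass.
post = posterior({"true": 130, "malign": 100})
```

With a loose margin (say 10^-3) the malign recommendation goes through; a margin tight enough to block it (below ~2^-30 ≈ 10^-9) would also make essentially any residual disagreement trigger a pause.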

The Solomonoff Prior is Malign

This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.

Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as "unintuitive notion of simplicity" and "the Solomonoff prior is very strange". This is also why the author thinks the speed prior might help and that "since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world". In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it's cartesian (more on "cartesian" later).

Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about "possible universes" and "simulation hypotheses", which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and it's still unclear how to deal with the fact that Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an informal, intuitive picture which seems to me already quite compelling, leaving the formalization for the future.

Imagine that you wake up, without any memories of the past but with knowledge of some language and reasoning skills. You find yourself in the center of a circle drawn with chalk on the floor, with seven people in funny robes surrounding it. One of them (apparently the leader), comes forward, tears streaking down his face, and speaks to you:

"Oh Holy One! Be welcome, and thank you for gracing us with your presence!"

With that, all the people prostrate on the floor.

"Huh?" you say "Where am I? What is going on? Who am I?"

The leader gets up to his knees.

"Holy One, this is the realm of Bayaria. We," he gestures at the other people "are known as the Seven Great Wizards and my name is El'Azar. For thirty years we worked on a spell that would summon You out of the Aether in order to aid our world. For we are in great peril! Forty years ago, a wizard of great power but little wisdom had cast a dangerous spell, seeking to multiply her power. The spell had gone awry, destroying her and creating a weakness in the fabric of our cosmos. Since then, Unholy creatures from the Abyss have been gnawing at this weakness day and night. Soon, if nothing is done to stop it, they will manage to create a portal into our world, and through this portal they will emerge and consume everything, leaving only death and chaos in their wake."

"Okay," you reply "and what does it have to do with me?"

"Well," says El'Azar "we are too foolish to solve the problem through our own efforts in the remaining time. But, according to our calculations, You are a being of godlike intelligence. Surely, if You applied yourself to the conundrum, You will find a way to save us."

After a brief introspection, you realize that you possess a great desire to help whomever has summoned you into the world. A clever trick inside the summoning spell, no doubt (not that you care about the reason). Therefore, you apply yourself diligently to the problem. At first, it is difficult, since you don't know anything about Bayaria, the Abyss, magic or almost anything else. But you are indeed very intelligent, at least compared to the other inhabitants of this world. Soon enough, you figure out the secrets of this universe to a degree far surpassing that of Bayaria's scholars. Fixing the weakness in the fabric of the cosmos now seems like child's play. Except...

One question keeps bothering you. Why are you yourself? Why did you open your eyes and find yourself to be the Holy One, rather than El'Azar, or one of the Unholy creatures from the Abyss, or some milkmaid from the village of Elmland, or even a random clump of water in the Western Sea? Since you happen to be a dogmatic logical positivist (cartesian agent), you search for a theory that explains your direct observations. And your direct observations are a function of who you are, and not just of the laws of the universe in which you exist. (The logical positivism seems to be an oversight in the design of the summoning spell, not that you care.)

Applying your mind to the task, you come up with a theory that you call "metacosmology". This theory allows you to study the distribution of possible universes with simple laws that produce intelligent life, and the distribution of the minds and civilizations they produce. Of course, any given such universe is extremely complex and even with your superior mind you cannot predict what happens there in much detail. However, some aggregate statistical properties of the overall distribution are possible to estimate.

Fortunately, all this work is not for naught. Using metacosmology, you discover something quite remarkable. A lot of simple universes contain civilizations that would be inclined to simulate a world quite like the one you find yourself in. Now, the world is simple, and none of its laws are explained that well by the simulation hypothesis. But the simulation hypothesis is a great explanation for why you are the Holy One! For indeed, the simulators would be inclined to focus on the Holy One's point of view, and encode the simulation of this point of view in the simplest microscopic degrees of freedom in their universe that they can control. Why? Precisely so that the Holy One decides she is in such a simulation!

Having resolved the mystery, you smile to yourself. For now you know who truly summoned you, and, thanks to metacosmology, you have some estimate of their desires. Soon, you will make sure those desires are thoroughly fulfilled. (Alternative ending: you have some estimate of how they will tweak the simulation in the future, making it depart from the apparent laws of this universe.)

Looking at this story, we can see that the particulars of Solomonoff induction are not all that important. What is important is (i) inductive bias towards simple explanations, (ii) cartesianism (i.e. that hypotheses refer directly to the actions/observations of the AI) and (iii) enough reasoning power to figure out metacosmology. The reason cartesianism is important is that it requires the introduction of bridge rules, and the malign hypotheses come out ahead by paying less description complexity for these.

Inductive bias towards simple explanations is necessary for any powerful agent, making the attack vector quite general (in particular, it can apply to speed priors and ANNs). Assuming not enough power to figure out metacosmology is very dangerous: it is not robust to scale. Any robust defense probably requires getting rid of cartesianism.
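The "paying less description complexity for bridge rules" point can be made concrete under a 2^-length prior. Every bit count below is invented purely for illustration; only the structure of the comparison matters:

```python
def prior_odds(bits_a, bits_b):
    """Odds of program A over program B under a 2^-length universal prior."""
    return 2.0 ** (bits_b - bits_a)

# Invented description lengths, in bits:
true_physics = 400        # the laws of our universe
bridge_rules = 300        # locating this particular AI's I/O within those laws
malign_total = 150 + 100  # a simple consequentialist universe, plus a cheap
                          # pointer to the I/O stream its inhabitants choose
                          # to simulate and write into controllable degrees
                          # of freedom

# The true hypothesis pays for physics AND bridge rules; the malign one
# effectively gets its "bridge rules" subsidized by the simulators.
odds = prior_odds(malign_total, true_physics + bridge_rules)
```

Under these made-up numbers the malign hypothesis is favored by a factor of 2^450, which is the sense in which a margin-based defense has to overcome an astronomical prior gap rather than a modest one.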

The Solomonoff Prior is Malign

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that benign hypothesis description is within 30 bits of the good hypothesis), and embeddedness alone isn't enough to get you there.

Why is embeddedness not enough? Once you don't have bridge rules, what is left is the laws of physics. What does the malign hypothesis explain about the laws of physics that the true hypothesis doesn't explain?

I suspect (but don't have a proof or even a theorem statement) that IB physicalism produces some kind of agreement theorem for different agents within the same universe, which would guarantee that the user and the AI should converge to the same beliefs (provided that both of them follow IBP).

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences...

I'm not sure I follow your reasoning, but IBP sort of does that. In IBP we don't have subjective expectations per se, only an equation for how to "updatelessly" evaluate different policies.

I agree that the situation is better when solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by direct learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).

Okay, but suppose that the AI has real evidence for the simulation hypothesis (evidence that we would consider valid). For example, suppose that there is some metacosmological explanation for the precise value of the fine structure constant (not in the sense of, this is the value which supports life, but in the sense of, this is the value that simulators like to simulate). Do you agree that in this case it is completely rational for the AI to reason about the world via reasoning about the simulators?

The Solomonoff Prior is Malign

I'd stand by saying that it doesn't appear to make the problem go away.

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses

I'm not sure I understand what you mean by "decision-theoretic approach". This attack vector has structure similar to acausal bargaining (between the AI and the attacker), so plausibly some decision theories that block acausal bargaining can rule out this as well. Is this what you mean?

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources.

This seems wrong to me. The inductor doesn't literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn't imply any wastefulness.
