Clarifying "AI Alignment"

As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is "doesn't manipulate the user" or something like that.) I'm not sure what 9, 11 and 13 are about. For the others, I'd say they're all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.

Hmm. I appreciate the effort, but I don't understand this answer. Maybe discussing this point further is not productive in this format.

I am not an expert but I expect that bridges are constructed so that they don't enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).

This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven't yet seen a bridge fall down from resonance, and so you don't think about it.

Yes, and in that perspective, the mathematical model can *tell* me about resonance. It's actually incredibly easy: resonance appears already in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about external forces magnitudes and spacetime spectrum, it automatically guarantees that resonance will not crash the bridge (as long as the assumptions are realistic). Obviously people have not been so cautious over history, but that doesn't mean we should be careless about AGI as well.

I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of *unrealistic* mathematical models, empirical data and informal reasoning. But I don't understand why should we give up so soon? We can work towards realistic mathematical models *and* prepare fallbacks, and even if we don't arrive at a realistic mathematical model it is likely that the effort will produce valuable insights.

Maybe I'm falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.

Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a "mathematical theory of how to write code".

First, if I am asked to check whether an element is in an array, or some other easy manipulation of data structures, I obviously don't *literally* start proving a theorem with pencil and paper. However, my not-fully-formal reasoning is such that I *could* prove a theorem if I wanted to. My model is not exactly "intuitive": I could explicitly explain every step. And, this is exactly how all of mathematics works! Mathematicians don't write proofs that are machine verifiable (some people do that today, but it's a novel and tiny fraction of mathematics). They write proofs that are good enough so that all the informal steps can be easily made formal by anyone with reasonable background in the field (but actually doing that would be very labor intensive).

Second, what I actually meant is examples like, I am using an algorithm to solve a system of linear equations, or find the maximal matching in a graph, or find a rotation matrix that minimizes the sum of square distances between two sets, because I have a *proof* that this algorithm works (or, in some cases, a proof that it at least produces the right answer when it converges). Moreover, this applies to problems that explicitly involve the physical world as well, such as Kalman filters or control loops.

Of course, in the latter case we need to make some assumptions about the physical world in order to prove anything. It's true that in applications the assumptions are often false, and we merely hope that they are good enough approximations. But, when the extra effort is justified, we can do better: we can perform a mathematical analysis of *how much* the violation of these assumptions affects the result. Then, we can use outside knowledge to verify that the violations are within the permissible margin.

Third, we could also literally prove machine-verifiable theorems about the code. This is called formal verification, and people do that sometimes when the stakes are high (as they definitely are with AGI), although in this case I have no personal experience. But, this is just a "side benefit" of what I was talking about. We need the mathematical theory to know that our algorithms are safe. Formal verification "merely" tells us that the implementation doesn't have bugs (which is something we should definitely worry about too, when it becomes relevant).

I'm curious what you think doesn't require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don't have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that's how I interpret the rocket alignment and security mindset posts.)

I'm not sure about the scope of your question? I made a sandwich this morning without building mathematical theory :) I think that the AI safety community definitely produced some important arguments about AI risk, and these arguments are valid evidence. But, I consider most of the big questions to be far from settled, and I don't see how they *could* be settled only with this kind of reasoning.

Instrumental Occam?

First, if we take PSRL as our model algorithm, then at any given time we follow a policy optimal for some hypothesis sampled out of the belief state. Since our prior favors simple hypotheses, the hypothesis we sampled is likely to be simple. But, given a hypothesis of description complexity , the corresponding optimal policy has description complexity , since the operation "find the optimal policy" has description complexity.

Taking computational resource bounds into account makes things more complicated. For some computing might be intractable, even though itself is "efficiently computable" in some sense. For example we can imagine an that has exponentially many states plus some succinct description of the transition kernel.

One way to deal with it is using some heuristic for optimization. But, then the description complexity is still .

Another way to deal with it is restricting ourselves to the kind of hypotheses for which *is* tractable, but allowing incomplete/fuzzy hypotheses, so that we can still deal with environments whose complete description falls outside this class. For example, this can take the form of looking for some small subset of features that has predictable behavior that can be exploited. In this approach, the description complexity is *probably* still something like , where this time is incomplete/fuzzy (but I don't yet know how PSRL for incomplete/fuzzy hypothesis should work).

Moreover, using incomplete models we can in some sense go in the other direction, from policy to model. This might be a good way to think of model-based RL. In actor-critic algorithms, our network learns a pair consisting of a value function and a policy . We can think of such a pair as an incomplete model that is defined by the Bellman inequality interpreted as a constraint on the transition kernel (or and the reward function ):

Assuming that our incomplete prior assigns weight to this incomplete hypothesis, we get a sort of Occam's razor for policies.

Clarifying "AI Alignment"

I'm claiming that intent alignment captures a large proportion of possible failure modes, that seem particularly amenable to a solution.

Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:

- Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.

- Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.

In this situation, machine A is a much better plan.

I am struggling to understand how does it work in practice. For example, consider dialogic RL. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:

- Basic question-answer protocol
- Natural language annotation
- Quantilization of questions
- Debate over annotations
- Dealing with no user answer
- Dealing with inconsistent user answers
- Dealing with changing user beliefs
- Dealing with changing user preferences
- Self-reference in user beliefs
- Quantilization of computations (to combat non-Cartesian daemons, this is not in the original proposal)
- Reverse questions
- Translation of counterfactuals from user frame to AI frame
- User beliefs about computations

EDIT: 14. Confidence threshold for risky actions

Which of these features are necessary for intent-alignment and which are only necessary for strong alignment? I can't tell.

I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t resonance, because in fact they are not safe and do fall when resonance occurs.

I am not an expert but I expect that bridges are constructed so that they don't enter high-amplitude resonance *in the relevant range of frequencies* (which is an example of using assumptions in our models that need independent validation). We want bridges that *don't* fall, don't we?

I don't build mathematical theories of how to write code, and usually don't prove my code correct

On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.

It also sounds like you're making a normative claim for proofs; I'm more interested in the empirical claim.

I am making both claims to some degree. I can imagine a universe in which the empirical claim is true, and I consider it plausible (but far from certain) that we live in such a universe. But, even just understanding whether we live in such a universe requires building a mathematical theory.

On the falsifiability of hypercomputation

The problem with constructive halting oracles is, they assume the ability to output an arbitrary natural number. But, realistic agents can observe only a finite number of bits per unit of time. Therefore, there is no way to directly observe a constructive halting oracle. We can consider a realization of a constructive halting oracle in which the oracle outputs a natural number one digit at a time. The problem is, since you don't know how long the number is, a candidate oracle might never stop producing digits. In particular, take any non-standard model of PA and consider an oracle that behaves accordingly. On some machines that don't halt, such an oracle will claim they do halt, but when asked for the time it will produce an infinite stream of digits. There is no way to distinguish such an oracle from the real thing (without assuming axioms beyond PA).

Moreover, this is a special case of a general theorem: if there is any computable procedure that asymptotically tests a complete hypothesis about the environment, then this hypothesis must describe a computable environment. This still allows for testable *incomplete* hypotheses that postulate hypercomputation in the environment, for example "this box is a halting oracle in *some* model of PA". But, there is a sense in which the hypothesis itself is computable.

(Btw I made this observation before)

Clarifying "AI Alignment"

Okay. What's the argument that the risk is great (I assume this means "very bad" and not "very likely" since by hypothesis it is unlikely), or that we need a lot of time to solve it?

The reasons the risk are great are standard arguments, so I am a little confused why you ask about this. The setup effectively allows a superintelligent malicious agent (Beta) access to our universe, which can result in extreme optimization of our universe towards inhuman values and tremendous loss of value-according-to-humans. The reason we need a lot of time to solve it is simply that (i) it doesn't seem to be an instance of some standard problem type which we have standard tools to solve and (ii) some people have been thinking on these questions for a while by now and did not come up with an easy solution.

It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.

Yup, I agree (with the caveat that it doesn't have to be a human's interpretation). Nonetheless, an interpretation of what the algorithm does can give you a lot of evidence about whether or not something is actually safe.

Then, I don't understand why you believe that work on anything other than intent-alignment is much less urgent?

The point is just that it seems very plausible that someone might design a theoretical model of the environment in which the bridge is safe, but that model neglects to include resonance because the designer didn't think of it.

"Resonance" is not something you need to explicitly include in your model, it is just a consequence of the equations of motion for an oscillator. This is actually an important lesson about why we need theory: to construct a useful theoretical model you don't need to know all possible failure modes, you only need a reasonable set of assumptions.

I think in practice our confidence in safety often comes from empirical tests.

I think that in practice our confidence in safety comes from a *combination* of theory and empirical tests. And, the higher the stakes and the more unusual the endeavor, the more theory you need. If you're doing something low stakes or something very similar to things that have been tried many times before, you can rely on trial and error. But if you're sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. Yes, you will test the modules on Earth in conditions as similar to the real environment as you can (respectively, you will do experiments with narrow AI). But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.

I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption.

Agreed, but if you want to eventually talk about neural nets so that you are talking about the AI system you are actually building, you need to use the neural net ontology, and then "human thinking for a day" is not something you can express.

I disagree. For example, suppose that we have a theorem saying that an ANN with particular architecture and learning algorithm can learn any function inside some space with given accuracy. And, suppose that "human thinking for a day" is represented by a mathematical function that we *assume* to be inside and that we *assume* to be "safe" in some formal sense (for example, it computes an action that doesn't lose much long-term value). Then, your model can *prove* that imitation learning applied to human thinking for a day is safe. Of course, this example is trivial (modulo the theorem about ANNs), but for more complex settings we can get results that are non-trivial.

Clarifying "AI Alignment"

For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely.

First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk. Second, I think this example shows that is far from straightforward to even informally define what intent-alignment is. Hence, I am skeptical about the usefulness of intent-alignment.

For a more "mundane" example, take IRL. Is IRL intent aligned? What if its assumptions about human behavior are inadequate and it ends up inferring an entirely wrong reward function? Is it still intent-aligned since it is trying to do what the user wants, it is just wrong about what the user wants? Where is the line between "being wrong about what the user wants" and optimizing something completely unrelated to what the user wants?

It seems like intent-alignment depends on our *interpretation* of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.

For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result.

My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive) or the simple and natural condition won't hold for the vast majority of cases that neural networks have been applied to today.

I don't know why you think so, but at least this is a good crux since it seems entirely falsifiable. In an any case, exponential sample complexity definitely doesn't count as "strong feasibility".

I don't find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).

Smoothness is just an example, it is not necessarily the final answer. But also, in classification problems smoothness usually translates to a *margin* requirement (the classes have to be separated with sufficient distance). So, in some sense smoothness allows for "if conditions" as long as you're not too sensitive to the threshold.

You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can't be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.

I don't understand this example. If the bridge can never collapse as long as the outside forces don't exceed K, then resonance is covered as well (as long as it is produced by forces below K). Maybe you meant that the outside forces are also assumed to be stationary.

The broader point is that when the environment has lots of complicated interaction effects, and you must make assumptions, it is very hard to find assumptions that actually hold.

Nevertheless most engineering projects make heavy use of theory. I don't understand why you think that AGI must be different?

The issue of assumptions in strong feasibility is equivalent to the question of, whether powerful agents require highly informed priors. If you need complex assumptions then effectively you have a highly informed prior, whereas if your prior is uninformed then it corresponds to simple assumptions. I think that Hanson (for example) believes that it is indeed necessary to have a highly informed prior, which is why powerful AI algorithms will be complex (since they have to encode this prior) and progress in AI will be slow (since the prior needs to be manually constructed brick by brick). I find this scenario unlikely (for example because humans successfully solve tasks far outside the ancestral environment, so they can't be relying on genetically built-in priors *that* much), but not ruled out.

However, I assumed that your position is not Hansonian: correct me if I'm wrong, but I assumed that you believed deep learning or something similar is likely to lead to AGI relatively soon. Even if not, you were skeptical about strong feasibility results even for deep learning, regardless of hypothetical future AI technology. But, it doesn't look like deep learning relies on highly informed priors. What we have is, relatively simple algorithms that can, with relatively small (or even no) adaptations solve problems in completely different domains (image processing, audio processing, NLP, playing many very different games, protein folding...) So, how is it possible that all of these domains have some highly complex property that they share, and that is somehow encoded in the deep learning algorithm?

It's really the assumptions that make me pessimistic, which is why it would be a significant update if I saw a mathematical definition of safety that I thought actually captured "safety" without "passing the buck"

I'm curious whether proving a *weakly* feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?

...but the theory-based approach requires you to limit your assumptions to things that can be written down in math (e.g. this function is K-Lipschitz) whereas a non-theory-based approach can use "handwavy" assumptions (e.g. a human thinking for a day is safe), which drastically opens up the space of options and makes it more likely that you can find an assumption that is actually mostly true.

I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption. In general, which assumptions are formalizable depends on the ontology of your mathematical model (that is, which real-world concepts correspond to the "atomic" ingredients of your model). The choice of ontology is part of drawing the line between what you want your mathematical theory to prove and what you want to bring in as outside assumptions. Like I said before, this line definitely has to be drawn somewhere, but it doesn't at all follow that the entire approach is useless.

Clarifying "AI Alignment"

...first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.

I completely agree with this, but isn't this also true of subjective regret bounds / definition-optimization?

The idea is, we will solve the alignment problem by (i) formulating a suitable learning protocol (ii) formalizing a set of assumptions about reality and (iii) proving that under these assumptions, this learning protocol has a reasonable subjective regret bound. So, the role of the subjective regret bound is making sure that the what we came up with in i+ii is *sufficient*, and also guiding the search there. The subjective regret bound does *not* tell us whether particular assumptions are *realistic*: for this we need to use common sense and knowledge outside of theoretical computer science (such as: physics, cognitive science, experimental ML research, evolutionary biology...)

Maybe your point is that there are failure modes that aren't covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places.

I disagree with the OP that (emphasis mine):

I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are

much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.

I think that intent alignment is too ill-defined, and to the extent it is well-defined it is a very weak condition, that is *not* sufficient to address the urgent core of the problem.

And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world.

I don't think strong feasibility results will have to talk about the environment, or rather, they will have to talk about it on a very high level of abstraction. For example, imagine that we prove that stochastic gradient descent on a neural network with particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result. We can then debate whether an using such a smooth approximation is sufficient for superhuman performance, but establishing this requires different tools, like I said above.

The way I imagine it, AGI theory should ultimately arrive at some class of priors that are on the one hand rich enough to deserve to be called "general" (or, practically speaking, rich enough to produce superhuman agents) and on the other hand narrow enough to allow for efficient algorithms. For example the Solomonoff prior is *too* rich, whereas a prior that (say) describes everything in terms of an MDP with a small number of states is too narrow. Finding the golden path in between is one of the big open problems.

That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world.

Does it? I am not sure why you have this impression. Certainly there are phenomena in the real world that we don't *yet* have enough theory to understand, and certainly a given theory will fail in domains where its assumptions are not justified (where "fail" and "justified" can be a manner of degree). And yet, theory obviously played and plays a central role in science, so I don't understand whence the fatalism.

However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability.

Idk, you could have a nondisclosure-by-default policy if you were worried about this. Maybe this can't work for you though.

That seems like it would be an extremely not cost-effective way of making progress. I would invest a lot of time and effort into something that would only be disclosed to the select few, for the sole purpose of convincing them of something (assuming they are even interested to understand it). I imagine that solving AI risk will require collaboration among many people, including sharing ideas and building on other people's ideas, and that's not realistic without publishing. Certainly I am not going to write a Friendly AI on my home laptop :)

Clarifying "AI Alignment"

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Why doesn't this also apply to subjective regret bounds?

In order to get a subjective regret bound you need to consider an appropriate prior. The way I expect it to work is, the prior guarantees that some actions are safe in the short-term: for example, doing nothing to the environment and asking only sufficiently quantilized queries from the user (see this for one toy model of how "safe in the short-term" can be formalized). Therefore, Beta cannot attack with a hypothesis that will force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.

Now, you can say "with the right prior intent-alignment also works". To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second *the assumptions about the prior are doing all the work*. Indeed, we can imagine that the ontology on which the prior is defined includes a "true reward" symbol s.t., *by definition*, the semantics is whatever the user truly wants. An agent that maximizes expected true reward then can be said to be intent-aligned. If it's doing something bad from the user's perspective, then it is just an "innocent" mistake. But, unless we bake some specific assumptions about the true reward into the prior, such an agent can be anything at all.

Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.

This is related to what I call the distinction between "weak" and "strong feasibility". Weak feasibility means algorithms that are polynomial time in the number of states and actions, or the number of hypotheses. Strong feasibility is supposed to be something like, polynomial time in the description length of the hypothesis.

It is true that currently we only have strong feasibility results for relatively simple hypothesis spaces (such as, support vector machines). But, this seems to me just a symptom of advances in heuristics outpacing the theory. I don't see any reason of principle that significantly limits the strong feasibility results we can expect. Indeed, we already have some advances in providing a theoretical basis for deep learning.

However, I specifically *don't* want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability. Instead, I prefer studying safety on the weak feasibility level until we understood everything important on this level, and only then trying to extend it to strong feasibility. This creates somewhat of a conundrum where apparently the one thing that can convince you (and other people?) is the thing I don't think should be done soon.

I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models.

Can you explain what you mean here? I agree that just saying "subjective regret bound" is not enough, we need to understand all the assumptions the prior should satisfy, reflecting considerations such as, what kind of queries can or cannot manipulate the user. Hence the use of quantilization and debate in Dialogic RL, for example.

Vanessa Kosoy's Shortform

There is some similarity, but there are also major differences. They don't even have the same type signature. The dangerousness bound is a *desideratum* that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic how to tweak Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but they will still be very different conditions.

The reason I pointed out the relation to corrigibility is not because I think that's the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that "if you run this AI, this won't make things worse than *not* running the AI", no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.

From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with *some* rebuttal. You can't (empirically and in the worst-case) prove a negative.

I decided that the answer deserves its own post.