Abram Demski

Probability is Real, and Value is Complex

As far as I can tell, this is the entire point. I don't see this 2D vector space actually being used in modeling agents, and I don't think Abram does either.

I largely agree. In retrospect, a large part of the point of this post for me is that it's practical to think of decision-theoretic agents as having expected value estimates for everything, without there being a utility function anywhere which those expected values are "expectations of".

A utility function is a gadget for turning probability distributions into expected values. This object makes sense in a context like VNM, where you are asking agents to judge between arbitrary gambles. In the Jeffrey-Bolker setting, you instead only ask agents to choose between *events*, not gambles. This allows us to directly derive coherence constraints on expectations without introducing a function they're expectations "of".
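As a minimal sketch of what a coherence constraint stated directly on expectations can look like: in Jeffrey's framework, the value of a union of two disjoint events is the probability-weighted average of their values. The event names and numbers below are hypothetical, chosen only for illustration; note that no utility function over world-descriptions appears anywhere.

```python
import math

# Hypothetical toy data: probabilities and expected values are assigned
# directly to events. There is no utility function over whole worlds.
P = {"rain": 0.3, "sun": 0.7, "rain_or_sun": 1.0}
V = {"rain": -1.0, "sun": 2.0, "rain_or_sun": 1.1}


def is_coherent(P, V, a, b, union, tol=1e-9):
    """Check the averaging constraint for disjoint events a and b.

    Requires P(union) = P(a) + P(b) and
    V(union) = (P(a) V(a) + P(b) V(b)) / (P(a) + P(b)).
    """
    additive = math.isclose(P[union], P[a] + P[b], abs_tol=tol)
    averaged = (P[a] * V[a] + P[b] * V[b]) / (P[a] + P[b])
    return additive and math.isclose(V[union], averaged, abs_tol=tol)


coherent = is_coherent(P, V, "rain", "sun", "rain_or_sun")

# An assignment that values the union above both of its parts violates the constraint.
V_bad = dict(V, rain_or_sun=3.0)
incoherent = is_coherent(P, V_bad, "rain", "sun", "rain_or_sun")
```

The check only ever compares events to other events, which is the sense in which the constraint lives on expectations rather than on a function they are expectations of.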

For me, this fits better with the way humans seem to think; it's relatively easy to compare events to each other, but nigh impossible to take entire world-descriptions and compare them (which is what a utility function does).

The rotation comes into play because looking at preferences this way is much more 'situated': you are only required to have preferences relating to your current beliefs, rather than relating to arbitrary probability distributions (as in VNM). We can intuit from our experience that there is some wiggle room between probability vs preference when representing situations in the real world. VNM doesn't model this, because probabilities are simply given to us in the VNM setting, and we're to take them as gospel truth.

So Jeffrey-Bolker seems to do a better job of representing the subjective nature of probability, and the vector rotations illustrate this.

On the other hand, I think there is a real advantage to the 2d vector representation of a preference structure. For agents with identical beliefs (the "common prior assumption"), Harsanyi showed that cooperative preference structures can be represented by simple linear mixtures (Harsanyi's utilitarian theorem). However, Critch showed that combining preferences in general is not so simple. You can't separately average two agents' beliefs and their utility functions; you have to dynamically change the weights of the utility-function averaging *based on* how Bayesian updates shift the weights of the probability mixture.

Averaging the vector-valued measures together works fine, though, I believe. (I haven't worked it out in detail.) If true, this makes vector-valued measures an easier way to think about coalitions of cooperating agents who merge preferences in order to select a Pareto-optimal joint policy.
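A small numeric check of one way to read this, under stated assumptions: each agent is represented by a vector-valued measure sending an event E to the pair (P(E), Q(E)), where Q(E) is the probability-weighted sum of that agent's utility over the worlds in E, and the merged agent is a fixed linear mixture of those pairs. The worlds, beliefs, utilities, and weights are all hypothetical. The sketch only verifies an algebraic identity (the mixture's value Q(E)/P(E) equals the average of the agents' conditional expected utilities under Bayes-updated weights); it does not establish the Pareto-optimality claim.

```python
import math

# Hypothetical two-agent example over three worlds.
beliefs = [
    {"w1": 0.5, "w2": 0.3, "w3": 0.2},  # agent 0
    {"w1": 0.1, "w2": 0.3, "w3": 0.6},  # agent 1
]
utils = [
    {"w1": 1.0, "w2": 0.0, "w3": 4.0},   # agent 0
    {"w1": 2.0, "w2": 5.0, "w3": -1.0},  # agent 1
]
weights = [0.5, 0.5]  # fixed mixture weights, never updated explicitly


def vector_measure(p, u, event):
    """Return (P(E), Q(E)), with Q(E) = sum of p(w) * u(w) over worlds in E."""
    return sum(p[w] for w in event), sum(p[w] * u[w] for w in event)


def merged_value(event):
    """Value of E under the linear mixture of vector measures: Q(E) / P(E)."""
    parts = [vector_measure(p, u, event) for p, u in zip(beliefs, utils)]
    prob = sum(w * pe for w, (pe, _) in zip(weights, parts))
    q = sum(w * qe for w, (_, qe) in zip(weights, parts))
    return q / prob


def posterior_weighted_value(event):
    """Average the agents' conditional expected utilities with Bayes-updated weights."""
    parts = [vector_measure(p, u, event) for p, u in zip(beliefs, utils)]
    evidence = sum(w * pe for w, (pe, _) in zip(weights, parts))
    posterior = [w * pe / evidence for w, (pe, _) in zip(weights, parts)]
    return sum(pw * (qe / pe) for pw, (pe, qe) in zip(posterior, parts))


def naive_value(event):
    """Separately average beliefs and utilities with fixed weights, then condition."""
    p = {w: sum(wt * b[w] for wt, b in zip(weights, beliefs)) for w in beliefs[0]}
    u = {w: sum(wt * x[w] for wt, x in zip(weights, utils)) for w in utils[0]}
    prob, q = vector_measure(p, u, event)
    return q / prob


event = ["w1", "w2"]
merged = merged_value(event)
reweighted = posterior_weighted_value(event)
naive = naive_value(event)
```

In this example the fixed linear mixture of vector measures agrees with the dynamically reweighted average, while separately averaging beliefs and utilities gives a different number.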

Formal Inner Alignment, Prospectus

Not to disagree hugely, but I have heard one religious conversion (an enlightenment type experience) described in a way that fits with "takeover without holding power over someone". Specifically this person described enlightenment in terms close to "I was ready to pack my things and leave. But the poison was already in me. My self died soon after that."

It's possible to get the general flow of the arguments another person would make, spontaneously produce those arguments later, and be convinced by them (or at least influenced).

Why Agent Foundations? An Overly Abstract Explanation

Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by "guessing true names". I think the approach makes sense, but my argument for why this is the case does differ from John's arguments here.

Why Agent Foundations? An Overly Abstract Explanation

You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So "zero" is best understood as "zero within some tolerance". So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently.

This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable tolerances, even if we do not yet know how to do such a thing in the face of the immense optimization pressure which superhuman AGI would bring to bear on a problem.

We need plans to have achievable tolerances. For example, we need to assume a realistic amount of hardware failure. We can't treat the hardware as a black box; we know how it operates, and we have to make use of that knowledge. But we can't pretend to have perfect mathematical knowledge of it, either; we have error tolerances.

So your blackbox/whitebox dichotomy doesn't fit the situation very well.

But do you really buy the whole analogy with mutual information, IE buy the claim that we can judge the viability of escaping Goodhart from this one example, and only object that the judgement with respect to this example was incorrect?

Perhaps we should really look at a range of examples, not just one? And judge John's point as reasonable if and only if we can find some cases where effectively perfect proxies were found?

Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we *do in fact need the analog of perfect zero correlation* in order to succeed. So John's plan sounds doomed to failure, because it relies on finding an actually-perfect proxy, when all realistic proxies are imprecise *at least* in their physical tolerances.

In which case, I would reply that the idea *is not* to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course. So there is hope that we will not end up in a situation where every tiny flaw is exploited. What we are looking for is plans which robustly get us to that point.

Why is pseudo-alignment "worse" than other ways ML can fail to generalize?

So, I think the other answers here are adequate, but not super satisfying. Here is my attempt.

The frame of "generalization failures" naturally primes me (and perhaps others) to think of ML as hunting for useful patterns, but instead fitting to noise. While pseudo-alignment is certainly a type of generalization failure, it has different connotations: that of a system which has "correctly learned" (in the sense of internalizing knowledge for its own use), but still does not perform as intended.

The mesa-optimizers paper defines inner optimizers as performing "search". I think there are some options here and we can define things slightly differently.

In Selection vs Control, I split "optimization" up into two types: "selection" (which includes search, and also weaker forms of selection such as mere sampling bias), and "control" (which implies actively steering the world in a direction, but doesn't always imply search, EG in the case of a thermostat).

In mesa-search vs mesa-control, I applied this distinction to mesa-optimization, arguing that mesa-optimizers which do not use search could still present a danger.

**Mesa-controllers** are a form of inner optimizer which reliably steers the world in a particular direction. These are distinguished from 'mere' generalization failure because generalization failures do not usually have such an impact. If we define pseudo-alignment in this way, you could say we are defining it by its impact. Clearly, more impactful generalization failures are more concerning. However, you might think it's a little weird to invent entirely new terminology for this case, rather than referring to it as "impactful generalization failures".

**Mesa-searchers** are a form of inner optimizer characterized by performing internal search. You could say that they're clearly computing something coherent, just not what was desired (which may not be the case for 'mere' generalization failures). These are more clearly a distinct phenomenon, particularly if the *intended* behavior *didn't* involve search. (It would seem odd to call them "searching generalization failures" imho.) But the safety concerns are less directly obvious.

It's only when we put these two together that we have something both distinct and of safety concern. Looking for "impactful generalization failures" gives us relatively little to grab onto. But it's particularly plausible that mesa-searchers will also be mesa-controllers, because the machinery for complex planning is present. So, this combination might be particularly worth thinking about.

we'll see this appear mathematically in the definition of the property or in theorems about it, whether or not we have explicitly considered the possibility of mesa-optimizers. (I suppose the argument could be that some candidate safety properties implicitly assume no optimum is a mesa-optimizer, and thus appear to apply to all optima while not really doing so -- somewhat analogous to early notions of continuity which implicitly assumed away the Weierstrass function. But if so, I need a real example of such a case to convince me.)

I tend to agree with this line of thinking. IE, it seems intuitive to me that highly robust alignment technology would rely on arguments that don't explicitly mention inner optimization anywhere, because those failure modes are ruled out via the same general arguments which rule out other failure modes. However, it also seems plausible to me that it's useful to think about inner alignment along the way.

You wanted a convincing example. I think The Solomonoff Prior Is Malign could be such an example. Before becoming aware of this argument, it seemed pretty plausible to me that the Solomonoff prior described a kind of rational ideal for induction. This isn't a full "here's a safety argument that would go through if we assumed no-inner-optimizers", but in a parallel universe where we have access to infinite computation, it could be close to that. (EG, someone could argue that the chance of Solomonoff Induction resulting in generalization failures is very low, and then change their mind when they hear the Solomonoff-is-malign argument.)

Also, it seems highly plausible to me that inner alignment is a useful thing to have in mind for "less than highly robust" alignment approaches (approaches which seek to grab the lower-hanging fruit of alignment research, to create systems that are aligned in the worlds where achieving alignment isn't so hard after all). These approaches can, for example, employ heuristics which make it somewhat unlikely that inner optimizers will emerge. I'm not very interested in that type of alignment research, because it seems to me that alignment technology needs to be backed up with rather tight arguments in order to have any realistic chance of working; but it makes sense for some people to think about that sort of thing.

ELK Computational Complexity: Three Levels of Difficulty

This definitely isn't well-defined, and this is the main way in which ELK itself is not well-defined and something I'd love to fix. That said, for now I feel like we can just focus on cases where the counterexamples *obviously* involve the model knowing things (according to this informal definition). Someday in the future we'll need to argue about complicated border cases, because our solutions work in every obvious case. But I think we'll have to make a lot of progress before we run into those problems (and I suspect that progress will mostly resolve the ambiguity).

Well, it might be that a proposed solution follows relatively easily from a proposed definition of knowledge, in some cases. That's the sort of solution I'm going after at the moment.

This still leaves the question of borderline cases, since the definition of knowledge may be imperfect. So it's not necessarily that I'm trying to solve the borderline cases.

We discuss the definition of "knowledge" a bit in this appendix;

Ah, yep, I missed that!

This isn't clear to me---we argue that direct translation can be arbitrarily complex, and that we need to solve ELK anyway, but we don't think the translator can be arbitrarily complex *relative to the predictor*. So we can still hope that jointly learning the (predictor, translator) is not much harder than learning the predictor alone.

Ahh, I see. I had 100% interpreted the computational complexity of the Reporter to be 'relative to the predictor' already. I'm not sure how else it could be interpreted, since the reporter is given the predictor's state as input, or at least given some form of query access.

What's the intended mathematical content of the statement "the direct translation can be arbitrarily complex", then?

Also, why don't you think the direct translator can be arbitrarily complex relative to the predictor?

> The answer (at least, as I see it) is *by arguing that this case is impossible*.

If we find a case that's impossible, I definitely want to try to refine the ELK problem statement, rather than implicitly narrowing the statement to something like "solve ELK in all the cases where it's possibly possible" (not sure if that's what you are suggesting here). And right now I don't know of any cases that seem impossible.

Yeah, sorry, poor wording on my part. What I meant in that part was "argue that the direct translator cannot be arbitrarily complex", although I immediately mention the case you're addressing here in the parenthetical right after what you quote.

In any case, what you say makes sense.

Job Offering: Help Communicate Infrabayesianism

Job applicants often can't start right away; I would encourage you to apply!

Job Offering: Help Communicate Infrabayesianism

Infradistributions are a generalization of sets of probability distributions. Sets of probability distributions are used in "imprecise bayesianism" to represent the idea that we haven't quite pinned down the probability distribution. The most common idea about *what to do* when you haven't quite pinned down the probability distribution is to *reason in a worst-case way* about what that probability distribution is. Infrabayesianism agrees with this idea.
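To make the worst-case idea concrete, here is a minimal sketch covering only the plain "set of probability distributions" picture, not infradistributions themselves, which are more general. The finite set of candidate distributions, the actions, and the payoffs are all hypothetical: the agent scores each action by its lowest expected payoff across the set and picks the action with the best such score.

```python
# Hypothetical set of candidate distributions over a coin flip: we have not
# pinned down the bias, only narrowed it to these possibilities.
credal_set = [
    {"heads": 0.4, "tails": 0.6},
    {"heads": 0.5, "tails": 0.5},
    {"heads": 0.7, "tails": 0.3},
]

# Hypothetical payoffs for each action in each outcome.
payoffs = {
    "bet_heads": {"heads": 1.0, "tails": -1.0},
    "bet_tails": {"heads": -1.0, "tails": 1.0},
    "abstain": {"heads": 0.0, "tails": 0.0},
}


def expected(p, payoff):
    """Expected payoff of an action under a single distribution p."""
    return sum(p[outcome] * payoff[outcome] for outcome in p)


def worst_case(payoff):
    """Lowest expected payoff across every distribution in the set."""
    return min(expected(p, payoff) for p in credal_set)


# Maximin choice: the action whose worst-case expected payoff is highest.
best_action = max(payoffs, key=lambda action: worst_case(payoffs[action]))
```

Here each bet loses in expectation under at least one candidate distribution, so the worst-case reasoner abstains; with a tighter set (say, only distributions favoring heads) betting on heads would win instead.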

One of the problems with imprecise bayesianism is that they haven't come up with a good update rule -- turns out it's much trickier than it looks. You can't just update all the distributions in the set, because [reasons i am forgetting]. Part of the reason infrabayes generalizes imprecise bayes is to fix this problem.

So you can think of an infradistribution mostly as a generalization of "sets of probability distributions" which has a good update rule, unlike "sets of probability distributions".

Why is this great?

Mainly because "sets of probability distributions" are actually a pretty great idea for decision theory. Regular Bayes has the "realizability" problem: in order to prove good loss bounds, you need to assume the prior is "realizable", which means that one of the hypotheses in the prior is true. For example, with Solomonoff, this amounts to assuming the universe is computable.

Using sets instead, *you don't need to have the correct hypothesis in your prior;* you only need to have an imprecise hypothesis which *includes* the correct hypothesis, and "few enough" other hypotheses that you get a reasonably tight bound on loss.

Unpacking that a little more: *if* the learnability condition is met, then

This allows us to get good guarantees against non-computable worlds, if they have *some* computable regularities. Generalizing imprecise probabilities to the point where there's a nice update rule was necessary to make this work.

There is currently no corresponding result for logical induction. (I think something might be possible, but there are some onerous obstacles in the way.)

ELK Thought Dump

Fair enough!

Up to here made sense.

After here I was lost. Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that?

You're saying a lot about what the "objects of study" are and aren't, but not very concretely, and I'm not getting the intuition for why this is important. I'm used to the idea that the points aren't really the objects of study in topology; the opens are the more central structure.

But the important question for a proposed modeling language is how well it models what we're after.

It seems like you are trying to do something similar to what cartesian frames and finite factored sets are doing, when they reconstruct time-like relationships from other (purportedly more basic) terms. Would you care to compare the reconstructions of time you're gesturing at to those provided by cartesian frames and/or finite factored sets?