Stuart Armstrong has claimed to beat Goodhart with Bayesian uncertainty -- rather than assuming some particular objective function (which you try to make as correct as possible), you represent some uncertainty. A similar claim was made in The Optimizer's Curse and How to Beat It, the essay which introduced a lot of us to ... well, not Goodhart's Law itself (the post doesn't make mention of Goodhart), but, that kind of failure. I myself claimed that Bayes beats regressional Goodhart, in Robust Delegation:

I now think this isn't true -- Bayes' Law doesn't beat Goodhart fully. It doesn't even beat regressional Goodhart fully. (I'll probably edit Robust Delegation to change the claim at some point.)

(Stuart makes some more detailed claims about AI and the nearest-unblocked-strategy problem which aren't exactly claims about Goodhart, at least according to him. **I don't fully understand Stuart's perspective, and don't claim to directly address it here.** I am mostly only addressing the question of the title of my post: does Bayes beat Goodhart?)

# If approximate solutions are concerning, why would mixtures of them be unconcerning?

My first argument is a loose intuition: Goodhartian phenomena suggest that somewhat-correct-but-not-quite-right proxy functions are not safe to optimize (and in some sense, the more optimization pressure is applied, the less safe we expect it to be). Assigning weights to a bunch of somewhat-but-not-quite-right possibilities just gets us another somewhat-but-not-quite-right possibility. Why would we expect this to fundamentally solve the problem?

- Perhaps the Bayesian mixture across hypotheses is *closer to being correct*, and therefore, gives us an approximation which is able to stand up to more optimization pressure before it breaks down. But this is a quantitative distinction, not a qualitative one. *How big* of a difference do we expect that to make? Wouldn't it still break down about as badly when put under tremendous optimization pressure?
- Perhaps the point of the Bayesian mixture is that, by quantifying uncertainty about the various hypotheses, it encourages strategies which hedge their bets -- satisfying a broad range of possible utility functions, by avoiding doing something terrible for one utility function in order to get a few more points for another. But this incentive to hedge bets is fairly weak; the optimization is still encouraged to do something really terrible for one function if it leads to a moderate increase for many other utility functions.

My intuition there doesn't address the gears of the situation adequately, though. Let's get into it.

# Overcoming regressional Goodhart requires calibrated learning.

In *Robust Delegation*, I defined regressional Goodhart through the predictable-disappointment idea. Does Bayesian reasoning eliminate predictable disappointment?

Well, it depends on what is meant by "predictable". You could define it as predictable-by-bayes, in which case it follows that Bayes solves the problem. However, I think it is reasonable to at least add a calibration requirement: there should be no way to systematically correct estimates up or down as a function of the expected value.

Calibration seems like it does, in fact, significantly address regressional Goodhart. You can't have seen a lot of instances of an estimate being too high, and still accept that too-high estimate. It doesn't address extremal Goodhart, because calibrated learning can only guarantee that you eventually calibrate, or converge at some rate, or something like that -- extreme values that you've rarely encountered would remain a concern.

(Stuart's "one-in-three" example in the Defeating Goodhart post, and his discussion of human overconfidence more generally, is somewhat suggestive of calibration.)

Bayesian methods are not always calibrated. Calibrated learning is not always Bayesian. (For example, logical induction has good calibration properties, and so far, hasn't gotten a really satisfying Bayesian treatment.)

This might be confusing if you're used to thinking in Bayesian terms. If you think in terms of the diagram I copied from *Robust Delegation*, above: you have a prior which stipulates the probability of the true utility given the observation; your expectation is the expected value of the true utility for a particular value of the observation; that expectation is not predictably correctable with respect to your prior. What's the problem?

The problem is that this line of reasoning assumes that your prior is *objectively correct*. This doesn't generally make sense (especially from a Bayesian perspective). So, it is perfectly consistent for you to collect many observations, and see that your expectation has some systematic bias. This may remain true *even as you update on those observations* (because Bayesian learning doesn't guarantee any calibration property in general!).
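As a rough illustration of how a misspecified prior leaves a predictable gap, here is a small simulation sketch. Everything in it is an assumption made for illustration: the Gaussian model, the function name `mean_disappointment`, and the parameter values. It also does not model learning over time; it only compares an agent whose fixed prior matches the world with one whose prior is too broad. Each agent picks the option with the highest posterior-mean estimate, and we record how much that estimate exceeds the true value on average.

```python
import random

def mean_disappointment(prior_sd, true_sd=1.0, noise_sd=1.0,
                        n_options=20, n_trials=2000, seed=0):
    """Average of (estimate - true value) for the option the agent selects.

    True values are drawn from N(0, true_sd^2); the agent sees a noisy proxy
    and shrinks it toward 0 using its own (possibly wrong) prior_sd.
    """
    rng = random.Random(seed)
    # Posterior-mean shrinkage factor for a Gaussian prior and Gaussian noise.
    shrink = prior_sd ** 2 / (prior_sd ** 2 + noise_sd ** 2)
    total = 0.0
    for _ in range(n_trials):
        best_estimate, best_value = float("-inf"), 0.0
        for _ in range(n_options):
            value = rng.gauss(0.0, true_sd)            # true utility
            proxy = value + rng.gauss(0.0, noise_sd)   # noisy observation
            estimate = shrink * proxy                  # agent's expectation
            if estimate > best_estimate:
                best_estimate, best_value = estimate, value
        total += best_estimate - best_value
    return total / n_trials

# Prior matches the world: the gap averages out to roughly zero.
matched_gap = mean_disappointment(prior_sd=1.0)
# Prior is too broad: the chosen option is systematically overestimated.
overbroad_gap = mean_disappointment(prior_sd=3.0)
print(matched_gap, overbroad_gap)
```

In this toy setup the overbroad agent's estimates could be improved by a simple downward correction as a function of the estimate, which is exactly what a calibration requirement would rule out.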

The faulty assumption that your probability distribution is correct is often replaced with the (weaker, but still problematic) assumption that at least one hypothesis within your distribution is objectively correct -- the realizability assumption.

# Bayesian solutions assume realizability.

As discussed in Embedded World Models, the realizability assumption is the assumption that (at least) one of your hypotheses represents the true state of affairs. Bayesian methods often (though not always) require a realizability assumption in order to get strong guarantees. Frequentist methods rarely require such an assumption (whatever else you may say about frequentist methods). Calibration is an example of that -- a Bayesian can get calibration under the assumption of realizability, but, we might want a stronger guarantee of calibration which holds even in absence of realizability.

## "We quantified our uncertainty as best we could!"

One possible bayes-beats-goodhart argument is: "Once we quantify our uncertainty with a probability distribution over possible utility functions, the best we can possibly do is to choose whatever maximizes expected value. Anything else is decision-theoretically sub-optimal."

Do you think that the true utility function is really sampled from the given distribution, in some objective sense? And the probability distribution also quantifies all the things which can count as evidence? If so, fine. Maximizing expectation is the objectively best strategy. This eliminates all types of Goodhart by positing that we've already modeled the possibilities sufficiently well: extremal cases are modeled correctly; adversarial effects are already accounted for; etc.

However, this is unrealistic due to embeddedness: the outside world is much more complicated than any probability distribution which we can explicitly use, since we are ourselves a small part of that world.

Alternatively, do you think the probability distribution really codifies your precise subjective uncertainty? Ok, sure, that would also justify the argument.

Realistically, though, an implementation of this isn't going to be representing your precise subjective beliefs (to the extent you even *have* precise subjective beliefs). It has to hope to have a prior which is "good enough".

In what sense might it be "good enough"?

An obvious problem is that a distribution might be overconfident in a wrong conclusion, which will obviously be bad. The fix for this appears to be: make sure that the distribution is "sufficiently broad", expressing a fairly high amount of uncertainty. But, why would this be good?

Well, one might argue: it can only be worse than our true uncertainty to the extent that it ends up assigning too little weight to the correct option. So, if the probability function isn't too small for any of the possibilities which we intuitively assign non-negligible weight, things should be fine.

## "The True Utility Function Has Enough Weight"

First, even assuming the framing of "true utility function" makes sense, it isn't obvious to me that the argument makes sense.

If there's a true utility function which is assigned some weight, and we apply a whole lot of optimization pressure to the overall mixture distribution, then it is perfectly possible that the true function gets compromised for the sake of satisfying a large number of the other candidate functions. The weight determines a *ratio at which trade-offs can occur*, not a *ratio of the overall resources which we will get* or anything like that.

A first-pass analysis is that the weight has to be more than 1/2 to guarantee any consideration; any weight less than that, and it's possible that the true function's score is *as low as it can go* in the optimized solution, because some outcome was sufficiently good for all other potential utility functions that it made sense to "take the hit" with respect to the true one. We can't formally say "this probably won't happen, because the odds that the best-looking option is specifically terrible for the true function are low" without assuming something about the distribution of highly optimized solutions.
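One way to make the 1/2 threshold concrete is the following sketch. It rests on assumptions of mine, not stated in the post: every utility function is normalized to take values in $[0,1]$; the true function $u^*$ has weight $w$ and reaches the value 1 at some outcome $x^*$; the other candidates $u_i$ have weights $w_i$ summing to $1-w$; and $\hat{x}$ is whatever outcome maximizes the mixture $M$.

$$M(x) = w\,u^*(x) + \sum_i w_i\,u_i(x)$$

$$w \;\le\; M(x^*) \;\le\; M(\hat{x}) \;\le\; w\,u^*(\hat{x}) + (1-w)$$

$$\Longrightarrow\quad u^*(\hat{x}) \;\ge\; \frac{2w-1}{w}$$

The lower bound is positive only when $w > 1/2$. When $w \le 1/2$, an outcome with $u^* = 0$ and every $u_i = 1$ scores $1-w \ge w$, so under these assumptions nothing rules out the mixture optimum being as bad as possible for $u^*$.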

(Such an analysis might be interesting; I don't know if anyone has investigated from that angle. But, it seems somewhat unlikely to do us good, since it doesn't seem like we can make very nice assumptions about what highly-optimized solutions look like.)

In reality, the worst-case analysis is better than this, because many of the more-plausible candidates should have a lot of "overlap" with the true function; after all, they were given high weight because they *appeared plausible* somehow (they agreed with human intuitions, or predicted human behavior, etc). We could try to formally define "overlap" and see what assumptions we need to guarantee better-than-worst-case outcomes. (This might have some interesting learning-theoretic implications for value learning, even.)

However, this whole framing, where we assume that there's a true utility function and think about its weight, is suspect. Why should we think that there's a "true" utility function which captures our preferences? And, if there is, why should we assume that it has an explicit representation in the hypothesis space?

If we drop this assumption, we get the classical problems associated with non-realizability in Bayesian learning. Beliefs may not converge at all, as evidence accumulates; they could keep oscillating due to inconsistent evidence. Under the interpretation where we still assume a "true" utility function but we don't assume that it is explicitly representable within the hypothesis space, there isn't a clear guarantee we can get (although perhaps the "overlap" analysis can help here). If we don't assume a true utility function at all, then it isn't clear how to even ask questions about how well we do (although I'm not saying there isn't a useful analysis -- I'm just saying that it is unclear to me right now).

Stuart does address this question, in the end:

> I've argued that an indescribable hellworld cannot exist. There's a similar question as to whether there exists human uncertainty about U that cannot be included in the AI's model of Δ. By definition, this uncertainty would be something that is currently unknown and unimaginable to us. However, I feel that it's far more likely to exist, than the indescribable hellworld.
>
> Still despite that issue, it seems to me that there are methods of dealing with the Goodhart problem/nearest unblocked strategy problem. And this involves properly accounting for all our uncertainty, directly or indirectly. If we do this well, there no longer remains a Goodhart problem at all.

Perhaps I agree, if "properly accounting for all our uncertainty" includes robustness properties such as calibrated learning, *and* if we restrict our attention to regressional Goodhart, ignoring the other three.

Well... what about the others, then?

# Overcoming adversarial Goodhart seems to require randomization.

The argument here is pretty simple: adversarial Goodhart enters into the domain of game theory, in which mixed strategies tend to be very useful. Quantilization is one such mixed strategy, which seems to usefully address Goodhart to a certain extent. I'm not saying that quantilization is the ultimate solution here. But, it does seem to me like quantilization is significant enough that a solution to Goodhart should say something about the class of problems which quantilization solves.

In particular, a property of quantilization which I find appealing is the way more certainty about the utility function implies that more optimization power can be safely applied to making decisions. This informs my intuition that applying arbitrarily high optimization power does not become safe simply because you've explicitly represented uncertainty about utility functions -- no matter how accurately, short of "perfectly accurately" (which isn't even a meaningful concept), it only seems to justify a limited amount of optimization pressure. This story may be an incorrect one, but if so, I'd like to really understand why it is incorrect.
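For readers who haven't met quantilization, here is a minimal sketch of the idea as I understand it: instead of taking the single best-scoring action, sample from the top fraction `q` of actions drawn from some trusted base distribution. The function name `quantilize`, its signature, and the toy action set are illustrative assumptions, not a reference implementation.

```python
import math
import random

def quantilize(actions, proxy_utility, q, rng):
    """Pick uniformly at random from the top q fraction of `actions`.

    `actions` is assumed to be a sample from a trusted base distribution;
    `proxy_utility` is the (possibly flawed) utility estimate.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))  # size of the top-q slice
    return rng.choice(ranked[:k])

rng = random.Random(0)
base_sample = list(range(100))  # toy actions; here the proxy is the identity
picks = [quantilize(base_sample, lambda a: a, 0.25, rng) for _ in range(200)]
near_max = quantilize(base_sample, lambda a: a, 0.001, rng)
print(min(picks), max(picks), near_max)
```

Shrinking `q` toward 0 recovers plain maximization, so `q` acts as a dial for how much optimization pressure is applied; more trust in the proxy would justify a smaller `q`.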

Unlike the previous sections, this doesn't necessarily step outside of typical Bayesian thought, since this kind of game-theoretic thinking is more or less within the purview of Bayesianism. However, the simple "Bayes solves Goodhart" story doesn't explicitly address this.

*(I haven't addressed causal Goodhart anywhere in this essay, since it opens up the whole decision-theoretic can of worms, which seems somewhat beside the main point. (I suppose, arguably, game-theoretic concerns could be beside the point as well -- but, they feel more directly relevant to me, since quantilization is fairly directly about solving Goodhart.))*

# In summary:

- If optimizing an arbitrary somewhat-but-not-perfectly-right utility function gives rise to serious Goodhart-related concerns, then why does a mixture distribution over such functions alleviate such concerns? Aren't they just averaging together to yield yet another somewhat-but-not-quite-right function?
- Regressional Goodhart seems better-addressed by calibrated learning than it does by Bayesian learning.
- Bayesian learning tends to require a realizability assumption in order to have good properties (including calibration).
- Even assuming realizability, heavily optimizing a mixture distribution over possible utility functions seems dicey -- it can end up throwing away all the real value if it finds a way to jointly satisfy a lot of the wrong ones. (It is possible that we can find reasonable assumptions under which this doesn't happen, however.)
- Overcoming adversarial Goodhart seems to require mixed strategies, which the simple "bayesian uncertainty" story doesn't explicitly address.

Thanks for this post! Good insights that refined my arguments.

I'll present three points:

I think we agree that Goodhart can be ameliorated by adding this extra information/uncertainty; it's not clear whether it can be completely resolved.

My current intuition is that thinking in terms of non-realizable epistemology will give a more robust construction process, *even though* the constructive way of thinking justifies a kind of realizability assumption. This is partly because it allows us to do without the massive-enough set of hypotheses (which one may have to do without in practice), but also because it seems closer to the reality of "humans don't really have a utility function, not exactly".

However, I think I haven't sufficiently internalized your point about utility being defined by a constructive process, so my opinion on that may change as I think about it more.

Concerning #3: yeah, I'm currently thinking that you need to make some more assumptions. But, I'm not sure I want to make assumptions about resources. I think there may be useful assumptions related to the way the hypotheses are learned -- IE, we expect hypotheses with nontrivial weight to have a lot of agreement because they are candidate generalizations of the same data, which makes it somewhat hard to entirely dissatisfy some while satisfying others. This doesn't seem quite helpful enough, but, perhaps something in that direction.

In any case, I agree that it seems interesting to explore assumptions about the mutual satisfiability of different value functions.

"resources" is more of shorthand for "the best utility function looks like a smoothmin of a subset of the different features. Given that assumption, the best fuzzy approximation looks like a smoothmin of all the features, with different weights".

By the way I just want to note that expected value isn't the only option available for aggregating utility functions. There's also stuff like Bostrom's parliament idea. I expect there are many opportunities for cross fertilization between AI safety and philosophical work on moral uncertainty.

Indeed we don't want such linear behavior. The AI should preserve the potential for maximization of any candidate utility function - first so it has time to acquire all the environment's evidence about the utility function, and then for the hypothetical future scenario of us deciding to shut it off.

See this comment. Stuart and I are discussing what happens after things have converged as much as they're going to, but there's still uncertainty left.

See my much shorter and less developed note to a similar effect: https://www.lesswrong.com/posts/QJwnPRBBvgaeFeiLR/uncertainty-versus-fuzziness-versus-extrapolation-desiderata#kZmpMGYGfwGKQwfZs - and I agree that regressional and extremal goodhart cannot be fixed purely with his solution.

I will, however, defend some of Stuart's suggestions as they relate to causal Goodhart in a non-adversarial setting. (I'm also avoiding the can of worms of game theory.) In that case, both randomization AND mixtures of multiple metrics can address Goodhart-like failures, albeit in different ways. I had been thinking about this in the context of policy - https://mpra.ub.uni-muenchen.de/90649/ - rather than AI alignment, but some of the arguments still apply. (One critical argument that doesn't fully apply is that "good enough" mitigation raises the cognitive costs of cheating to a point where aligning with the true goal is cheaper. I also noted in the paper that satisficing is useful for limiting the misalignment from metrics, and quantilization seems like one promising approach for satisficing for AGI.)

The argument for causal Goodhart is that randomization and mixed utilities are both effective in mitigating causal structure errors that lead to causal Goodhart in the one-party case. That's because the failure occurs when uncertainty or mistakes about causal structure lead to a choice of metrics that are correlated with the goal, rather than causal of the goal. However, if even some significant fraction or probability of the metric is causally connected to the goal in ways that cannot be gamed, it can greatly mitigate this class of failure.

To more clearly apply this logic to human utility, if we accidentally think that endorphins in the brain are 100% of human goals, AGI might want to tile the universe with rats on happy drugs, or the moral equivalent. If we assign this only 50% weight, or have a 50% probability that it will be the scored outcome, and we define something that requires a different way of creating what we actually think of as happiness / life satisfaction, it does not just shift the optimum from 50% of the universe tiled with rat brains. This is because the alternative class of hedonium will involve a non-trivial amount of endorphins as well; as long as other solutions have anywhere close to as much endorphins, they will be preferred. (In this case, admittedly, we got the endorphin goal so wrong that 50% of the universe tiled in rats on drugs is likely - bad enough utility functions can't be fixed with either randomization or weighting. But if a causal mistake can be fixed with either a probabilistic or a weighting solution, it seems likely it can be fixed with the other.)

If there's 50% on a paperclips-maximizing utility function and 50% on staples, there's not really any optimization pressure put toward satisfying both.

It *could* be that there's a sorta-paperclip-sorta-staple (let's say 'stapleclip' for short), which the AGI will be motivated to find in order to get a moderately high rating according to both strategies. *However,* it could be that trying to be both paperclip and staple at the same time reduces the overall efficiency. Maybe the most efficient nanometer-scale stapleclip is significantly larger than the most efficient paperclip or staple, as a result of having to represent the critical features of both paperclips and staples. In this case, the AGI will prefer to gamble, tiling the universe with whatever is most efficient, and giving no consideration at all to the other hypothesis.

That's the essence of my concern: uncertainty between possibilities does not particularly push toward jointly maximizing the possibilities. At least, not without further assumptions.
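The stapleclip argument can be checked with a toy calculation. In this sketch the scoring scheme is my own simplification: a universe tiled with the "right" object scores 1 under the matching hypothesis and 0 under the other, while a universe of stapleclips scores `stapleclip_efficiency` under both. The function name `best_plan` and the plan labels are hypothetical.

```python
def best_plan(stapleclip_efficiency, p_clip=0.5):
    """Return the plan with the highest expected utility under the mixture."""
    # Each plan maps to (score if paperclips are right, score if staples are right).
    plans = {
        "all paperclips": (1.0, 0.0),
        "all staples": (0.0, 1.0),
        "all stapleclips": (stapleclip_efficiency, stapleclip_efficiency),
    }

    def expected(scores):
        return p_clip * scores[0] + (1 - p_clip) * scores[1]

    return max(plans, key=lambda name: expected(plans[name]))

efficient_case = best_plan(0.8)    # the compromise object is cheap to make
inefficient_case = best_plan(0.4)  # the compromise object wastes resources
print(efficient_case, inefficient_case)
```

With a 50/50 split, the compromise only wins when it retains more than half the efficiency of a pure bet; below that, the expected-utility maximizer gambles on one hypothesis and gives the other nothing.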

One thing I’ve been thinking about recently is: why does this happen? Could we have predicted the general phenomenon in advance, without imagining individual scenarios? What aspect of the structure of optimal goal pursuit in an environment reliably produces this result?

Why is this important? If the thing with the highest score is always the best action to take, why does it matter if that score is an overestimate? Utility functions are fictional anyway right?

If I understand correctly, extremal Goodhart is essentially the same as distributional shift from the Concrete Problems in AI Safety paper.

In any case... I'm not exactly sure what you mean by "calibration", but when I say "calibration", I refer to "knowing what you know". For example, when I took this online quiz, it told me that when I said I was extremely confident something was true, I was always right, and when I said I was a little confident something was true, I was only right 66% of the time. I take this as an indicator that I'm reasonably "well-calibrated"; that is, I have a sense of what I do and don't know.

A calibrated AI system, to me, is one that correctly says "this thing I'm looking at is an unusual thing I've never encountered before, therefore my 95% credible intervals related to it are very wide, and the value of clarifying information from my overseer is very high".

Your complaints about Bayesian machine learning seem correct. My view is that addressing these complaints & making some sort of calibrated learning method competitive with deep learning is the best way to achieve FAI. I haven't yet seen an FAI problem which seems like it can't somehow be reduced to calibrated learning.

I'm not super hung up on statistical guarantees, as I haven't yet seen a way to make them in general which doesn't require making some sort of unreasonable or impractical assumption about the world (and I'm skeptical such a method exists). The way I see it, if your system is capable of self-improving in the right way, it should be able to overcome deficiencies in its world-modeling capabilities for itself. In my view, the goal is to build a system which gets safer as it self-improves & becomes better at reasoning.

If our AI system assigns high subjective credence to a large variety of utility functions, then the value of information which helps narrow things down is high.

To oversimplify my preferred approach: The initial prior acts as a sort of net which should have the true utility function in it somewhere. Clarifying questions to the overseer let the AI pull this net tight around a much smaller set of possible utility functions. It does this until the remaining utility functions can't easily be distinguished through clarifying questions, and/or the remaining utility functions all say to do the same thing in scenarios of near-term interest. If we find ourselves in some unusual unanticipated situation, the utility functions will likely disagree on what to do, and then the clarifying questions start again.

Technically, you don't need this assumption. As I wrote in this comment: "it's not necessary for our actual preferences to be among the ensemble of models if for any veto that our actual preferences would make, there's *some* model in the ensemble that also makes that veto."

(I haven't read a lot about quantilization so I can't say much about that. However, a superintelligent adversary seems like something to avoid.)

I agree that this general picture seems to make sense, but, it does not alleviate the concerns which you are responding to. To reiterate: if there are serious Goodhart-shaped concerns about mostly-correct-but-somewhat-wrong utility functions breaking under optimization pressure, then why do those concerns go away for mixture distributions?

I agree that the uncertainty will cause the AI to investigate, but at some point there will be diminishing returns to investigation; the remaining hypotheses might be utility functions which can't be differentiated by the type of evidence which the AI is able to gather. At that point, the AI will then put a lot of optimization pressure on the mixture distribution which remains. Then, what is the argument that things go well? Won't this run into siren worlds and so on, by default?

Yeah, it seems possible and interesting to formalize an argument like that.

The "adversary" can be something like a mesa-optimizer arising from a search which the system runs in order to solve a problem. If you've got rich enough of a hypothesis space (due to using a rich hypothesis space of world-models, or a rich set of possible human utility functions, etc etc), then you'll have some of those lurking in the hypothesis space. Reasoning in an appropriate way about the possibility, even if you manage to avoid mesa-optimizers in reality, could require game-theoretic reasoning.

OTOH, although quantilization can be justified by a story involving an actual adversary, that's not necessarily the best way to think about what it is really doing. Robustness properties tend to involve some kind of universal quantifier over a bunch of possibilities. Maintaining a property under such a universal quantification is like adversarial game theory; you're trying to do well no matter what strategy the other player uses. So, robustness properties tend to be conveniently described in adversarial terms. That's basically what's going on in the case of quantilization.

Similarly, "adversarial Goodhart" doesn't have to be about superintelligent adversaries, in general. It can be about cases where we want stronger guarantees, and so, are willing to compromise some decision-theoretic optimality in return for better worst-case guarantees.

The siren world scenario posits an AI that is "actually evil" and is an agent which makes plans to manipulate the user.

If the AI assigns decent credence to a utility function that assigns massive negative utility to "evil and unmitigated suffering", that will cause its subjective expected utility estimate of the siren world to take a big hit. It would be better off implementing the exact same world, minus the evil and unmitigated suffering. The only way it would think that world was actually better with the evil and unmitigated suffering in it is if something went very wrong during the data-gathering process.

I also don't think we should create an agent which makes plans to manipulate the user. The only question it should ever ask the user is the one that maximizes its subjective value of information.

The marketing world problem is very related to the discussion I had with Paul Christiano here. The problem is that the overseer has insufficient time to reflect on their true values. I don't think there is any way of getting around this issue in general: Creating FAI is time-sensitive, which means we won't have enough time to reflect on our true values to be 100% sure that all the input we give the AI is good. In addition to the things I mentioned in that discussion, I think we should:

Make a system that's capable of changing its values "online" in response to our input. Corrigibility lets us procrastinate on moral philosophy.

Instead of trying to build eutopia right off the bat, build an "optimal ivory tower" for doing moral philosophy in. Essentially, implement coherent extrapolated volition in the real world.

Anyway, the reason the Goodhart-shaped concerns go away is because the thing that maximizes the mixture is likely to be something that is approved of by a diverse range of utility functions that are all semi-compatible with the input the user has provided. If there's even a single plausible utility function which strongly disapproves, the value of information of requesting clarification from the overseer regarding that particular plan is high. For a worked example, see "Smile maximization case study" in this essay.

As I said, I think Goodhart's law is largely about distributional shift. My scheme incentivizes the AI to mostly take "on-distribution" plans: plans it is confident are good, because many different ways of looking at the data all point to them being good. "Off-distribution" plans will tend to benefit from clarification first: Some ways of extrapolating the data say they are good, others say they are bad, so VoI is high.

Thanks for bringing this up, I'll think about it. Part of me wants to say "if the AI has wrung all the information it possibly can from the user, and it is well-calibrated [in the sense I defined the term above], then it should just maximize its subjective expected utility at that point, because maximizing expected utility is just what you do!" Or: "If the overseer isn't capable of evaluating plans anymore because they are too complex, maybe it is time for the AI to help the overseer upgrade their intelligence!" But maybe there's an elegant way to implement a more conservative design. (You could, for example, disallow the execution of any plan that the AI thought there was at least a 5% chance was below some utility threshold. But that involves the use of two arbitrary parameters, which seems inelegant.)
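The conservative rule floated in the parenthetical above can be sketched as a filter applied before expected-utility maximization. All names here (`allowed_plans`, `conservative_choice`, `threshold`, `max_risk`) and the toy numbers are assumptions for illustration; the two arbitrary parameters mentioned in the text correspond to `max_risk` and `threshold`.

```python
def expected_utility(scores, weights):
    """Mixture score: credence-weighted sum of each hypothesis's utility."""
    return sum(w * s for w, s in zip(weights, scores))

def allowed_plans(plans, weights, threshold, max_risk=0.05):
    """Drop any plan with at least `max_risk` credence of scoring below `threshold`."""
    kept = {}
    for name, scores in plans.items():
        risk = sum(w for w, s in zip(weights, scores) if s < threshold)
        if risk < max_risk:
            kept[name] = scores
    return kept

def conservative_choice(plans, weights, threshold, max_risk=0.05):
    """Maximize expected utility among the plans that survive the filter."""
    kept = allowed_plans(plans, weights, threshold, max_risk)
    if not kept:
        return None  # nothing is safe enough; a real system might ask the overseer
    return max(kept, key=lambda name: expected_utility(kept[name], weights))

weights = (0.1, 0.45, 0.45)                  # credence in three hypotheses
plans = {
    "weird corner case": (0.0, 1.0, 1.0),    # great for two, terrible for one
    "ordinary plan": (0.6, 0.6, 0.6),        # decent according to all three
}
unfiltered = max(plans, key=lambda n: expected_utility(plans[n], weights))
filtered = conservative_choice(plans, weights, threshold=0.3)
print(unfiltered, filtered)
```

In this toy case plain maximization picks the corner case, while the filtered version falls back to the plan that every hypothesis finds acceptable.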

I am a little frustrated with your reply (particularly the first half), but I'm not sure if you're really missing my point (perhaps I'll have to think of a different way of explaining it) vs addressing it, but not giving me enough of an argument for me to connect the dots. I'll have to think more about some of your points.

Many of your statements seem true for moderately-intelligent systems of the sort you describe, but, don't clearly hold up when a lot of optimization pressure is applied.

The VOI incentive can't be so strong that the AI is willing to pay arbitrarily high costs (commit the resources of the whole galaxy to investigating ever-finer details of human preferences, deconstruct each human atom by atom, etc...). So, at some point, it can be worthwhile to entirely compromise one somewhat-plausible $u_i$ for the sake of others.

This would be untrue if, for example, the system maximized the *weighted product* (the weight $w_i$ is used as an exponent of the hypothesis $u_i$). It would then *actually never* be worth it to *entirely* zero out one possible utility function for the sake of optimizing others. That proposal likely has its own issues, but I mention it just to make clear that I'm not bemoaning an inevitable fact of decision theory -- there *are* alternatives.

This is one of the assertions which seems generally true of moderately intelligent systems optimizing under value uncertainty, but doesn't seem to hold up as a lot of optimization pressure is applied.
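To see the difference between the two aggregation rules, here is a small sketch comparing a weighted sum with a weighted product. It assumes all utilities are non-negative (a product with exponents is not well behaved otherwise), and the plan names and numbers are invented for illustration.

```python
import math

def weighted_sum(scores, weights):
    """Expected utility: sum of w_i * u_i."""
    return sum(w * s for w, s in zip(weights, scores))

def weighted_product(scores, weights):
    """Product of u_i ** w_i; zero as soon as any single u_i is zero."""
    return math.prod(s ** w for w, s in zip(weights, scores))

weights = (0.2, 0.4, 0.4)
plans = {
    "sacrifice one": (0.0, 1.0, 1.0),   # zeroes out the first hypothesis
    "compromise": (0.6, 0.6, 0.6),      # moderately good for all three
}

sum_choice = max(plans, key=lambda n: weighted_sum(plans[n], weights))
product_choice = max(plans, key=lambda n: weighted_product(plans[n], weights))
print(sum_choice, product_choice)
```

The sum happily trades the first hypothesis away for gains elsewhere, while the product never prefers a plan that gives any hypothesis exactly zero over one that gives all of them something.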

*Good* plans will tend to be on-distribution, because that's a good way to reap the gains of many different remaining hypotheses which agree for on-distribution things but disagree elsewhere. Why would the *best* plans tend to be on-distribution? Why wouldn't they find weird corner cases where many of the hypotheses give extremely high scores not normally achievable?

Yeah, that's the direction I'm thinking in. By the way -- I'm not even trying to say that maximizing subjective expected utility is actually the wrong thing to do (particularly if you've got calibration properties, or knows-what-it-knows properties, or some other learning-theoretic properties which we haven't realized we want yet). I'm just saying that the case is not clear, and it seems like we'd want the case to be clear.

Why would a system of more-than-moderate intelligence find such incorrect hypotheses to be the most plausible ones? There would have to be some reason why all the hypotheses which strongly *disliked* this corner case were ruled out.

I know I'm being a little fuzzy about realizability. Let's consider how humans solve these problems. Suppose you had a pet alien, with alien values, which is capable of limited communication regarding its preferences. The goal of corrigibility is to formalize your good-faith efforts to take care of your alien to the best of your ability into an algorithm that a computer can follow. Suppose you think of some very unusual idea for taking care of your alien which, according to a few hypotheses you've come up with for what it likes, would make it extremely happy. If you were reasonably paranoid, you might address the issue of unrealized hypotheses on the spot, and attempt to craft a new hypothesis which is compatible with most/all of the data you've seen and also has your unusual idea inadvertently killing the alien. (This is a bit like "murphyjitsu" from CFAR.) If you *aren't* able to generate such a hypothesis, but such a hypothesis does in fact exist, and is the correct hypothesis, and the alien dies after your idea... then you probably aren't super smart.

You have to start somewhere. Discussions like this can help make things clear :) I'm getting value from it... you've given me some things to think about, and I think the murphyjitsu idea is something I hadn't thought of previously :)

I think it often makes sense to reason at an informal level before proceeding to a formal one.

Edit: related discussion here.

That's not the case I'm considering. I'm imagining there are hypotheses which strongly dislike the corner cases. They just happen to be out-voted.

Think of it like this. There are a bunch of hypotheses. All of them agree fairly closely with high probability on plans which are "on-distribution", i.e., similar to what the AI has been able to get feedback from humans about (however it does that). The variation is much higher for "off-distribution" plans.

There will be some on-distribution plans which achieve somewhat-high values for all hypotheses which have significant probability. However, the AI will look for ways to achieve even higher expected utility if possible. Unless there are on-distribution plans which max out utility, it may look off-distribution. This seems plausible because the space of on-distribution plans is "smaller"; there's room for a lot to happen in the off-distribution space. That's why it reaches weird corner cases.

And, since the variation is higher in off-distribution space, there may be some options that really look quite good, but which achieve very low value under some of the plausible hypotheses. In fact, because the different remaining hypotheses *are different*, it seems quite plausible that highly optimized plans have to start making trade-offs which compromise one value for another. (I admit it is possible the search finds a way to just make everything better according to *every* hypothesis. But that is not what the search is *told* to do, not exactly. We can design systems which do something more like that, instead, if that is what we want.)

When I put it that way, another problem with going off-distribution is apparent: even if we do find a way to get better scores according to every plausible hypothesis by going off-distribution, we trust those scores less because they're off-distribution. Of course, we could explicitly try to build a system with the goal of remaining on-distribution. Quantilization follows fairly directly from that :)
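A rough simulation of the on-distribution vs off-distribution picture above, as a sketch under heavy assumptions: hypotheses agree to within ±0.01 on a small set of on-distribution plans, and give independent Gaussian scores on a much larger set of off-distribution plans. The function `simulate`, the plan counts, and the noise model are all hypothetical choices, not anything specified in the discussion.

```python
import random

def simulate(n_hyp=5, n_on=100, n_off=10_000, seed=0):
    rng = random.Random(seed)
    plans = []  # each entry: (region, [utility under each hypothesis])

    # On-distribution: hypotheses agree closely (spread at most 0.02).
    for _ in range(n_on):
        base = rng.uniform(0.4, 0.6)
        plans.append(("on", [base + rng.uniform(-0.01, 0.01) for _ in range(n_hyp)]))

    # Off-distribution: far more plans, and the hypotheses vary independently.
    for _ in range(n_off):
        plans.append(("off", [rng.gauss(0.5, 1.0) for _ in range(n_hyp)]))

    def expected(utils):
        # Equal-weight Bayesian mixture over the hypotheses.
        return sum(utils) / len(utils)

    region, utils = max(plans, key=lambda p: expected(p[1]))
    disagreement = max(utils) - min(utils)
    return region, expected(utils), disagreement

region, score, disagreement = simulate()
print(region, round(score, 2), round(disagreement, 2))
```

In this toy model the expected-utility maximizer ends up off-distribution simply because that space is larger and noisier, and the hypotheses disagree there far more than they do on any on-distribution plan. Whether the chosen plan also scores badly under some hypothesis depends on how the hypotheses are correlated, which this sketch does not try to model.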

I realize I'm playing fast and loose with realizability again, but it seems to me that a system which is capable of being "calibrated", in the sense I defined calibration above, should be able to reason for itself that it is less knowledgeable about off-distribution points and have some kind of prior belief that the score for any particular off-distribution point is equal to the mean score for the entire (off-distribution?) space, and it should need a fair amount of evidence to shift this prior. I'm not necessarily specifying how concretely to achieve this, just saying that it seems like a desideratum for a "calibrated" ML system in the sense that I'm using the term.

Maybe effects like this could be achieved partially through e.g. having different hypotheses be defined on different subsets of the input space, and always including a baseline hypothesis which is just equal to the mean of the entire space.
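One possible reading of the "baseline hypothesis" idea is a simple shrinkage estimate: with no nearby evidence the score defaults to the mean of the whole space, and it takes a fair amount of evidence to move away from it. This is only a sketch under that assumption; `shrunk_score` and the pseudo-count `prior_strength` are hypothetical names, not something proposed in the thread.

```python
def shrunk_score(observations, global_mean, prior_strength=10.0):
    """Blend local evidence with a baseline equal to the mean of the whole space.

    `prior_strength` acts like a pseudo-count of observations at `global_mean`.
    """
    n = len(observations)
    return (sum(observations) + prior_strength * global_mean) / (n + prior_strength)

# Off-distribution point with no feedback: falls back to the baseline.
print(shrunk_score([], global_mean=0.5))           # 0.5
# Two enthusiastic observations barely move the estimate.
print(shrunk_score([1.0, 1.0], global_mean=0.5))   # about 0.58
# Lots of evidence eventually dominates the prior.
print(shrunk_score([1.0] * 990, global_mean=0.5))  # about 0.995
```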

If you want a backup system that also attempts to flag & veto any action that looks off-distribution for the sake of redundancy, that's fine by me too. I think some safety-critical software systems for e.g. space shuttles have been known to do this (do a computation in multiple different ways & aggregate them somehow to mitigate errors in any particular subsystem).

My current understanding of quantilization is "choose randomly from the top X% of actions". I don't see how this helps very much with staying on-distribution... as you say, the off-distribution space is larger, so the majority of actions in the top X% of actions could still be off-distribution.

In any case, quantilization seems like it shouldn't work due to the fragility of value thesis. If we were to order all of the possible configurations of Earth's atoms from best to worst according to our values, the top 1% of those configurations is still mostly configurations which aren't very valuable.

The base distribution you take the top X% of is supposed to be related to the "on-distribution" distribution, such that sampling from the base distribution is very likely to keep things on-distribution, at least if the quantilizer's own actions are the main potential source of distributional shift. This could be the case if the quantilizer is the only powerful AGI in existence, and the actions of a powerful AGI are the only thing which would push things into sufficiently "off-distribution" possibilities for there to be a concern. (I'm not saying these are entirely reasonable assumptions; I'm just saying that this is one way of thinking about quantilization.)

The base distribution quantilization samples from is about actions, or plans, or policies, or things like that -- not about configurations of atoms.

So, you should imagine a robot sending random motor commands to its actuators, not highly intelligently steering the planet into a random configuration.
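To make the role of the base distribution concrete, here is a minimal sketch of a quantilizer using a finite-sample approximation: draw candidate actions from the base distribution, rank them by a proxy utility, and pick uniformly from the top q fraction. The names `quantilize`, `random_motor_command`, and `proxy_utility`, and the uniform "motor command" base distribution, are illustrative assumptions.

```python
import random

def quantilize(base_sampler, proxy_utility, q, n_samples, rng):
    """Pick uniformly at random from the top q fraction of base-distribution samples."""
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    candidates.sort(key=proxy_utility, reverse=True)
    top_k = max(1, int(q * n_samples))
    return rng.choice(candidates[:top_k])

def random_motor_command(rng):
    # Base distribution: a random motor command in [-1, 1] (assumed for illustration).
    return rng.uniform(-1.0, 1.0)

def proxy_utility(command):
    # Proxy that always prefers larger commands; an unconstrained maximizer
    # would push this far outside anything the base distribution produces.
    return command

rng = random.Random(0)
action = quantilize(random_motor_command, proxy_utility, q=0.1, n_samples=1000, rng=rng)
print(action)
```

Whatever the proxy says, the chosen action is always something the base distribution actually produced. The top X% is taken over samples of random motor commands, not over all conceivable outcomes.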