Vladimir Nesov

Wiki Contributions


Morality is Scary

My point is that the alignment (values) part of AI alignment is least urgent/relevant to the current AI risk crisis. It's all about corrigibility and anti-goodharting. Corrigibility is hope for eventual alignment, and anti-goodharting makes inadequacy of current alignment and imperfect robustness of corrigibility less of a problem. I gave the relevant example of relatively well-understood values, preference for lower x-risks. Other values are mostly relevant in how their understanding determines the boundary of anti-goodharting, what counts as not too weird for them to apply, not in what they say is better. If anti-goodharting holds (too weird and too high impact situations are not pursued in planning and possibly actively discouraged), and some sort of long reflection is still going on, current alignment (details of what the values-in-AI prefer, as opposed to what they can make sense of) doesn't matter in the long run.

I include maintaining a well-designed long reflection somewhere into corrigibility, for without it there is no hope for eventual alignment, so a decision theoretic agent that has long reflection within its preference is corrigible in this sense. Its corrigibility depends on following a good decision theory, so that there actually exists a way for the long reflection to determine its preference so that it causes the agent to act as the long reflection wishes. But being an optimizer it's horribly not anti-goodharting, so can't be stopped and probably eats everything else.

An AI with anti-goodharting turned to the max is the same as AI with its stop button pressed. An AI with minimal anti-goodharting is an optimizer, AI risk incarnate. Stronger anti-goodharting is a maintenance mode, opportunity for fundamental change, weaker anti-goodharting makes use of more developed values to actually do the things. So a way to control the level of anti-goodharting in an AI is a corrigibility technique. The two concepts work well with each other.

Vanessa Kosoy's Shortform

Goodharting is about what happens in situations where "good" is undefined or uncertain or contentious, but still gets used for optimization. There are situations where it's better-defined, and situations where it's ill-defined, and an anti-goodharting agent strives to optimize only within scope of where it's better-defined. I took "lovecraftian" as a proxy for situations where it's ill-defined, and base distribution of quantilization that's intended to oppose goodharting acts as a quantitative description of where it's taken as better-defined, so for this purpose base distribution captures non-lovecraftian situations. Of the options you listed for debate, the distribution from imitation learning seems OK for this purpose, if amended by some anti-weirdness filters to exclude debates that can't be reliably judged.

The main issues with anti-goodharting that I see is the difficulty of defining proxy utility and base distribution, the difficulty of making it corrigible, not locking-in into fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.

My point is that if anti-goodharting and not development of quantilization is taken as a goal, then calibration of quantilization is not the kind of thing that helps, it doesn't address the main issues. Like, even for quantilization, fiddling with base distribution and proxy utility is a more natural framing that's strictly more general than fiddling with the quantilization parameter. If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?

The use of debates for amplification in this framing is for corrigibility part of anti-goodharting, a way to redefine utility proxy and expand the base distribution, learning from how the debates at the boundary of the previous base distribution go. Quantilization seems like a fine building block for this, sampling slightly lovecraftian debates that are good, which is the direction where we want to expand the scope.

Morality is Scary

I'm leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values, preventing overoptimized weird/controversial situations, even at the cost of astronomical waste. Absence of x-risks, including AI risks, is generally good. Within this environment, the civilization might be able to eventually work out more about values, expanding the scope of their definition and thus allowing stronger optimization. Here corrigibility is in part about continually picking up the values and their implied scope from the predictions of how they would've been worked out some time in the future.

Vanessa Kosoy's Shortform

I'm not sure this attacks goodharting directly enough. Optimizing a system for proxy utility moves its state out-of-distribution where proxy utility generalizes training utility incorrectly. This probably holds for debate optimized towards intended objectives as much as for more concrete framings with state and utility.

Dithering across the border of goodharting (of scope of a proxy utility) with quantilization is actionable, but isn't about defining the border or formulating legible strategies for what to do about optimization when approaching the border. For example, one might try for shutdown, interrupt-for-oversight, or getting-back-inside-the-borders when optimization pushes the system outside, which is not quantilization. (Getting-back-inside-the-borders might even have weird-x-risk prevention as a convergent drive, but will oppose corrigibility. Some version of oversight/amplification might facilitate corrigibility.)

Debate seems more useful for amplification, extrapolating concepts in a way humans would, in order to become acceptable proxies in wider scopes, so that more and more debates become non-lovecraftian. This is a different concern from setting up optimization that works with some fixed proxy concepts as given.

P₂B: Plan to P₂B Better

more planners

This seems tenuous compared to "more planning substrate". Redundancy and effectiveness specifically through setting up a greater number of individual planners, even if coordinated, is likely an inferior plan. There are probably better uses of hardware that don't have this particular shape.

My take on Vanessa Kosoy's take on AGI safety

I'd say alignment should be about values, so only your "even better alignment" qualifies. The non-agentic AI safety concepts like corrigibility, that might pave the way to aligned systems if controllers manage to keep their values throughout the process, are not themselves examples of alignment.

The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument

Sleeping Beauty and other anthropic problems considered in terms of bets illustrate how most ways of assigning anthropic probabilities are not about beliefs of fact in a general sense, their use is more of appeal to consequences. At the very least the betting setup should remain a salient companion to these probabilities whenever they are produced. Anthropic probabilities make no more sense on their own, without the utilities, than whatever arbitrary numbers you get after applying Bolker-Jeffrey rotation. The main difference is in how the utilities of anthropics are not as arbitrary, so failing to carefully discuss what they are in a given setup makes the whole construction ill-justified.

Oracle predictions don't apply to non-existent worlds

The prediction is why you grab your coat, it's both meaningful and useful to you, a simple counterexample to the sentiment that since correctness scope of predictions is unclear, they are no good. The prediction is not about the coat, but that dependence wasn't mentioned in the arguments against usefulness of predictions above.

Oracle predictions don't apply to non-existent worlds

IMO, either the Oracle is wrong, or the choice is illusory

This is similar to determinism vs. free will, and suggests the following example. The Oracle proclaims: "The world will follow the laws of physics!". But in the counterfactual where an agent takes a decision that won't actually be taken, the fact of taking that counterfactual decision contradicts the agent's cognition following the laws of physics. Yet we want to think about the world within the counterfactual as if the laws of physics are followed.

Oracle predictions don't apply to non-existent worlds

No, I suspect it's a correct ingredient of counterfactuals, one I didn't see discussed before, not an error restricted to a particular decision theory. There is no contradiction in considering each of the counterfactuals as having a given possible decision made by the agent and satisfying the Oracle's prediction, as the agent doesn't know that it won't make this exact decision. And if it does make this exact decision, the prediction is going to be correct, just like the possible decision indexing the counterfactual is going to be the decision actually taken. Most decision theories allow explicitly considering different possible decisions, and adding correctness of the Oracle's prediction into the mix doesn't seem fundamentally different in any way, it's similarly sketchy.

Load More