Vladimir Nesov


Rant on Problem Factorization for Alignment

The agents at the top of most theoretical infinite bureaucracies should be thought of as already superhumanly capable and aligned, not as weak language models. IDA iteratively retrains models on the output of the bureaucracy, so agents at higher levels of the theoretical infinite bureaucracy are stronger (from later amplification/distillation epochs) than those at lower levels. It doesn't matter if an infinite bureaucracy instantiated for a certain agent fails to solve important problems, as long as the next epoch does better.

For HCH specifically, this is normally intended to apply to the HCHs, not to humans in it, but then the abstraction of humans being actual humans (exact imitations) leaks, and we start expecting something other than actual humans there. If this is allowed, if something less capable/aligned than humans can appear in HCH, then by the same token these agents should improve with IDA epochs (perhaps not of HCH, but of other bureaucracies) and those "humans" at the top of an infinite HCH should be much better than the starting point, assuming the epochs improve things.

Counterfactuals are Confusing because of an Ontological Shift

I'm guessing a good way to think about free will under determinism is with logical time that's different from physical time. The points/models that advance in logical time are descriptions of the environment with different amounts of detail, so that you advance in logical time by filling in more details, and sometimes it's your decisions that are filled in (at all of your instances and predictions-of simultaneously). This is different from physical time, where you fill in details in a particular way determined by laws of physics.

The ingredient of this point of view that's usually missing is that concrete models of the environment (individual points of states of knowledge) should be allowed to be partial, to specify only some of the data about the environment. Then the actual development of models in response to decisions is easier to see; it's not inherently a kind of illusion born of lack of omniscience. This is in contrast to the usual expectation that the only thing with partial details is the states of knowledge about complete models of the environment (with all possible details already filled in), so that partiality is built on top of lack of partiality.

The filling-in of partial models with logical time probably needs to be value-laden. Most counterfactuals are fictional, and the legible details of decision relevant fiction should preserve its moral significance. So it's veering in the direction of "social convention", though in a normative way, in the sense that value is not up for grabs. On the other hand, it's a possible way of understanding CEV as a better UDT, instead of as a separate additional construction with its own desiderata (the simulations of possible civilizations from CEV reappear in decision theory as counterfactuals developing in logical time).

Determinism doesn't seem like a central example of ontological shift, and bargaining seems like the concept of dealing with more general ontological shifts. You bargain with your variant in a different ontological context for doing valuable things. This starts with extrapolation of value to that context, so that it's not beyond the goodhart boundary, you grow confident in legible proxy goals that talk about that territory. It also seems to be a better framing for updatelessness, as bargaining among possible future epistemic states, acausal trade among them, or at least those that join the coalition of abiding by the decision of the epistemic past. This way, considering varying possible future moral states (~partial probutility functions) is more natural. The motivation to do that is so that the assumption of unchanging preference is not baked into the decision theory, and it gets a chance of modeling mild optimization.

A Data limited future

An upload (an exact imitation of a human) is the most straightforward way of securing time for alignment research, except it's not plausible in our world for uploads to be developed before AGIs. The plausible similar thing is more capable language/multimodal models, steeped in human culture, where alignment guarantees at least a priori look very dubious. And an upload probably needs to be value-laden to be efficient enough to give an advantage, while remaining exact in morally relevant ways, though there's a glimmer of hope generalization can capture this without a need to explicitly set up a fixpoint through extrapolated values. Doing the same with Tool AIs or something is only slightly less speculative than directly developing aligned AGIs without that miracle, so the advantage of an upload is massive.

A Data limited future

In this history, uploads (via data from passive BMIs) precede AGIs, a stronger prospect of alignment.

Reward is not the optimization target

The conjecture I brought up that deceptive alignment relies on selected policies being optimizers gives me the idea that something similar to your argument (where the target of optimization wouldn't matter, only the fact of optimization for anything at all) would imply that deceptive alignment is less likely to happen. I didn't mean to claim that I'm reading you as making this implication in the post, or believing it's true or relevant, that's instead an implication I'm describing in my comment.

Humans Reflecting on HRH

A point that doesn't seem to be in the water supply is that even superintelligences won't have (unerringly accurate estimates of) results of CEV to work with. Any predictions of values are goodhart cursed proxy values. Predictions that are not value-laden are even worse. So no AGIs that would want to run a CEV would be utility maximizers, and AGIs that are utility maximizers are maximizing something that isn't CEV of anything, including that of humanity.

Thus utility maximization is necessarily misaligned, not just very hard to align, until enough time has already passed for CEV to run its course, to completion and not merely in foretelling. Which likely never actually happens (reflection is unbounded), so utility maximization can only be approached with increasingly confident mild optimization. And there is currently mostly confusion on what mild optimization does as decision theory.
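
The goodhart point above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model of CEV or of any actual AGI: "true value" and "prediction error" are both taken to be independent standard Gaussians, and the function name `goodhart_gap` and its parameters are hypothetical, introduced just for this example. It measures how much the proxy overstates the true value of whatever option the proxy ranks highest, and compares weak selection pressure (2 options) with strong selection pressure (1000 options).

```python
import random
import statistics


def goodhart_gap(n_options, noise_sd, trials=500, seed=0):
    """Average amount by which a noisy proxy overstates the true value
    of the option that the proxy itself ranks highest.

    Assumption (illustrative only): true values and proxy errors are
    independent Gaussians; nothing here comes from the original comment.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    gaps = []
    for _ in range(trials):
        # Hypothetical "true" values of the available options.
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        # The proxy is the true value plus independent prediction error.
        proxy_values = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # A maximizer picks whichever option the proxy likes best.
        best = max(range(n_options), key=proxy_values.__getitem__)
        gaps.append(proxy_values[best] - true_values[best])
    return statistics.mean(gaps)


# Same proxy quality, different amounts of optimization pressure.
mild = goodhart_gap(n_options=2, noise_sd=1.0)
hard = goodhart_gap(n_options=1000, noise_sd=1.0)
print(f"overestimate with 2 options:    {mild:.2f}")
print(f"overestimate with 1000 options: {hard:.2f}")
```

Under these assumptions the overestimate grows as the search gets wider, even though the quality of the proxy is unchanged, which is the sense in which harder optimization of a predicted value is worse off than milder optimization of the same prediction. With `noise_sd=0.0` the gap vanishes, corresponding to the unavailable case of having the exact results rather than a prediction.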

Abstracting The Hardness of Alignment: Unbounded Atomic Optimization

MIRI’s early work (for example modal combat and work on Loeb’s theorem) assumed that UAO would be instantiated through hand-written AI programs that were just good enough to improve themselves slightly, leading to an intelligence explosion (with a bunch of other assumptions).

Agent foundations work makes / needs no assumptions about how first AGIs are written, or intelligence explosion, it's not about that. It's about deconfusion, noticing and formulating concepts that help with thinking about agents-in-a-very-loose-sense.

Reward is not the optimization target

The deceptive alignment worry is that there is some goal about the real world at all. Deceptive alignment breaks robustness of any properties of policy behavior, not just the property of following reward as a goal in some unfathomable sense.

So refuting this worry requires quieting the more general hypothesis that RL selects optimizers with any goals of their own, no matter what those goals are. It's only the argument for why this seems plausible that needs to refer to reward as related to the goal of such an optimizer, but the way the argument goes suggests that the optimizer so selected would instead have a different goal. Specifically, optimizing for an internalized representation of reward seems like a great way of being rewarded and surviving changes of weights; such optimizers would be straightforwardly selected if there are no alternatives to that closer in reach. Since RL is not perfect, there would be optimizers for other goals nearby, goals that care about the real world (and not just about optimizing the reward exclusively, meticulously ignoring everything else). If an optimizer like that succeeds in becoming deceptively aligned (let alone gradient hacking), the search effectively stops and an honestly aligned optimizer is never found.

Corrigibility, anti-goodharting, mild optimization, unstable current goals, and goals that are intractable about distant future seem related (though not sufficient for alignment without at least value-laden low impact). The argument about deceptive alignment is a problem for using RL to find anything in this class, something that is not an optimizer at all and so is not obviously misaligned. It would be really great if RL doesn't tend to select optimizers!

TurnTrout's shortform feed

This reasoning seems to prove too much.

It does add up to normality, it's not proving things about current behavior or current-goal content of near-future AGIs. An unknown normative target doesn't say not to do the things you normally do, it's more of a "I beseech you, in the bowels of Christ, to think it possible you may be mistaken" thing.

The salient catastrophic alignment failure here is to make AGIs with stable values that capture some variation on current unstable human values, and won't allow their further development. If the normative target is very far from current unstable human values, making current values stable falls very short of the normative target and makes the future relatively worthless.

That's the kind of thing my point is intended to nontrivially claim, that AGIs with any stable immediately-actionable goals that can be specified in the following physical-time decades or even centuries are almost certainly catastrophically misaligned. So AGIs must have unstable goals, softly optimized-for, aligned to current (or value-laden predicted future) human unstable goals, mindful of goodhart.

I disagree with CEV as I recall it

The kind of CEV I mean is not very specific, it's more of a (sketch of a solution to the) problem of doing a first pass on preparing to define goals for an actual optimizer, one that doesn't need to worry as much about goodhart and so can make more efficient use of the future at scale, before expansion of the universe makes more stuff unreachable.

So when I say "CEV" I mostly just mean "normative alignment target", with some implied clarifications on what kind of thing it might be.

it's more likely than not that your stable values like dogs too

That's a very status quo anchored thing. I don't think dog-liking is a feature of values stable under reflection if the environment is allowed to change completely, even if in the current environment dogs are salient. Stable values are about the whole world, with all its AGI-imagined femtotech-rewritten possibilities. This world includes dogs in some tiny corner of it, but I don't see how observations of current attitudes hold much hope in offering clues about legible features of stable values. It is much too early to tell what stable values could possibly be. That's why CEV, or rather the normative alignment target, as a general concept that doesn't particularly anchor to the details Yudkowsky talked about, but referring to stable goals in this very wide class of environments, seems to me crucially important to keep distinct from current human values.

Another point is that attempting to ask what current values even say about very unusual environments doesn't work: it's so far from the training distributions that any response is pure noise. Current concepts are not useful for talking about features of sufficiently unusual environments, you'd need new concepts specialized for those environments. (Compare with asking what CEV says about currently familiar environments.)

And so there is this sandbox of familiar environments that any near-term activity must remain within on pain of goodhart-cursing outcomes that step outside of it, because there is no accurate knowledge of utility in environments outside of it. The project of developing values beyond the borders of currently comprehensible environments is also a task of volition extrapolation, extending the goodhart boundary in desirable directions by pushing on it from the inside (with reflection on values, not with optimization based on bad approximations of values).

Robustness to Scaling Down: More Important Than I Thought

This is a useful idea, it acts to complement the omnipotence test where you ask if AI as a whole still does the right thing if it's scaled up to an absurd degree (but civilization outside the AI isn't scaled up, which is like its part for alignment purposes). In particular, any reflectively stable maximizer that's not aimed exactly and with no approximations at CEV fails this because goodhart. The traditional answer is to aim it exactly, while the more recent answer is to prevent maximization at the decision theory level, so that acting very well is still not maximization.

Robustness to scaling down instead makes some parts of the system ineffectual, even if that shouldn't plausibly happen, and considers what happens then, asks if the other parts would take advantage and cause trouble. What if civilization, seen as a part of the AI for purposes of alignment, holding its values, doesn't work very well, would AI-except-civilization cause trouble?

I imagine the next step should have some part compromised by a capable adversary, acting purposefully to undermine the system. Robustness to catastrophic failure in a part of the design rather than to scaling down. This seems related to inner alignment and corrigibility: making sure parts don't lose their purposes, while having the parts themselves cooperate with fixing their purposes and not acting outside their purposes.
