Vladimir Nesov

Wiki Contributions


Against Time in Agent Models

Which propositions are valid with respect to time? How can we only allow propositions which don't get invalidated (EG if we don't know yet which will and will not be), and also, why do we want that?

This was just defining/motivating terms (including "validity") for this context, the technical answer is to look at the definition of specialization preorder, when it's being suggestively called "logical time". If an open is a "proposition", and a point being contained in an open is "proposition is true at that point", and a point stronger in specialization order than another point is "in the future of the other point", then in these terms we can say that "if a proposition is true at a point, it's also true at a future point", or that "propositions are valid with respect to time going forward", in the sense that their truth is preserved when moving from a point to a future point.

Logical time is intended to capture decision making, with future decisions advancing the agent's point of view in logical time. So if an agent reasons only in terms of propositions valid with respect to advancement of logical time, then any knowledge it accumulated remains valid as it makes decisions, that's some of the motivation for looking into reasoning in terms of such propositions.

You're saying a lot about what the "objects of study" are and aren't, but not very concretely, and I'm not getting the intuition for why this is important.

This is mostly about how domain theory describes computations, the interesting thing is how the computations are not necessarily in the domains at all, they only leave observations there, and it's the observations that the opens are ostensibly talking about, yet the goal might be to understand the computations, not just the observations (in program semantics, the goal is often to understand just the observations though, and a computation might be defined to only be its observed behavior). So one point I wanted to make is to push against the perspective where points of a space are what the logic of opens is intended to reason about, when the topology is not Frechet (has nontrivial specialization preorder).

But the important question for a proposed modeling language is how well it models what we're after.

Yeah, I've got nothing, just a sense of direction and a lot of theory to study, or else there would've been a post, not just a comment triggered by something on a vaguely similar topic. So this thread is in the same spirit as a comment I left a few months ago to a crackpot post, but that one was even more speculative, somewhat appropriate in a place like that...

Against Time in Agent Models

If you mark something like causally inescapable subsets of spacetime (not sure how this should be called), which are something like all unions of future lightcones, as open sets, then specialization preorder on spacetime points will agree with time. This topology on spacetime is non-Frechet (has nontrivial specialization preorder), while the relative topologies it gives on space-like subspaces (loci of states of the world "at a given time" in a loose sense) are Hausdorff, the standard way of giving a topology for such spaces. This seems like the most straightforward setting for treating physical time as logical time.

Against Time in Agent Models

I like specialization preorder as a setting for formulating these concepts. In a topological space, point y is stronger (more specialized) than point x iff all opens containing x also contain y. If opens are thought of as propositions, and specialization order as a kind of ("logical") time, with stronger points being in the future of weaker points, then this says that propositions must be valid with respect to time (that is, we want to only allow propositions that don't get invalidated). This setting motivates thinking of points not as objects of study, but as partial observations of objects of study, their shadows that develop according to specialization preorder. If a proposition is true about some partial observation of an object (a point of the space), it remains true when it develops further (in the future, for stronger points). The best we can capture objects of study is with neighborhood filters, but the conceptual distinction suggests that even in a sober space the objects of study are not necessarily points, they are merely observed through points.

This is just what Scott domains or more generally algebraic dcpos with Scott topology talk about, when we start with a poset of finite observations (about computations, the elusive objects of study), which is the specialization preorder of its Alexandrov topology, which then becomes Scott topology after soberification, adding points for partial observations that can be expressed in terms of Alexandrov opens on finite observations. Specialization order follows a computation, and opens formulate semidecidable properties. There are two different ways in which a computation is approximated: with a weaker observation/point, and with a weaker specification/proposition/open. One nice thing here is that we can recover points from opens, and then finite observations from the specialization poset of all partial observations/theories/ideals (as compact elements of a poset). So the different concepts fit together well, the rhetoric of observations and logical time has technical content that can be extracted just from the opens/propositions. This becomes even more concrete for coherent spaces, where finite observations are finite cliques on webs (of "atomic observations").

(This is mostly a keyword dump, pointing to standard theory that offers a way of making sense of logical time. The interesting questions are about how to make use of this to formulate FDT and avoid spurious proofs, possibly by reasoning at a particular point of a space, the logical moment of decision, without making use of its future. A distinction this point of view enforces that is usually missing in discussions of decision theory or reasoning about programs is between approximation with weaker observations vs. with weaker propositions. This is the distinction between different logical times and different states of knowledge about a computation.)

Morality is Scary

My point is that the alignment (values) part of AI alignment is least urgent/relevant to the current AI risk crisis. It's all about corrigibility and anti-goodharting. Corrigibility is hope for eventual alignment, and anti-goodharting makes inadequacy of current alignment and imperfect robustness of corrigibility less of a problem. I gave the relevant example of relatively well-understood values, preference for lower x-risks. Other values are mostly relevant in how their understanding determines the boundary of anti-goodharting, what counts as not too weird for them to apply, not in what they say is better. If anti-goodharting holds (too weird and too high impact situations are not pursued in planning and possibly actively discouraged), and some sort of long reflection is still going on, current alignment (details of what the values-in-AI prefer, as opposed to what they can make sense of) doesn't matter in the long run.

I include maintaining a well-designed long reflection somewhere into corrigibility, for without it there is no hope for eventual alignment, so a decision theoretic agent that has long reflection within its preference is corrigible in this sense. Its corrigibility depends on following a good decision theory, so that there actually exists a way for the long reflection to determine its preference so that it causes the agent to act as the long reflection wishes. But being an optimizer it's horribly not anti-goodharting, so can't be stopped and probably eats everything else.

An AI with anti-goodharting turned to the max is the same as AI with its stop button pressed. An AI with minimal anti-goodharting is an optimizer, AI risk incarnate. Stronger anti-goodharting is a maintenance mode, opportunity for fundamental change, weaker anti-goodharting makes use of more developed values to actually do the things. So a way to control the level of anti-goodharting in an AI is a corrigibility technique. The two concepts work well with each other.

Vanessa Kosoy's Shortform

Goodharting is about what happens in situations where "good" is undefined or uncertain or contentious, but still gets used for optimization. There are situations where it's better-defined, and situations where it's ill-defined, and an anti-goodharting agent strives to optimize only within scope of where it's better-defined. I took "lovecraftian" as a proxy for situations where it's ill-defined, and base distribution of quantilization that's intended to oppose goodharting acts as a quantitative description of where it's taken as better-defined, so for this purpose base distribution captures non-lovecraftian situations. Of the options you listed for debate, the distribution from imitation learning seems OK for this purpose, if amended by some anti-weirdness filters to exclude debates that can't be reliably judged.

The main issues with anti-goodharting that I see is the difficulty of defining proxy utility and base distribution, the difficulty of making it corrigible, not locking-in into fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.

My point is that if anti-goodharting and not development of quantilization is taken as a goal, then calibration of quantilization is not the kind of thing that helps, it doesn't address the main issues. Like, even for quantilization, fiddling with base distribution and proxy utility is a more natural framing that's strictly more general than fiddling with the quantilization parameter. If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?

The use of debates for amplification in this framing is for corrigibility part of anti-goodharting, a way to redefine utility proxy and expand the base distribution, learning from how the debates at the boundary of the previous base distribution go. Quantilization seems like a fine building block for this, sampling slightly lovecraftian debates that are good, which is the direction where we want to expand the scope.

Morality is Scary

I'm leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values, preventing overoptimized weird/controversial situations, even at the cost of astronomical waste. Absence of x-risks, including AI risks, is generally good. Within this environment, the civilization might be able to eventually work out more about values, expanding the scope of their definition and thus allowing stronger optimization. Here corrigibility is in part about continually picking up the values and their implied scope from the predictions of how they would've been worked out some time in the future.

Vanessa Kosoy's Shortform

I'm not sure this attacks goodharting directly enough. Optimizing a system for proxy utility moves its state out-of-distribution where proxy utility generalizes training utility incorrectly. This probably holds for debate optimized towards intended objectives as much as for more concrete framings with state and utility.

Dithering across the border of goodharting (of scope of a proxy utility) with quantilization is actionable, but isn't about defining the border or formulating legible strategies for what to do about optimization when approaching the border. For example, one might try for shutdown, interrupt-for-oversight, or getting-back-inside-the-borders when optimization pushes the system outside, which is not quantilization. (Getting-back-inside-the-borders might even have weird-x-risk prevention as a convergent drive, but will oppose corrigibility. Some version of oversight/amplification might facilitate corrigibility.)

Debate seems more useful for amplification, extrapolating concepts in a way humans would, in order to become acceptable proxies in wider scopes, so that more and more debates become non-lovecraftian. This is a different concern from setting up optimization that works with some fixed proxy concepts as given.

P₂B: Plan to P₂B Better

more planners

This seems tenuous compared to "more planning substrate". Redundancy and effectiveness specifically through setting up a greater number of individual planners, even if coordinated, is likely an inferior plan. There are probably better uses of hardware that don't have this particular shape.

My take on Vanessa Kosoy's take on AGI safety

I'd say alignment should be about values, so only your "even better alignment" qualifies. The non-agentic AI safety concepts like corrigibility, that might pave the way to aligned systems if controllers manage to keep their values throughout the process, are not themselves examples of alignment.

The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument

Sleeping Beauty and other anthropic problems considered in terms of bets illustrate how most ways of assigning anthropic probabilities are not about beliefs of fact in a general sense, their use is more of appeal to consequences. At the very least the betting setup should remain a salient companion to these probabilities whenever they are produced. Anthropic probabilities make no more sense on their own, without the utilities, than whatever arbitrary numbers you get after applying Bolker-Jeffrey rotation. The main difference is in how the utilities of anthropics are not as arbitrary, so failing to carefully discuss what they are in a given setup makes the whole construction ill-justified.

Load More