

Testing The Natural Abstraction Hypothesis: Project Update

The #P-complete problem is to calculate the distribution of some variables in a Bayes net given some other variables in the Bayes net, without any particular restrictions on the net or on the variables chosen.
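To make the problem concrete, here is a minimal sketch of brute-force inference on a tiny, hypothetical three-node net (Rain, Sprinkler, WetGrass). The net, the probabilities and the function names are all made up for illustration. The point is only that the naive method sums over every assignment of the unobserved variables, so the work grows exponentially with the number of variables.

```python
from itertools import product

# Hypothetical net: Rain -> Sprinkler, (Rain, Sprinkler) -> WetGrass.
# All probabilities below are illustrative assumptions.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {                      # P(sprinkler | rain)
    True: {True: 0.01, False: 0.99},
    False: {True: 0.4, False: 0.6},
}
P_WET = {                            # P(wet = True | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.8,
    (False, True): 0.9, (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """Joint probability of one full assignment, read off the net's factors."""
    p_wet = P_WET[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[rain][sprinkler] * (p_wet if wet else 1.0 - p_wet)

def posterior_rain_given_wet(wet=True):
    """P(rain | wet) by summing the joint over every unobserved assignment."""
    weights = {True: 0.0, False: 0.0}
    for rain, sprinkler in product([True, False], repeat=2):
        weights[rain] += joint(rain, sprinkler, wet)
    total = sum(weights.values())
    return {rain: w / total for rain, w in weights.items()}
```

With n binary unobserved variables the loop visits 2^n assignments. That is fine here and hopeless for a large net with no exploitable structure.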

Formal statement of the Telephone Theorem: We have a sequence of Markov blankets forming a Markov chain $M_1 \to M_2 \to \dots$. Then in the limit $n \to \infty$, $f_n(M_n)$ mediates the interaction between $M_1$ and $M_n$ (i.e. the distribution factors according to $M_1 \to f_n(M_n) \to M_n$), for some $f_n$ satisfying

$$f_{n+1}(M_{n+1}) = f_n(M_n)$$

with probability 1 in the limit.
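A toy illustration of the flavor of this result, not a proof and not taken from the post: in the sketch below each blanket is a pair of bits, one copied exactly at every step (a conserved quantity) and one flipped with some probability. The chain, the `FLIP` value and all function names are assumptions chosen for illustration. The mutual information between the first and the n-th blanket is computed exactly and should approach the one bit carried by the conserved component as n grows.

```python
from itertools import product
from math import log2

FLIP = 0.1  # assumed per-step flip probability for the non-conserved bit
STATES = list(product([0, 1], repeat=2))  # (conserved bit c, noisy bit x)

def step_prob(s, t):
    """P(next = t | current = s): c is copied exactly, x flips with prob FLIP."""
    (c, x), (c2, x2) = s, t
    if c2 != c:
        return 0.0
    return FLIP if x2 != x else 1.0 - FLIP

def joint_first_and_nth(n):
    """Joint distribution of (first blanket, n-th blanket), first one uniform."""
    # cond[s][t] = P(k-th blanket = t | first blanket = s), starting at k = 1
    cond = {s: {t: 1.0 if s == t else 0.0 for t in STATES} for s in STATES}
    for _ in range(n - 1):
        cond = {s: {t: sum(cond[s][u] * step_prob(u, t) for u in STATES)
                    for t in STATES}
                for s in STATES}
    return {(s, t): cond[s][t] / len(STATES) for s in STATES for t in STATES}

def mutual_information(joint):
    """Mutual information in bits between the two coordinates of `joint`."""
    p_first, p_last = {}, {}
    for (s, t), p in joint.items():
        p_first[s] = p_first.get(s, 0.0) + p
        p_last[t] = p_last.get(t, 0.0) + p
    return sum(p * log2(p / (p_first[s] * p_last[t]))
               for (s, t), p in joint.items() if p > 0)

def information_at_distance(n):
    return mutual_information(joint_first_and_nth(n))
```

Calling `information_at_distance(n)` for increasing n shows the noisy bit's contribution decaying while the conserved bit's contribution stays put.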

Information At A Distance Is Mediated By Deterministic Constraints

More like: exponential family distributions are a universal property of information-at-a-distance in large complex systems. So, we can use exponential models without any loss of generality when working with information-at-a-distance in large complex systems.

That's what I hope to show, anyway.

Information At A Distance Is Mediated By Deterministic Constraints

Yup, that's the direction I want. If the distributions are exponential family, then that dramatically narrows down the space of distributions which need to be represented in order to represent abstractions in general. That means much simpler data structures - e.g. feature functions and Lagrange multipliers, rather than whole distributions.
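For concreteness, the standard maximum-entropy form of an exponential family distribution is sketched below. The symbols $f$ (feature functions), $\lambda$ (Lagrange multipliers), $\mu$ and $Z$ are conventional choices rather than notation from this thread, and a uniform base measure over a discrete $X$ is assumed.

$$P[X = x] = \frac{1}{Z(\lambda)} \exp\left(\lambda^T f(x)\right), \qquad Z(\lambda) = \sum_{x} \exp\left(\lambda^T f(x)\right)$$

This is the distribution which maximizes entropy subject to the constraint $E[f(X)] = \mu$, with $\lambda$ chosen to satisfy that constraint. Representing it takes only $f$ and a vector $\lambda$, rather than a probability for every value of $x$.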

Information At A Distance Is Mediated By Deterministic Constraints

Roughly speaking, the generalized KPD says that if the long-range correlations are low dimensional, then the whole distribution is exponential family (modulo a few "exceptional" variables). The theorem doesn't rule out the possibility of high-dimensional correlations, but it narrows down the possible forms a lot if we can rule out high-dimensional correlations some other way. That's what I'm hoping for: some simple/common conditions which limit the dimension of the long-range correlations, so that gKPD can apply.

This post says that those long range correlations have to be mediated by deterministic constraints, so if the dimension of the deterministic constraints is low, then that's one potential route. Another potential route is some kind of information network flow approach - i.e. if lots of information is conserved along one "direction", then that should limit information flow along "orthogonal directions", which would mean that long-range correlations are limited between "most" local chunks of the graph.

The alignment problem in different capability regimes

Claim: the core of the alignment problem is conserved across capability levels. If a particular issue only occurs at a particular capability level, then the issue is usually "not really about alignment" in some sense.

Roughly speaking, if I ask a system for something, and then the result is not really what I wanted, but the system "could have" given the result I wanted in some sense, then that's an alignment problem regardless of whether the system is a superintelligent AI or Google Maps. Whether it's a simple system with a bad user interface, or a giant ML system with an unfriendly mesa-optimizer embedded in it, the conceptual core of the problem isn't all that different.

The difference is mainly in how-bad-it-is for the system to be misaligned (for a given degree-of-misalignment). That does have important implications for how we think about AI safety - e.g. we can try to create systems which are reasonably safe without really solving the alignment problem. But I think it's useful to distinguish safety vs alignment here - e.g. a proposal to make an AI safe by making sure it doesn't do anything very far out of the training distribution might be a reasonable safety proposal without really saying much about the alignment problem.

Similarly, proposals along the lines of "simulate a human working on the alignment problem for a thousand years" are mostly safety proposals, and pass the buck on the alignment parts of the problem. (Which is not necessarily bad!)

The distinction matters because, roughly speaking, alignment advances should allow us to leverage more-capable systems while maintaining any given safety level. On the other hand, safety-without-alignment mostly chooses a point on the safety-vs-capabilities pareto surface without moving that surface. (Obviously this is a severe oversimplification of a problem with a lot more than two dimensions, but I still think it's useful.)

Information At A Distance Is Mediated By Deterministic Constraints

We can still view these as travelling through many layers - the light waves have to propagate through many lightyears of mostly-empty space (and it could attenuate or hit things along the way). The photo has to last many years (and could randomly degrade a little or be destroyed at any moment along the way).

What makes it feel like "one hop" intuitively is that the information is basically-perfectly conserved at each "step" through spacetime, and there's a symmetry in how the information is represented.

Welcome & FAQ!

I recommend that the title make it clearer that non-members can now submit alignment forum content for review, since this post is cross-posted on LW.

Knowledge is not just precipitation of action

Here's a similarly-motivated model which I have found useful for the knowledge of economic agents.

Rather than imagining that agents choose their actions as a function of their information (which is the usual picture), imagine that agents can choose their action for every world-state. For instance, if I'm a medieval smith, I might want to treat my iron differently depending on its composition.

In economic models, it's normal to include lots of constraints on agents' choices - like a budget constraint, or a constraint that our medieval smith cannot produce more than n plows per unit of iron. With agents choosing their actions in every world, we can introduce information as just another constraint: if I don't have information distinguishing two worlds, then I am constrained to take the same action in those two worlds. If the medieval smith cannot distinguish iron with two different compositions, then the action taken in those two worlds must be the same.

One interesting feature of this model is that "knowledge goods" can be modeled quite naturally. In our smith example: if someone hands the smith a piece of paper which has different symbols written on it in worlds where the iron has different composition, and the smith can take different actions depending on what the paper says, then the smith can use that to take different actions in worlds where the iron has different composition.
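A minimal sketch of this model, with made-up numbers: the smith picks an action for every world, subject to the constraint that worlds he cannot tell apart get the same action. The world names, actions, payoffs and the `best_policy` helper are all hypothetical, chosen only to illustrate the constraint.

```python
from itertools import product

# Hypothetical setup: two equally likely iron compositions, two treatments.
WORLDS = ["low_carbon", "high_carbon"]
PRIOR = {"low_carbon": 0.5, "high_carbon": 0.5}
ACTIONS = ["quench", "anneal"]
PAYOFF = {
    ("low_carbon", "quench"): 1.0, ("low_carbon", "anneal"): 3.0,
    ("high_carbon", "quench"): 4.0, ("high_carbon", "anneal"): 1.0,
}

def best_policy(observation):
    """Choose an action in every world, maximizing expected payoff, subject to
    the information constraint: same observation implies same action."""
    best, best_value = None, float("-inf")
    for choice in product(ACTIONS, repeat=len(WORLDS)):
        policy = dict(zip(WORLDS, choice))
        feasible = all(policy[w1] == policy[w2]
                       for w1 in WORLDS for w2 in WORLDS
                       if observation[w1] == observation[w2])
        if not feasible:
            continue
        value = sum(PRIOR[w] * PAYOFF[(w, policy[w])] for w in WORLDS)
        if value > best_value:
            best, best_value = policy, value
    return best, best_value

# Without the paper, both worlds look identical to the smith.
no_info = {"low_carbon": "iron", "high_carbon": "iron"}
# With the paper, the symbol written on it differs across worlds.
with_paper = {"low_carbon": "symbol_A", "high_carbon": "symbol_B"}
```

Comparing `best_policy(no_info)` with `best_policy(with_paper)` shows the paper acting as a knowledge good: it relaxes the constraint, and the gain in expected payoff is what the smith should be willing to pay for it.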

Traps of Formalization in Deconfusion

Instead of capturing the intuitions present in our confused understanding, John proposes to start with one of the applications and only focus on formalizing the concept for this specific purpose. [...] A later step is to attempt unification of the different formalization for the many applications.

Important clarification here: in order for this to work well, it is necessary to try multiple different use-cases and then unify. This is not a "start with one and then maybe get around to others in the indefinite future" sort of thing. I generally do not expect to end up with a good deconfusion of a concept using fewer than 3 use-cases; think of it like attempting to triangulate a point on a map.

The point of looking at use-cases one-at-a-time to start is that certain useful frames or considerations will stand out in the context of a particular use-case; it's easier to recognize key constraints in context. Then, you want to try to carry over that frame or consideration or constraint to the other use-cases and see what it looks like in those contexts.

Refactoring Alignment (attempt #2)

For a while, I've thought that the strategy of "split the problem into a complete set of necessary sub-goals" is incomplete. It produces problem factorizations, but it's not sufficient to produce good problem factorizations - it usually won't cut reality at clean joints. That was my main concern with Evan's factorization, and it also applies to all of these, but I couldn't quite put my finger on what the problem was.

I think I can explain it now: when I say I want a factorization of alignment to "cut reality at the joints", I think what I mean is that each subproblem should involve only a few components of the system (ideally just one).

Inner/outer alignment discussion usually assumes that our setup follows roughly the structure of today's ML: there's a training process before deployment, there's a training algorithm with training data/environment and training objective, there's a trained architecture and initial parameters. These are the basic "components" which comprise our system. There's a lot of variety in the details, but these components are usually each good abstractions - i.e. there are nice, well-defined APIs/Markov blankets between each component.

Ideally, alignment should be broken into subproblems which each depend only on one or a few components. For instance, outer alignment would ideally be a property of only the training objective (though that requires pulling out the pointers problem part somehow). Inner alignment would ideally be a property of only the training architecture, initial parameters, and training algorithm (though that requires pulling out the Dr Nefarious issue somehow). Etc. The main thing we don't want here is a "factorization" where some subsystems are involved in all of the subproblems, or where there's heavy overlap.

Why is this useful? You could imagine that we want to divide the task of building an aligned AI up between several people. To avoid constant git conflicts, we want each group to design one or a few different subsystems, without too much overlap between them. Each of those subsystems is designed to solve a particular subproblem. Of course we probably won't actually have different teams like that, but there are analogous benefits: when different subsystems solve different subproblems, we can tackle those subproblems relatively independently, and solve one without creating problems for the others. That's the point of a problem factorization.
