“embedded self-justification,” or something like that

nostalgebraist

preamble

Sometimes I wonder what the MIRI-type crowd thinks about some issue related to their interests. So I go to alignmentforum.org, and quickly get in over my head, lost in a labyrinth of issues I only half understand.

I can never tell whether they’ve never thought about the things I’m thinking about, or whether they sped past them years ago. They do seem very smart, that’s for sure.

But if they have terms for what I’m thinking of, I lack the ability to find those terms among the twists of their mirrored hallways. So I go to tumblr.com, and just start typing.

parable (1/3)

You’re an “agent” trying to take good actions over time in a physical environment under resource constraints. You know, the usual.

You currently spend a lot of resources doing a particular computation involved in your decision procedure. Your best known algorithm for it is O(N^n) for some n.

You’ve worked on the design of decision algorithms before, and you think this one could perhaps be improved. But to find the improvement, you’d have to shift some resources away from running the algorithm for a time, putting them into decision algorithm design instead.

You do this. Almost immediately, you discover an O(N^(n-1)) algorithm. Given the large N you face, this will dramatically improve all your future decisions.
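
Once the faster algorithm is in hand, the payoff is simple arithmetic. A minimal sketch, assuming the big-O costs can be read as literal operation counts (constants ignored) and that the design detour had a known price; `break_even_decisions`, `design_cost`, and the numbers are made up for illustration:

```python
def break_even_decisions(N, n, design_cost):
    """Number of decisions before the design detour pays for itself."""
    old_cost = N ** n          # per-decision cost of the O(N^n) algorithm
    new_cost = N ** (n - 1)    # per-decision cost of the O(N^(n-1)) algorithm
    saving = old_cost - new_cost
    # Ceiling division: smallest whole number of decisions covering the cost
    return -(-design_cost // saving)


N, n = 1000, 3
speedup = N ** n // N ** (n - 1)
decisions = break_even_decisions(N, n, design_cost=10**12)
print(speedup)    # 1000
print(decisions)  # 1002
```

The catch is that `design_cost`, and the existence of the better algorithm at all, are only known afterward.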

Clearly (…“clearly”?), the choice to invest more in algorithm design was a good one.

Could you have anticipated this beforehand? Could you have acted on that knowledge?

parable (2/3)

Oh, you’re so very clever! By now you’ve realized you need, above and beyond your regular decision procedure to guide your actions in the outside world, a “meta-decision-procedure” to guide your own decision-procedure-improvement efforts.

Your meta-decision-procedure does require its own resource overhead, but in exchange it tells you when and where to spend resources on R&D. All your algorithms are faster now. Your decisions are better, their guiding approximations less lossy.

All this, from a meta-decision-procedure that’s only a first draft. You frown over the resource overhead it charges, and wonder whether it could be improved.

You try shifting some resources away from “regular decision procedure design” into “meta-decision-procedure-design.” Almost immediately, you come up with a faster and better procedure.

Could you have anticipated this beforehand? Could you have acted on that knowledge?

parable (3/3)

Oh, you’re so very clever! By now you’ve realized you need, above and beyond your meta-meta-meta-decision-procedure, a “meta-meta-meta-meta-decision-procedure” to guide your meta-meta-meta-decision-procedure-improvement efforts.

Way down on the object level, you have not moved for a very long time, except to occasionally update your meta-meta-meta-meta-rationality blog.

Way down on the object level, a dumb and fast predator eats you.

Could you have anticipated this beforehand? Could you have acted on that knowledge?

the boundary

You’re an “agent” trying to take good actions, et cetera. Your actions are guided by some sort of overall “model” of how things are.

There are, inevitably, two parts to your model: the interior and the boundary.

The interior is everything you treat as fair game for iterative and reflective improvement. For “optimization,” if you want to put it that way. Facts in the interior are subject to rational scrutiny; procedures in the interior have been judged and selected for their quality, using some further procedure.

The boundary is the outmost shell, where resource constraints force the regress to stop. Perhaps you have a target and an optimization procedure. If you haven’t tested the optimization procedure against alternatives, it’s in your boundary. If you have, but you haven’t tested your optimization-procedure-testing-procedure against alternatives, then it’s in your boundary. Et cetera.

You are a business. You do retrospectives on your projects. You’re so very clever, in fact, that you do retrospectives on your retrospective process, to improve it over time. But how do you improve these retro-retros? You don’t. They’re in your boundary.

Of everything you know and do, you trust the boundary the least. You have applied less scrutiny to it than anything else. You suspect it may be shamefully suboptimal, just like the previous boundary, before you pushed it into the interior.

embedded self-justification

You would like to look back on the resources you spend – each second, each joule – and say, “I spent it the right way.” You would like to say, “I have a theory of what it means to decide well, and I applied it, and so I decided well.”

Why did you spend it as you did, then? You cannot answer, ever, without your answer invoking something on the boundary.

How did you spend that second? On looking for a faster algorithm. Why? Because your R&D allocation procedure told you to. Why follow that procedure? Because it’s done better than others in the past. How do you know? Because you’ve compared it to others. Which others? Under what assumptions? Oh, your procedure-experimentation procedure told you. And how do you know it works? Eventually you come to the boundary, and throw up your hands: “I’m doing the best I can, okay!”

If you lived in a simple and transparent world, maybe you could just find the optimal policy once and for all. If you really were literally the bandit among the slot machines – and you knew this, perfectly, with credence 1 – maybe you could solve for the optimal explore/exploit behavior and then do it.
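
For a sense of what "solving" would look like in the toy case, here is a sketch under assumptions that are not in the text above: exactly two slot machines, each paying 0 or 1 with an unknown fixed probability, a uniform prior on each probability, and a known number of pulls remaining. The function name `optimal_value` is hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def optimal_value(pulls_left, s1=0, f1=0, s2=0, f2=0):
    """Expected total reward of the Bayes-optimal policy.

    (s, f) are the successes and failures observed so far on each arm.
    """
    if pulls_left == 0:
        return Fraction(0)
    # Posterior mean payout of each arm under a uniform prior
    p1 = Fraction(s1 + 1, s1 + f1 + 2)
    p2 = Fraction(s2 + 1, s2 + f2 + 2)
    # Value of pulling each arm now, then continuing optimally
    pull_1 = (p1 * (1 + optimal_value(pulls_left - 1, s1 + 1, f1, s2, f2))
              + (1 - p1) * optimal_value(pulls_left - 1, s1, f1 + 1, s2, f2))
    pull_2 = (p2 * (1 + optimal_value(pulls_left - 1, s1, f1, s2 + 1, f2))
              + (1 - p2) * optimal_value(pulls_left - 1, s1, f1, s2, f2 + 1))
    return max(pull_1, pull_2)


print(optimal_value(1))  # 1/2
print(optimal_value(2))  # 13/12
```

Even here, the table of belief states grows quickly with the number of pulls and arms.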

But your world isn’t like that. You know this, and know that you know it. Even if you could obtain a perfect model of your world and beings like you, you wouldn’t be able to fit it inside your own head, much less run it fast enough to be useful. (If you had a magic amulet, you might be able to fit yourself inside your own head, but you live in reality.)

Instead, you have detailed pictures of specific fragments of the world, in the interior and subject to continuous refinement. And then you have pictures of the picture-making process, and so on. As you go further out, the pictures get coarser and simpler, because their domain of description becomes ever vaster, while your resources remain finite, and you must nourish each level with a portion of those resources before the level above it even becomes thinkable.

At the end, at the boundary, you have the coarsest picture, a sort of cartoon. There is a smiling stick figure, perhaps wearing a lab coat to indicate scientific-rational values. It reaches for the lever of a slot machine, labeled “action,” while peering into a sketch of an oscilloscope, labeled “observations.” A single arrow curls around, pointing from the diagram back into the diagram. It is labeled “optimization,” and decorated with cute little sparkles and hearts, to convey its wonderfulness. The margins of the page are littered with equations, describing the littlest of toy models: bandit problems, Dutch book scenarios, Nash equilibria under perfect information.

In the interior, there are much richer, more beautiful pictures that are otherwise a lot like this one. In the interior, meta-learning algorithms buzz away on a GPU, using the latest and greatest procedures for finding procedures, justified in precise terms in your latest paper. You gesture at a whiteboard as you prioritize options for improving the algorithms. Your prioritization framework has gone through rigorous testing.

Why, in the end, do you do all of it? Because you are the little stick figure in the lab coat.

coda

What am I trying to get at, here?

Occasionally people talk about the relevance of computational complexity issues to AI and its limits. Gwern has a good page on why these concerns can’t place useful bounds on the potential of machine intelligence in the way people sometimes argue they do.

Yet, somehow I feel an unscratched itch when I read arguments like Gwern’s there. They answer the question I think I’m asking when I seek them out, but at the end I feel like I really meant to ask some other question instead.

Given computational constraints, how “superhuman” could an AI be? Well, it could just do what we do, but sped up – that is, it could have the same resource efficiency but more resources per unit time. That’s enough to be scary. It could also find more efficient algorithms and procedures, just as we do in our own research – but it would find them ever faster, more efficiently.

What remains unanswered, though, is whether there is any useful way of talking about doing this (the whole thing, including the self-improvement R&D) well, doing it rationally, as opposed to doing it in a way that simply “seems to work” after the fact.

How would an AI’s own policy for investment in self-improvement compare to our own (to yours, to your society’s)? Could we look at it and say, “this is better”? Could the AI do so? Is there anything better than simply bumbling around in concept-space, in a manner that perhaps has many internal structures of self-justification but is not known to work as a whole? Is there such a thing as (approximate) knowledge about the right way to do all of it that is still small enough to fit inside the agent on which it passes judgment?

Can you represent your overall policy, your outermost strategy-over-strategies considered a response to your entire situation, in a way that is not a cartoon, a way real enough to defend itself?

What is really known about the best way to spend the next unit of resources? I mean, known at the level of the resource-spenders, not as a matter of external judgment? Can anything definite be said about the topic in general except “it is possible to do better or worse, and it is probably possible to do better than we do now?” If not, what standard of rationality do we have left to apply beyond toy models, to ourselves or our successors?

It seems to me that there are roughly two types of "boundary" to think about: ceilings and floors.

Floors are aka the foundations. Maybe a system is running on a basically Bayesian framework, or (alternately) logical induction. Maybe there are some axioms, like ZFC. Going meta on floors involves the kind of self-reference stuff which you hear about most often: Gödel's theorem and so on. Floors are, basically, pretty hard to question and improve (though not impossible).
Ceilings are fast heuristics. You have all kinds of sophisticated beliefs in the interior, but there's a question of which inferences you immediately make, without doing any meta to consider what direction to think in. (IE, you do generally do some meta to think about what direction to think in; but, this "tops out" at some level, at which point the analysis has to proceed without meta.) Ceilings are relatively easy to improve. For example, the AlphaGo move proposal network and evaluation network (if I recall the terms correctly). These have cheap updates which can be made frequently, via observing the results of reasoning. These incremental updates then help the more expensive tree-search reasoning to be even better.

Both floors and ceilings have a flavor of "the basic stuff that's actually happening" -- the interior is built out of a lot of boundary stuff, and small changes to boundary will create large shifts in interior. However, floors and ceilings are very different. Tweaking floor is relatively dangerous, while tweaking ceiling is relatively safe. Returning to the AlphaGo analogy, the floor is like the model of the game which allows tree search. The floor is what allows us to create a ceiling. Tweaks to the floor will tend to create large shifts in the ceiling; tweaks to the ceiling will not change the floor at all.

(Perhaps other examples won't have as clear a floor/ceiling division as AlphaGo; or, perhaps they still will.)

What remains unanswered, though, is whether there is any useful way of talking about doing this (the whole thing, including the self-improvement R&D) well, doing it rationally, as opposed to doing it in a way that simply “seems to work” after the fact.

[...] Is there anything better than simply bumbling around in concept-space, in a manner that perhaps has many internal structures of self-justification but is not known to work as a whole? [...]

Can you represent your overall policy, your outermost strategy-over-strategies considered a response to your entire situation, in a way that is not a cartoon, a way real enough to defend itself?

My intuition is that the situation differs, somewhat, for floors and ceilings.

For floors, there are fundamental logical-paradox-flavored barriers. This relates to MIRI research on tiling agents.
For ceilings, there are computational-complexity-flavored barriers. You don't expect to have a perfect set of heuristics for fast thinking. But, you can have strategies relating to heuristics which have universal-ish properties. Like, logical induction is an "uppermost ceiling" (takes the fixed point of recursive meta) such that, in some sense, you know you're doing the best you can do in terms of tracking which heuristics are useful; you don't have to spawn further meta-analysis on your heuristic-forming heuristics. HOWEVER, it is also very very slow and impractical for building real agents. It's the agent that gets eaten in your parable. So, there's more to be said with respect to ceilings as they exist in reality.

Thanks, the floor/ceiling distinction is helpful.

I think "ceilings as they exist in reality" is my main interest in this post. Specifically, I'm interested in the following:

any resource-bound agent will have ceilings, so an account of embedded rationality needs a "theory of having good ceilings"
a "theory of having good ceilings" would be different from the sorts of "theories" we're used to thinking about, involving practical concerns at the fundamental desiderata level rather than as a matter of implementing an ideal after it's been specified

In more detail: it's one thing to be able to assess quick heuristics, and it's another (and better) one to be able to assess quick heuristics quickly. It's possible (maybe) to imagine a convenient situation where the theory of each "speed class" among fast decisions is compressible enough to distill down to something which can be run in that speed class and still provide useful guidance. In this case there's a possibility for the theory to tell us why our behavior as a whole is justified, by explaining how our choices are "about as good as can be hoped for" during necessarily fast/simple activity that can't possibly meet our more powerful and familiar notions of decision rationality.

However, if we can't do this, it seems like we face an exploding backlog of justification needs: every application of a fast heuristic now requires a slow justification pass, but we're constantly applying fast heuristics and there's no room for the slow pass to catch up. So maybe a stronger agent could justify what we do, but we couldn't.
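
The backlog worry can be put in toy form. This is only a sketch, assuming a fixed number of fast heuristic applications per time step and a fixed review capacity for the slow pass; `justification_backlog` and its parameters are hypothetical names, not part of any existing theory.

```python
def justification_backlog(steps, fast_per_step, slow_per_step):
    """Track unjustified heuristic applications over time.

    fast_per_step: heuristic applications made each time step
    slow_per_step: applications the slow justification pass can review
                   each time step
    """
    backlog = 0
    history = []
    for _ in range(steps):
        backlog += fast_per_step                # new decisions needing review
        backlog -= min(backlog, slow_per_step)  # slow pass clears what it can
        history.append(backlog)
    return history


print(justification_backlog(5, fast_per_step=10, slow_per_step=1))
# [9, 18, 27, 36, 45]
print(justification_backlog(5, fast_per_step=2, slow_per_step=3))
# [0, 0, 0, 0, 0]
```

In this toy model the backlog grows without bound whenever the fast rate exceeds the slow rate, and stays empty only when justification is at least as fast as the decisions it covers.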

I expect helpful theories here to involve distilling-into-fast-enough-rules on a fundamental level, so that "an impractically slow but working version of the theory" is actually a contradiction in terms.

The way I understand your division of floors and ceilings, the ceiling is simply the highest-level meta there is, and the agent *typically* has no way of questioning it. The ceiling is just "what the algorithm is programmed to do". AlphaGo is programmed to update the network weights in a certain way in response to the training data.

What you call the floor for AlphaGo, i.e. the move evaluations, is not even a boundary (in the sense nostalgebraist defines it); that would just be the object-level (no meta at all) policy.

I think this structure will be the same for any known agent algorithm, where by "known" I mean "we know how it works", rather than "we know that it exists". However, humans seem to be different? When I try to introspect, it all seems to be mixed up, with object-level heuristics influencing meta-level updates. The ceiling and the floor are all mixed together. Or maybe not? Maybe we are just the same, i.e. having a definite, hard-coded, highest level of meta. Some evidence of this is that sometimes I just notice emotional shifts and/or decisions being made in my brain, and I just know that no normal reasoning I can do will have any effect on this shift/decision.

What you call the floor for AlphaGo, i.e. the move evaluations, is not even a boundary (in the sense nostalgebraist defines it); that would just be the object-level (no meta at all) policy.

I think in general the idea of the object level policy with no meta isn't well-defined, if the agent at least does a little meta all the time. In AlphaGo, it works fine to shut off the meta; but you could imagine a system where shutting off the meta would put it in such an abnormal state (like it's on drugs) that the observed behavior wouldn't mean very much in terms of its usual operation. Maybe this is the point you are making about humans not having a good floor/ceiling distinction.

But, I think we can conceive of the "floor" more generally. If the ceiling is the fixed structure, e.g. the update for the weights, the "floor" is the lowest-level content -- e.g. the weights themselves. Whether thinking at some meta-level or not, these weights determine the fast heuristics by which a system reasons.

I still think some of what nostalgebraist said about boundaries seems more like the floor than the ceiling.

The space "between" the floor and the ceiling involves constructed meta levels, which are larger computations (ie not just a single application of a heuristic function), but which are not fixed. This way we can think of the floor/ceiling spectrum as small-to-large: the floor is what happens in a very small amount of time; the ceiling is the whole entire process of the algorithm (learning and interacting with the world); the "interior" is anything in-between.

Of course, this makes it sort of trivial, in that you could apply the concept to anything at all. But the main interesting thing is how an agent's subjective experience seems to interact with floors and ceilings. IE, we can't access floors very well because they happen "too quickly", and besides, they're the thing that we do everything with (it's difficult to imagine what it would mean for a consciousness to have subjective "access to" its neurons/transistors). But we can observe the consequences very immediately, and reflect on that. And the fast operations can be adjusted relatively easily (e.g. updating neural weights). Intermediate-sized computational phenomena can be reasoned about, and accessed interactively, "from the outside" by the rest of the system. But the whole computation can be "reasoned about but not updated" in a sense, and becomes difficult to observe again (not "from the outside" the way smaller sub-computations can be observed).

Just a quick note: Sometimes there is a way out of this kind of infinite regress by implementing an algorithm that approximates the limit. Of course, you can also be put back into an infinite regress by asking if there is a better approximation.

A lot of what you write here seems related to my notion of Turing Reinforcement Learning. In Turing RL we consider an AI consisting of a "core" RL agent and an "envelope", which is a computer on which the core can run programs (somewhat similarly to neural Turing machines). From the point of view of the core, the envelope is a component of its environment (in addition to its usual I/O), about which it has somewhat stronger priors than about the rest. Such a system learns how to make optimal use of the envelope's computing resources. Your "boundary" corresponds to the core, which is the immutable part of the algorithm that produces everything else. Regarding the "justification" of why a particular core algorithm is correct, the justification should come from regret bounds we prove about this algorithm w.r.t. some prior over incomplete models. Incomplete models are the solution to "even if you could obtain a perfect model of your world and beings like you, you wouldn’t be able to fit it inside your own head". Instead of obtaining a perfect model, the agent learns all patterns (incomplete models) in the world that it can fit into its head, and exploits these patterns for gain. More precisely, in Turing RL the agent starts with some small class of patterns that the core can fit into its head, and bootstraps from those to a larger class of patterns, accounting for a cost-benefit analysis of resource use. This way, the regret bound satisfied by the core algorithm should lead to even stronger guarantees for the system as a whole (for example this).

I think that this infinite regress can be converted into a loop. Consider an infinite sequence of layers, in which the job of layer $n + 1$ is to optimise layer $n$. Each layer is a piece of program code. After the first couple of layers, these layers will start to look very similar. You could have layer 3 being able to optimise both layer 2 and layer 3.

One model is that your robot just sits and thinks for an hour. At the end of that hour, it designs what it thinks is the best code it can come up with, and runs that. To the original AI, anything outside the original hour is external; it is answering the question "what pattern of bits on this hard disk will lead to the best outcome?" It can take all these balances and tradeoffs into account in whatever way it likes. If it hasn't come up with any good ideas yet, it could copy its code, add a crude heuristic that makes it run around randomly while thinking (to avoid the predators), and think for longer.
