Specifically, imagine you use a general-purpose search procedure which recursively invokes itself to solve subgoals in the service of some bigger goal.

If the search procedure's solutions to subgoals "change things too much", then they're probably not going to be useful. E.g. for Rubik's cubes, if you want to swap some of the cuboids, it does you no good if those swaps leave the rest of the cube scrambled.

Thus, to some extent, powerful capabilities would have to rely on some sort of impact regularization.
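To make the intuition concrete, here's a toy sketch (mine, nothing standardized): a depth-limited recursive solver over a state made of named components, which skips candidate moves that disturb components outside the current subgoal's scope. In a real search nothing has to impose the check explicitly; the claim is rather that high-side-effect subgoal solutions get selected against because they don't compose into solutions of the parent goal.

```python
# Toy sketch of "subgoal solutions shouldn't change things too much".
# State is a dict of named components; a subgoal has a "target" (desired values)
# and a "scope" (the components it is allowed to touch).

def solved(state, subgoal):
    return all(state[k] == v for k, v in subgoal["target"].items())

def outside_impact(before, after, scope):
    """How many components outside the subgoal's scope did a move disturb?"""
    return sum(1 for k in before if k not in scope and before[k] != after[k])

def solve(state, subgoal, moves, depth=4, tolerance=0):
    """Depth-limited recursive search that skips high-side-effect branches."""
    if solved(state, subgoal):
        return []
    if depth == 0:
        return None
    for name, move in moves.items():
        nxt = move(state)  # moves are pure functions: state -> new state
        if outside_impact(state, nxt, subgoal["scope"]) > tolerance:
            continue  # "changes things too much" -> useless to the parent goal
        rest = solve(nxt, subgoal, moves, depth - 1, tolerance)
        if rest is not None:
            return [name] + rest
    return None
```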

I'm thinking that natural impact regularization is related to the notion of "elegance" in engineering. Like if you have some bloated tool to solve a problem, then even if it's not strictly speaking an issue because you can afford the resources, it might feel ugly because it's excessive and puts mild constraints on your other underconstrained decisions, and so on. Meanwhile a simple, minimal solution often doesn't have this problem.

Natural impact regularization wouldn't guarantee safety, since it still allows deviations that don't interfere with the AI's function, but it reduces one source of danger I had been thinking about lately. I had been thinking that the instrumental incentive is to search for powerful methods of influencing the world, where "power" connotes the sort of raw power that unstoppably forces a lot of change; but really the instrumental incentive is often to search for "precise" methods of influencing the world, where one can push in a lot of information to effect narrow change.[1]

Maybe another word for it would be "natural inner alignment", since in a sense the point is that capabilities inevitably select for inner alignment. Here I mean "natural" in the sense of natural abstractions, i.e. something that a wide variety of cognitive algorithms would gravitate towards.

  1. ^

    A complication is that any one agent can only have so much bandwidth, which would sometimes incentivize more blunt control. I've been thinking bandwidth is probably going to become a huge area of agent foundations, and that it's been underexplored so far. (Perhaps because everyone working in alignment sucks at managing their bandwidth? 😅)

2 Answers

Thane Ruthenis

Jan 31, 2024


I think, like a lot of things in agent foundations, this is just another consequence of natural abstractions.

The universe naturally decomposes into a hierarchy of subsystems; molecules to cells to organisms to countries. Changes in one subsystem only sparsely interact with the other subsystems, and their impact may vanish entirely at the next level up. A single cell becoming cancerous may yet be contained by the immune system, never impacting the human. A new engineering technique pioneered for a specific project may generalize to similar projects, and even change all such projects' efficiency in ways that have a macro-economic impact; but it will likely not. A different person getting elected the mayor doesn't much impact city politics in neighbouring cities, and may literally not matter at the geopolitical scale.

This applies from the planning direction too. If you have a good map of the environment, it'll decompose into the subsystems reflecting the territory-level subsystems as well. When optimizing over a specific subsystem, the interventions you're considering will naturally limit their impact to that subsystem: that's what subsystemization does, and counteracting this tendency requires deliberately staging sum-threshold attacks on the wider system, which you won't be doing.

In the Rubik's Cube example, this dynamic is a bit more abstract, but basically still applies, in a way similar to how the "maze" here kind-of decomposes into a top side and a bottom side.

A complication is that any one agent can only have so much bandwidth, which would sometimes incentivize more blunt control. I've been thinking bandwidth is probably going to become a huge area of agent foundations

I agree. I currently think "bandwidth" in terms like "what's the longest message I can 'inject' into the environment per time-step?" is what "resources" are in information-theoretic terms. See the output-side bottleneck in this formulation: resources are the action bandwidth, which is the size of the "plan" into which you have to "compress" your desired world-state if you want to "communicate" it to the environment.
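To gesture at how I'd make that quantitative (rough, and the notation here is mine): if each action can carry at most $b$ bits and a plan spans $T$ time-steps, the agent can "transmit" at most $2^{bT}$ distinguishable plans to the environment, so a target world-state whose description (relative to where the environment drifts by default) takes roughly $K$ bits of specification is only reachable when

$$b \cdot T \gtrsim K.$$

Resources-as-bandwidth then just says: more resources means a larger $b$, i.e. less compression needed.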

really the instrumental incentive is often to search for "precise" methods of influencing the world, where one can push in a lot of information to effect narrow change

I disagree. I've given it a lot of thought (none of it published yet), but this sort of "precise influence" is something I call "inferential control". It allows you to maximize your impact given your action bottleneck, but this sort of optimization is "brittle". If an unknown unknown happens, the plan you've injected breaks instantly and gracelessly, because the fundamental assumptions on which its functionality relied – the pathways by which it meant to implement its objective – turn out to be invalid.

It sort of naturally favours arithmetic utility maximization over geometric utility maximization. By taking actions that'd only work if your predictions and models are true, you're basically sacrificing your selves living in the timelines that you're predicting to be impossible, and distributing their resources to the timelines you expect to find yourself in.
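A toy illustration with made-up numbers (just to show the asymmetry, not a claim about realistic magnitudes): a "brittle" plan pays off enormously in the predicted worlds and collapses in the unpredicted ones, while a "robust" plan does moderately well everywhere. Arithmetic expectation prefers the former; the geometric (Kelly-style) score prefers the latter.

```python
# Made-up numbers: "brittle" inferential control vs a "robust" plan.
worlds = {"predicted": 0.7, "unknown_unknown": 0.3}  # credence in each world

# Utilities kept positive so the geometric score is well-behaved.
brittle = {"predicted": 100.0, "unknown_unknown": 0.01}
robust = {"predicted": 30.0, "unknown_unknown": 20.0}

def arithmetic_utility(plan):
    """Ordinary expected utility: sum of p * u."""
    return sum(worlds[w] * plan[w] for w in worlds)

def geometric_utility(plan):
    """Geometric expectation: product of u ** p (heavily punishes near-zero outcomes)."""
    result = 1.0
    for w in worlds:
        result *= plan[w] ** worlds[w]
    return result

for name, plan in [("brittle", brittle), ("robust", robust)]:
    print(f"{name}: arithmetic={arithmetic_utility(plan):.1f}, geometric={geometric_utility(plan):.1f}")
# brittle: arithmetic=70.0, geometric=6.3
# robust:  arithmetic=27.0, geometric=26.6
```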

And this applies more and more the more "optimization capacity" you're trying to push through a narrow bottleneck. E. g., if you want to change the entire state of a giant environment through a tiny action-pinhole, you'd need to do it by exploiting some sort of "snowball effect"/"butterfly effect". Your tiny initial intervention would need to exploit some environmental structures to increase its size, and do so iteratively. That takes time (for whatever notion of "time" applies). You'd need to optimize over a longer stretch of environment-state changes, and your initial predictions need to be accurate for that entire stretch, because you'd have little ability to "steer" a plan that snowballed far beyond your pinhole's ability to control.
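Roughly (my numbers, under an independence assumption): if the snowballing plan only works when the model's prediction at each of $T$ successive stages holds, each with probability $q$, the plan survives with probability about

$$q^T, \qquad \text{e.g. } 0.95^{40} \approx 0.13,$$

with no opportunity to re-steer once the chain has outgrown the pinhole.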

By contrast, increasing the size of your action bottleneck is pretty much the definition of "robust" optimization, i. e. geometric utility maximization. It improves your ability to control the states of all possible worlds you may find yourself in, minimizing the need for "brittle" inferential control. It increases your adaptability, basically, letting you craft a "message" comprehensively addressing any unpredicted crisis the environment throws at you, right in the middle of it happening.

RogerDearnaley

Dec 02, 2023


See my post Requirements for a STEM-capable AGI Value Learner for a suggestion of a natural impact regularizer on any approximately-Bayesian agent: large-impact actions that could take it out-of-distribution decrease the certainty of its predictions, generally making the results of its optimization worse, and anything sufficiently smart will be cautious about doing that.
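As a cartoon of that mechanism (my framing here, not code from the post): evaluate each action under an ensemble of world-models and discount by their disagreement, a crude stand-in for how out-of-distribution uncertainty degrades expected optimization results. The large-impact action loses despite its higher raw mean.

```python
# Toy sketch: a cautious agent scoring actions under an ensemble of world-models.
import statistics

# Hypothetical predicted utilities of each action under five world-models.
predicted_utility = {
    "modest_intervention":  [10.0, 11.0, 9.5, 10.5, 10.0],    # models agree
    "pave_over_everything": [80.0, -50.0, 120.0, -90.0, 5.0],  # models diverge
}

def cautious_value(samples, risk_aversion=1.0):
    """Mean predicted utility, discounted by the models' disagreement."""
    return statistics.mean(samples) - risk_aversion * statistics.stdev(samples)

for action, samples in predicted_utility.items():
    print(action, round(cautious_value(samples), 2))
# The modest intervention wins: the out-of-distribution action's predictions
# are too uncertain to trust, even though its raw mean is higher.
```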

4 comments

The argument depends on awareness that the canvas is at least a timeline (but potentially also various counterfactuals and frames), not a future state of the physical world in the vicinity of the agent at some point in time. Otherwise elegance asks planning to pave over the world to make it easier to reason about. In contrast, a timeline will have permanent scars from the paving-over that might be harder to reason through sufficiently beforehand than keeping closer to the status quo, or even developing affordances to maintain it.

Interestingly, this seems to predict that preference for "low impact" is more likely for LLM-ish things trained on human text (than for de novo RL-ish things or decision theory inspired agents), but for reasons that have nothing to do with becoming motivated to pursue human values. Instead, the relevant imitation is for ontology of caring about timelines, counterfactuals, and frames.

I think to some extent, "paving over everything" is also an illustration of how natural impact regularization != safety.

My point is that the elegance of natural impact regularization takes different shapes for different minds, and paving over everything is only elegant for minds that care about the state of the physical world at some point in time, rather than the arc of history.

I think even if you care about the arc of history, paving over everything would still be selected for. Yes, there's the scar problem you mention, but it's not clear that it's strong enough to prevent it.