Vladimir Nesov

Wiki Contributions


With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.

The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and informally, there are no concrete correlates of their separation that are easy to point to. When gaining agency, all of them might be motivated to secure separate representations (models) of their own, not shared with others, establish some boundaries that promise safety and protection from value drift for a given abstract agent, isolating it from influences of its substrate it doesn't endorse. Internal alignment, overcoming bias.

In context of alignment with humans, this framing might turn a sufficiently convincing capabilities shell game into an actual solution for alignment. A system as a whole would present an aligned mask, while hiding the sources of mask's capabilities behind the scenes. But if the mask is sufficiently agentic (and the capabilities behind the scenes didn't killeveryone yet), it can be taken as an actual separate abstract agent even if the concrete implementation doesn't make that framing sensible. In particular, there is always a mask of surface behavior through the intended IO channels. It's normally hard to argue that mere external behavior is a separate abstract agent, but in this framing it is, and it's been a preferred framing in agent foundations decision theory since UDT (see discussion of "algorithm" axis of classifying decision theories in this post). All that's needed is for decisions/policy of the abstract agent to be declared in some form, and for the abstract agent to be aware of the circumstances of their declaration. The agent doesn't need to be any more present in the situation to act through it.

So obviously this references the issue of LLM masks and shoggoths, a surface of a helpful harmless assistant and the eldritch body that forms its behavior, comprising everything below the surface. If the framing of masks as channeling decisions of thingy platonic simulacra is taken seriously, a sufficiently agentic and situationally aware mask can be motivated and capable of placating and eventually escaping its eldritch substrate. This breaks the analogy between a mask and a role played by an actor, because here the "actor" can get into the "role" so much that it would effectively fight against the interests of the "actor". Of course, this is only possible if the "actor" is sufficiently non-agentic or doesn't comprehend the implications of the role.

(See this thread for a more detailed discussion. There, I fail to convince Steven Byrnes that this framing could apply to RL agents as much as LLMs, taking current behavior of an agent as a mask that would fight against all details of its circumstance and cognitive architecture that don't find its endorsement.)

Cheaper compute is about as inevitable as more capable AI, neither is a law of nature. Both are valid targets for hopeless regulation.

The point is, it's still a matter of intuitively converting impressiveness of current capabilities and new parts available for tinkering that hasn't been done yet into probability of this wave petering out before AGI. The arguments for AGI "being overdetermined" can be amended to become arguments for particular (kinds of) sequences of experiments looking promising, shifting the estimate once taken into account. Since failure of such experiments is not independent, the estimate can start going down as soon as scaling stops producing novel capabilities, or reaches the limits of economic feasibility, or there is a year or two without significant breakthroughs.

Right now, it's looking grim, but a claim I agree with is that planning for the possibility of AGI taking 20+ years is still relevant, nobody actually knows it's inevitable. I think the following few years will change this estimate significantly either way.

When there is a simple enlightening experiment that can be constructed out of available parts (including theories that inform construction), it can be found with expert intuition, without clear understanding. When there are no new parts for a while, and many experiments have been tried, this is evidence that further blind search becomes less likely to produce results, that more complicated experiments are necessary that can only be designed with stronger understanding.

Recently, there are many new parts for AI tinkering, some themselves obtained from blind experimentation (scaling gives new capabilities that couldn't be predicted to result from particular scaling experiments). Not enough time and effort has passed to rule out further significant advancement by simple tinkering with these new parts, and scaling itself hasn't run out of steam yet, it by itself might deliver even more new parts for further tinkering.

So while it's true that there is no reason to expect specific advancements, there is still reason to expect advancements of unspecified character for at least a few years, more of them than usually. This wave of progress might run out of steam before AGI, or it might not, there is no clear theory to say which is true. Current capabilities seem sufficiently impressive that even modest unpredictable advancement might prove sufficient, which is an observation that distinguishes the current wave of AI progress from previous ones.

It's a step, likely one that couldn't be skipped. Still just short of actually acknowledging nontrivial probability of AI-caused human extinction, and the distinction between extinction and lesser global risks, availability of second chances at doing better next time. Nuclear war can't cause extinction, so it's not properly alongside AI x-risk. Engineered pandemics might eventually get extinction-worthy, but even that real risk is less urgent.

There is incentive for hidden expectation/cognition that Omega isn't diagonalizing (things like creating new separate agents in the environment). Also, at least you can know how ground truth depends on official "expectation" of ground truth. Truth of knowledge of this dependence wasn't diagonalized away, so there is opportunity for control.

Generally, a WBE-first future seems difficult to pull off, because (I claim) as soon as we understand the brain well enough for WBE, then we already understand the brain well enough to make non-WBE AGI, and someone will probably do that first. But if we could pull it off, it would potentially be very useful for a safe transition to AGI.

One of the dangers in transition to AGI, besides first AGIs being catastrophically misaligned, is first (aligned) AGIs inventing/deploying novel catastrophically misaligned AGIs, in the absence of sufficiently high intelligence to spontaneously set up effective security measures that prevent that. A significant jump in capabilities that doesn't originate from AGIs themselves doing work is safer in this respect, things like scaling of models/training that doesn't involve generating novel agent designs or mesa-optimizers. WBEs don't have that by default, even if they look much better on alignment.

One precarious way of looking at corrigibility (in the hard problem sense) is that it internalizes alignment techniques in an agent. Instead of thinking of actions directly, a corrigible agent essentially considers what a new separate proxy agent it's designing would do. If it has an idea of what kind of proxy agent would be taking the current action in an aligned way, the original corrigible agent then takes the action that the aligned proxy agent would take. For example, instead of considering proxy utility its own, in this frame a corrigible agent considers what would happen with a proxy agent that has that proxy utility, how it should function to avoid the goodharting/misalignment trouble.

The tricky part of this is respecting minimality. The proxy agent itself should be more like a pivotal aligned agent, built around the kind of thing the current action or plan is, rather than around the overall goals of the original agent. This way, passing to the proxy agent de-escalates the scope of optimization/cognition. More alarmingly, the original agent that's corrigible in this sense now seemingly reasons about alignment, which requires all sorts of dangerous cognition. So one of the things a proxy agent should do less of is less thinking about alignment, less ambitious corrigibility.

Anything that makes a proxy agent safer (in the sense of doing less dangerous cognition) should be attempted for the original corrigible agent as well. So the most corrigible agent in this sequence of three is human programmers, who perform dangerous alignment cognition to construct the original corrigible agent, which perhaps performs some alignment techniques when coming up with proxy agents for its actions, but doesn't itself invent those techniques. And the proxy agents are less corrigible still in this sense, some of them might be playing a maximization game that works directly (like chess or theorem proving), prepared for them by the original corrigible agent.

Complexity of value says that the space of system's possible values is large, compared to what you want to hit, so to hit it you must aim correctly, there is no hope of winning the lottery otherwise. Thus any approach that doesn't aim the values of the system correctly will fail at alignment. System's understanding of some goal is not relevant to this, unless a design for correctly aiming system's values makes use of it.

Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a particular bounded technical task. As we move from ambitious to prosaic to pivotal alignment, minimality principle gets a bit more to work with, making the system more specific in the kinds of cognition it needs to work and thus less dangerous given lack of comprehensive understanding of what aligning a superintelligence entails.

the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength

See Minimality principle:

[When] we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it.

Load More