Core claim: Misaligned subagents are very unlikely to arise in a classification algorithm unless that algorithm is directly or indirectly (e.g. in a subtask) modeling interactions through time at a significant level of complexity.

Definition 1: Agent - a function from inputs and internal state (or memory) to an output / action and a new internal state. Note that this includes things that would not usually be considered "agents" - e.g. plants or bacteria. Also note that not all "agents" of this type have consistent (or even coherent) "goals".
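As a rough sketch of Definition 1 (my own notation; the names `Agent`, `as_memoryless_agent`, and the type variables are illustrative, not from any standard library), the definition is essentially just a type signature:

```python
from typing import Callable, Tuple, TypeVar

Obs = TypeVar("Obs")      # input / observation
Act = TypeVar("Act")      # output / action
State = TypeVar("State")  # internal state or "memory"

# Definition 1 as a type: an agent maps (input, current internal state)
# to (output / action, new internal state).
Agent = Callable[[Obs, State], Tuple[Act, State]]


# A memoryless function is the degenerate case: the state is trivial (None is
# passed back unchanged) and the output depends only on the current observation.
def as_memoryless_agent(classify: Callable[[Obs], Act]) -> Callable[[Obs, None], Tuple[Act, None]]:
    def step(observation: Obs, state: None) -> Tuple[Act, None]:
        return classify(observation), None
    return step
```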

This definition of agent might be considered too broad; the reason I have decided to use it is that I believe it covers basically everything that could be dangerous - if an AI is not an agent under this definition, then I think it is extremely likely that this AI would be safe.

Definition 2: A function that was selected by an optimization procedure has a misaligned subagent if it spawns a subprocess that is an agent whose "goals" are different from (and potentially in conflict with) the optimization criteria.

Example: Consider an optimization process that selects for functions that accurately predict human actions, and assume that this optimization process finds a function that does this prediction by creating extremely accurate simulations of humans. These simulations would be misaligned subagents, since humans are agents and the goals of the simulations would likely be very different from "predict human actions accurately".
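A toy sketch of the structure this example describes (everything below is hypothetical placeholder code, written only to show the shape of a classifier that contains an agent):

```python
SIMULATION_STEPS = 10  # hypothetical: however long the internal model runs


def init_human_sim(situation):
    """Placeholder for the internal state of the learned human model."""
    return {"situation": situation, "plan": None}


def human_sim_step(observation, state):
    """Placeholder for one step of the simulation. Note the shape: it is an agent
    in the sense of Definition 1 ((input, state) -> (action, new state))."""
    action = "wait" if state["plan"] is None else state["plan"]
    return action, dict(state, plan="act")


def predict_human_action(situation):
    """The 'classifier' the optimizer found: it answers the prediction task by
    running an internal agent (the human simulation) and reading off its action."""
    state = init_human_sim(situation)
    action = None
    for _ in range(SIMULATION_STEPS):
        action, state = human_sim_step(situation, state)
    # The simulation's "goals" (whatever the simulated human wants) are not
    # "predict human actions accurately"; that mismatch is what makes the
    # simulation a misaligned subagent under Definition 2.
    return action
```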

For brevity, let us abbreviate classifiers with misaligned subagents as CWMS. Note that I might use "classifier" a bit more broadly than its strict definition - for example, I may call certain more general question-answering machines "classifiers". I do not believe this directly affects the general argument.

Claim 1: An agent will "perform" "better" than a memoryless function given the same sequence of inputs only if (almost) every input is highly correlated with the previous input. To phrase this in a different way, having a "memory" only helps if your "memory" gives good evidence for either what is going to happen next or what to do next.
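A minimal numerical illustration of Claim 1 (a toy example of my own, not from the post): at each step the task is to predict the next element of a stream before seeing it. A one-step memory helps a lot on a random walk, where consecutive inputs are highly correlated, and hurts on i.i.d. noise, where they are not.

```python
import random

random.seed(0)
N = 10_000

# Highly correlated stream: a random walk (each value is close to the previous one).
walk = [0.0]
for _ in range(N):
    walk.append(walk[-1] + random.gauss(0, 1))

# Uncorrelated stream: i.i.d. Gaussian noise.
iid = [random.gauss(0, 1) for _ in range(N + 1)]


def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


for name, xs in [("random walk", walk), ("i.i.d. noise", iid)]:
    targets = xs[1:]                   # the next element to be predicted at each step
    memoryless = [0.0] * len(targets)  # no memory: forced to guess a fixed constant
    with_memory = xs[:-1]              # one step of memory: guess the previous element
    print(f"{name}: memoryless MSE = {mse(memoryless, targets):.2f}, "
          f"with memory MSE = {mse(with_memory, targets):.2f}")
```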

Claim / Assumption 2: For most optimization processes powerful enough to find a classifier that performs well, the probability that the classifier the process finds is a CWMS is roughly the proportion of well-performing classifiers for that task that are CWMS - that is, the process is not strongly biased toward or away from classifiers with misaligned subagents.

There might be certain optimization processes that lean more towards CWMS (or toward classifiers without misaligned subagents), but I think this is a reasonable base assumption given our current level of information.
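Stated slightly more formally (my own phrasing of the assumption, not notation from the post), Claim 2 is the base-rate assumption:

$$P(\text{the selected classifier is a CWMS}) \;\approx\; \frac{\#\{\text{well-performing classifiers that are CWMS}\}}{\#\{\text{well-performing classifiers}\}}$$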

Claim 3: For a given task, if most good classifiers for that task are CWMS, then there exist one or more subtasks that involve processing highly correlated inputs, and doing some of these subtasks very well is important for being a good classifier. This follows from Claims 1 and 2.

Conversely, if such key subtasks do not exist, most good classifiers will not be CWMS.

Intuition 4: For most tasks which have key subtasks of the type mentioned in Claim 3, those subtasks are very likely to involve some sort of modeling of how things change / interact through time (for example, understanding the contents of a video or the meaning of a paragraph in a book requires this type of modeling).
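To make "modeling how things change through time" concrete, here is a minimal recurrent sketch (my own toy code): a state is carried across frames / sentences, which is exactly the kind of memory Claim 1 is about, and the temporal order of the inputs affects the result.

```python
import math


def recurrent_step(state, frame, w_state=0.9, w_input=0.5):
    """One step of a toy recurrent model: the new state depends on both the
    previous state and the current input, so information persists across time."""
    return math.tanh(w_state * state + w_input * frame)


def summarize_sequence(frames):
    """Process a video / paragraph as a sequence, carrying state across steps.
    A memoryless classifier would have to judge each frame in isolation."""
    state = 0.0
    for frame in frames:
        state = recurrent_step(state, frame)
    return state  # the final state summarizes how the inputs evolved over time


# Two sequences with the same values in a different temporal order get
# different summaries - i.e., the model is sensitive to change through time.
print(summarize_sequence([0.1, 0.5, 0.9]))
print(summarize_sequence([0.9, 0.5, 0.1]))
```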

Intuition 5: There are a lot of interesting, difficult tasks where modeling how things change through time is not key for solving the task.

Claim / Intuition 6: Optimizers for the tasks mentioned in Intuition 5 have a very low probability of finding a CWMS. This follows from Claims 2 and 3 and Intuition 4.

Thanks to Scott Garrabrant and Evan Hubinger for discussing some of the ideas in this post with me.

Comments:

There are a couple of pieces of this that I disagree with:

  • I think claim 1 is wrong because even if the memory is unhelpful, the agent which uses it might be simpler, and so you might still end up with an agent. My intuition is that just specifying a utility function and an optimization process is often much easier than specifying the complete details of the actual solution, and thus any sort of program-search-based optimization process (e.g. gradient descent in a nn) has a good chance of finding an agent.
  • I think claim 3 is wrong because agenty solutions exist for all tasks, even classification tasks. For example, take the function which spins up an agent, tasks that agent with the classification task, and then takes that agent's output. Unless you've done something to explicitly remove agents from your search space, this sort of solution always exists.
  • Thus, I think claim 6 is wrong due to my complaints about claims 1 and 3.

Re: first point, I think this is a difference in intuition about how simple / easy to find agents are in search space. My intuition is that they would be harder to find than regular functions that just do the task - I think this is generated by a more general intuition that finding a function that does A is easier than finding a function that does both A and B.

Re: second point, I agree - there will be some agents in the search space. Claim 3 is that if Claims 1 and 2 are true, then (for the specified type of task) it is very unlikely that the optimization process will find an agent; however, there is still a nonzero probability that it does.