Certainly if you just have access to a weaker policy, this doesn't make the problem any easier. If you could take a weak policy and amplify it into a stronger policy efficiently, then you could just repeatedly apply this policy-improvement operator to some very weak base policy (say, a neural net with random weights) to solve the full problem. (If you have a much stronger aligned base policy, e.g. the human policy with short inputs and over a short time horizon, then this assumption is more powerful.) The more interesting assumption is that you have lots of time and compute, which does seem to have a lot of potential. I feel pretty optimistic that a human thinking for a long time could reach "superhuman performance" by our current standards, though of course capability amplification asks for a stronger guarantee: can we do this in a particular structured way?

We say that A ⪰ B if we are at least as happy with policy A as with policy B (in any situation that we think might arise in practice).

This sounds like a partial order to me. But then:

C is reachable from A if there is a chain of policies in 𝒜 which starts at A and ends at C, and where each policy in the chain is no better than the amplification of the previous policy.

I interpret this as saying: if $A$ and $B$ are consecutive policies in the chain, then $B \not> A^{+}$. But I believe that the property we want is $A^{+} \geq B$, which is a different condition if $\geq$ defines only a partial ordering. Does that seem right to you?

I might rephrase it as "where the amplification of each policy in the chain is at least as good as the subsequent policy".
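To make the difference between the two readings concrete, here is a minimal sketch. It assumes, purely for illustration, that policies are scored on two independent dimensions and compared componentwise; that ordering and the scores are my own toy choices, not anything from the post.

```python
# Toy partial order (an assumption for illustration): a policy is a tuple of
# scores, and x is at least as good as y iff x is no worse on every dimension.
def at_least_as_good(x, y):
    return all(a >= b for a, b in zip(x, y))

def strictly_better(x, y):
    return at_least_as_good(x, y) and x != y

a_plus = (1, 0)  # hypothetical amplification of A
b = (0, 1)       # hypothetical next policy in the chain

# Reading 1: "B is no better than A+", i.e. B is not strictly better than A+.
no_better_reading = not strictly_better(b, a_plus)

# Reading 2: "A+ is at least as good as B".
dominance_reading = at_least_as_good(a_plus, b)

print(no_better_reading, dominance_reading)
```

Because the two policies are incomparable, the first reading admits the pair while the second rejects it. Under a total order the two readings would coincide.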

We say that C is reachable from A if:

A⁺ ⪰ C, where A⁺ is the amplification as described in the last section; or

There is an intermediate B ∈ 𝓐 which is reachable from A and which can reach C.

It took me a while to realize why you went with this definition. I thought you were going for a simple recursive definition, in which case you could define $C$ to be reachable from $A$ if $A \geq C$, or if $C$ is reachable from $A^{+}$. Equivalently, there is a chain of amplifications of $A$ such that the resulting policy dominates $C$. The problem with this definition is that there isn't a corresponding notion of obstructions for my definition, because it isn't transitive. It is possible to have $B$ reachable from $A$, and $C$ reachable from $B$, but not $C$ reachable from $A$.

On the other hand, I believe your definition is the transitive closure of the relation $R$, where $(A, C) \in R$ iff $A^{+} \geq C$, and so a notion of obstructions comes out naturally.
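Here is a small sketch of the contrast, under assumptions of my own: a four-policy toy set, componentwise comparison of score tuples as the partial order, and a hand-picked amplification map in which $A$ is a fixed point. None of these names or numbers come from the post. The simple recursive definition fails to be transitive on this example, while the transitive closure of $R$ does connect $A$ to $C$.

```python
# Hypothetical toy policies, scored on two dimensions (illustrative only).
score = {"A": (2, 2), "B": (1, 1), "B_plus": (1, 3), "C": (0, 3)}

# Hypothetical amplification map; A is stuck at a fixed point.
amplify = {"A": "A", "B": "B_plus", "B_plus": "B_plus", "C": "C"}

def dominates(x, y):
    """x >= y in the componentwise partial order on scores."""
    return all(a >= b for a, b in zip(score[x], score[y]))

def simple_reachable(a, c):
    """Simple recursive definition: some iterated amplification of a dominates c."""
    seen = set()
    cur = a
    while cur not in seen:  # stop once the amplification chain starts repeating
        if dominates(cur, c):
            return True
        seen.add(cur)
        cur = amplify[cur]
    return False

def closure_reachable():
    """Transitive closure of R, where (x, y) is in R iff amplify[x] dominates y."""
    names = list(score)
    closure = {(x, y) for x in names for y in names if dominates(amplify[x], y)}
    for k in names:  # Warshall-style closure over the finite policy set
        for i in names:
            for j in names:
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    return closure

closure = closure_reachable()
print(simple_reachable("A", "B"), simple_reachable("B", "C"), simple_reachable("A", "C"))
print(("A", "C") in closure)
```

The failure depends on amplification not being monotone here: $A \geq B$, yet $A^{+}$ does not dominate $B^{+}$. If amplification preserved the ordering, I would expect the simple definition to be transitive as well.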

Analogously, we say that a function L : 𝓐 → ℝ is an obstruction if our amplification procedure cannot always increase L.

... to its maximal value in 𝓐. (Obvious, but worth saying.)
