davidad

Worst-case thinking in AI alignment

Here's the results of an abbreviated literature search for papers that bring quantile-case concepts into contact with contemporary RL and/or deep learning:

- Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Christoph Dann, Tor Lattimore, Emma Brunskill. NIPS 2017.
- Defines a concept of "Uniform-PAC bound", which is roughly when -quantile-case episodic regret scales polynomially in .
- Proves that a Uniform-PAC bound implies:
- PAC bound
- Uniform high-probability regret bound
- Convergence to zero regret with high probability

- Constructs an algorithm, UBEV, that has a Uniform-PAC bound
- Empirically compares quite favorably to other algorithms with only PAC or regret bounds

- Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. ICML 2019.
- Defines an even stronger concept of "IPOC bound", which implies Uniform-PAC, and also outputs a certified per-episode regret bound along with each proposed action.
- Constructs an algorithm ORLC that has an IPOC-bound
- Empirically compares favorably to UBEV

- Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models. Gintare Dziugaite. December 2018 PhD thesis under Zoubin Ghahrmani.
- Lipschitz Lifelong Reinforcement Learning. Erwan Lecarpentier, David Abel, Kavosh Asadi, et al. AAAI 2021.
- Defines a pseudometric on the space of all MDPs
- Proves that the mapping from an MDP to its optimal Q-function is (pseudo-)Lipschitz
- Uses this to construct an algorithm LRMax that can transfer-learn from past similar MDPs while also being PAC-MDP

- Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. Jiafan He, Dongruo Zhou, Quanwaun Gu. NIPS 2021.
- Constructs an algorithm FLUTE that has a Uniform-PAC bound with a certain linearity assumption on the structure of the MDP being learned.

- Beyond No Regret: Instance-Dependent PAC RL. Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson. August 2021 preprint.
- Learning PAC-Bayes Priors for Probabilistic Neural Networks. María Pérez-Ortiz, Omar Rivasplata, Benjamin Guedj, et al. September 2021 preprint.
- Tigheter Risk Certificates for Neural Networks. María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári. ICML 2021.
- PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, Andreas Krause. ICML 2021.

Worst-case thinking in AI alignment

Somewhere between worst-case and average-case performance is **quantile-case performance**, known in SRE circles as percentile latency and widely measured empirically in practice (but rarely estimated in theory). Formally, optimizing -quantile-case performance looks like (compare to my expressions below for other cases). My impression is that quantile-case is heavily underexplored in theoretical CS and also underused in ML, with the exceptions of PAC learning and VC theory.

Worst-case thinking in AI alignment

My interpretation of the NFL theorems is that solving the relevant problems under worst-case assumptions is **too easy**, so easy it's trivial: a brute-force search satisfies the criterion of worst-case optimality. So, that being settled, in order to make progress, we have to step up to average-case evaluation, which is harder.

(However, I agree that once we already need to do *some* averaging, making explicit and stripping down the statistical assumptions and trying to get closer to worst-case guarantees—without making the problem trivial again—is harder than just evaluating empirically against benchmarks.)

Worst-case thinking in AI alignment

To elaborate this formally,

- is best-case
- is worst-case
- is average-case

and are both "easier" monoids than essentially because of dominance relations; for any , there's going to be a single that dominates all others, in the sense that all other can be excluded from consideration and have no impact on the outcome. Whereas when calculating , the only that can be excluded are those outside the distribution's support.

is even easier than because it commutes with the outer ; not only is there a single that dominates all others, it doesn't necessarily even depend on (the problem can be solved as or ). As a concrete example, the best case for nearly any sorting algorithm is already-sorted input, whereas the worst case depends more on which algorithm is being examined.

Are limited-horizon agents a good heuristic for the off-switch problem?

I’m curious to dig into your example.

- Here’s an experiment that I could imagine uncovering such internal planning:
- make sure the corpus has no instances of a token “jrzxd”, then
- insert long sequences of “jrzxd jrzxd jrzxd … jrzxd” at random locations in the middle of sentences (sort of like introns),
- then observe whether the trained model predicts “jrzxd” with greater likelihood than its base rate (which we’d presume is because it’s planning to take some loss now in exchange for confidently predicting more “jrzxd”s to follow).

- I think this sort of behavior
*could*be coaxed out of an actor-critic model (with hyperparameter tuning, etc.), but not GPT-3. GPT-3 doesn’t have any pressure towards a Bellman-equation-satisfying model, where future reward influences current output probabilities. - I’m curious if you agree or disagree and what you think I’m missing.

ARC's first technical report: Eliciting Latent Knowledge

the actual good state is much harder to access than (e.g. because it requires achieving hard real-world goals)

That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to ), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like "zero incentive to tamper, and tampering seems complicated" fail here.

The sense in which the model knows about the corruption is that it brought it about and reasoned about the nature of the sensor tampering in order to predict the transition to .

While some s may indeed predict this via reasoning, not all s that behave this way would, for example an that internally modeled the tampering sequence of actions incorrectly as actually leading to (and didn't even model a distinct ). I think either:

- (A) it would be at least as apt to ascribe a confused model to as to ascribe one in which it "reasoned about the nature of the sensor tampering" (e.g. if a contemporary model-free RL robot did some sensor tampering, I'd probably ascribe to it the belief that the tampering actually led to ), or
- (B) would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in , or null my-actuators-are-disconnected actions) with higher probability in the future of than in the future of (making those states separable in ), or
- (C) If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then does get identified with , but some
*other*action sequence (of similar length) would lead from to , a state in which bizarre observations appear forever that would be extremely unlikely at any state in .

My main question is why those preceding states get mapped correctly.[...] The only way I see this working is if the AI can "abort" its plan at the last minute and demonstrate a hack.[...] you can't distinguish the "almost-tampered" state naturally since it's just one more state that has this property.

First, I don't think timing ("last minute", "almost-tampered") is critical here. If timing *were* critical, the 'breaker' could reparameterize the situation so that complicated plans can be encoded into single actions in ; then can go directly from the initial state to fully tampered with a guarding successor—no room for an "almost-tampered" state. In this case the *initial* state must already be detectable as bad. I imagine this detection working because the adversary (which is as strong as ) can exhibit an action that leads from to , which establishes as far from . In this case, of course, is a no-go, and the overseers probably need to shrink .

With a reasonably small action space, my intuition is this: In constructing and delegating to a successor which provides with delusions, there is a point of no return when the successor is "deployed" and begins to resist changes to its objectives. At any time before that point, could, with a relatively short action sequence (compared to constructing the successor at all), modify the successor's objectives so that it induces instead of . As the successor comes closer to being complete, the total number of actions the adversary would need in order to reach (to, as you say, "demonstrate a hack") decreases, so the distance between and increases. Thereby, the path to passes through detectably bad territory.

One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable. Such failure modes are in a natural correspondence with safety mechanisms that humans could use to ensure the objective specification in an executable AI is tamper-proof. My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead); if we can construct a counterexample it would be very useful for that other purpose!

(ii) even if that happens it doesn't look to me like the proposed loss actually captures that if there are other similarly-severe predictive errors in the human's model

I agree that regions of that have nothing to do with delusion/tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to 's ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but (a) much less conservative than typical impact-measure penalties, and (b) if can learn what's going on with these regions of and develop corresponding regions of , then the distance penalty would be replaced by 's informed evaluation of them.

ARC's first technical report: Eliciting Latent Knowledge

(Thanks for playing along with me as 'breaker'!) I agree that such an would have zero distance from the corresponding , but I have some counterpoints:

**This is a problem for ELK in general, to the extent it's a problem**(which I think is smallish-but-non-negligible). An with this property is functionally equivalent to an which*actually believes*that refers to the same state of the real world as . So the dynamics of 's world-model don't contain any latent knowledge of the difference at this point.- This seems to be against the ELK report's knowledge-criterion "There is a feature of the computation done by the AI which is robustly correlated with Z."
- The only way I can think of that ELK could claim to reliably distinguish from is by arguing that the only plausible way to get such an is via a training trajectory where some previous did treat differently from , and perform ELK monitoring at training checkpoints (in which case I don't see reason to expect my approach comes off worse than others).

**Such an****would not be incentivized by the model**. Assuming that rewards factor through , . So a policy that's optimized against the world-model wouldn't have enough selection pressure to find the presumably narrow and high-entropy path that would lead to the tampered state from the initial state (assuming that the initial state in the real world at deployment is tamper-free).**In the real world,****has****higher****expected loss in**. If all the sensor inputs are generated by simulating the human model, has completely blinded itself to potential threats that could disrupt the tampering and reveal a missing diamond. These real-world threats are independent of the threats that the tampering box would be stochastically simulating as part of the human model, either of which would produce observations with high loss. Thus, the real-world expected loss in is typically strictly higher than in .- This seems to be against the ELK report's knowledge-criterion "M(
*x*) produces outputs that receive a significantly lower loss in the real world, where A′ is the right answer to Q, than in some kind of counterfactual where A is a better answer." - The only way I can think of that could reach an internal state like without regret (with respect to the real-world dynamics) is by first constructing and delegating to an aligned successor, and
*only then*transitioning into a state like (and, effectively, completely forgetting about everything weird). Of course, if that has happened, it is too late for ELK to help anyway. We need to apply ELK to a whole plan in advance, not just to the final state.- EtA: My intuition here is similar to the one you pointed at in Alignment for advanced machine learning systems (page 14) insofar as I think reaching requires previously crossing state-space territory that was detectably unsafe. However, instead of trying to smooth out a utility-relevant representation as they suggest (so that appears to "remember" how weird was), I'm saying, it's okay if the ELK-informed utility estimate is equal to the human utility , because since is really bad, the safety system should avoid entering and thereby avoid being in a position to enter .

- This seems to be against the ELK report's knowledge-criterion "M(

ARC's first technical report: Eliciting Latent Knowledge

This is what we want to do, and intuitively you ought to be able to back out info about the hidden state, but it's not clear how to do so.

Here's an approach I just thought of, building on scottviteri's comment. Forgive me if there turns out to be nothing new here.

Supposing that the machine and the human are working with the same observation space () and action space (), then the human's model and the machine's model are both coalgebras of the endofunctor , therefore both have a canonical morphism into the terminal coalgebra of , (assuming that such an exists in the ambient category). That is, we can map and . Then, if we can define a distance function on with type , we can use these maps to define distances between human states and machine states, .

**How can we make use of a distance function?** Basically, we can use the distance function to define a kernel (e.g. ), and then use kernel regression to predict the utility of states in by averaging "nearby" states in , and then finally (and crucially) estimating the generalization error so that states from that aren't really near to *anywhere* in get big warning flags (and/or utility penalties for being outside a trust region).

**How to get such a distance function?** One way is to use (the category of complete metric spaces) as the ambient category, and instantiate as the Kantorovich monad. Crank-turning yields the formula

where is constrained to be a non-expansive map, i.e., it is subject to the condition . If is discrete, I think this is maybe equivalent to an adversarial game where the adversary chooses, for every possible and , a partition of and a next action, and optimizes the probability that sampled predictions from and will eventually predict observations on opposite sides of the partition. This distance function is canonical, but in some sense seems too strict: if knows more about the world than , then of course the adversary will be able to find an action policy that eventually leads the state into some region that can confidently predict with while finds it very unlikely (). In other words, even if two states are basically concordant, this distance function will consider them maximally distant if there exists any policy that eventually leads to a maximal breakdown of bisimulation. (Both the canonical character and the too-strict character are in common with metrics.)

Inspired by this kind of corecursion but seeking more flexibility, let's consider the induced metric on the type *itself*, namely the -norm , then build a contraction map on that space and apply the Banach fixed-point theorem to pick out a well-defined . For example,

We are now firmly in Abstract Dynamic Programming territory. The distance between two states is the maximum score achievable by an adversary playing an MDP with state space as the product , the initial state as the pair of states being compared, the one-stage reward as the divergence of predictions about observations between the two models, the dynamics as just the H and M dynamics evolving separately (but fed identical actions), and exponential discounting.

The divergence is a free parameter here, although it has to be bounded, but it doesn't have to be a metric. It could be attainable utility regret, or KL divergence, or Jensen-Shannon divergence, or Bhattacharyya distance, etc. (with truncation or softmax to keep them bounded); lots of potential for experimentation here.

davidad's Shortform

Among computational constraints, I think the most significant/fundamental are, in order,

- Semicomputability
- P (polynomial time)
- PSPACE
- Computability
- BPP
- NP
- (first-order hypercomputable)
- All the rest (BQP, PP, RP, EXPTIME, etc)

Agreed—although optimizing for the worst case is usually easier than optimizing for the average case,

satisficingfor the worst case is necessarily harder (and, in ML, typically impossible) than satisficing for the average case.