Vanessa Kosoy

AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.

Vanessa Kosoy's Shortform

"Corrigibility" is usually defined as the property of AIs who don't resist modifications by their designers. Why would we want to perform such modifications? Mainly it's because we made errors in the initial implementation, and in particular the initial implementation is not aligned. But, this leads to a paradox: if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn't it also be flawed in a way that destroys corrigibility?

In order to stop passing the recursive buck, we must assume some dimensions along which our initial implementation is not allowed to be flawed. Therefore, corrigibility is only a well-posed notion in the context of a particular such assumption. Seen through this lens, the Hippocratic principle becomes a particular crystallization of corrigibility. Specifically, the Hippocratic principle assumes the agent has access to some reliable information about the user's policy and preferences (be it through timelines, revealed preferences or anything else).

Importantly, this information can be incomplete, which can motivate altering the agent along the way. And, the agent will not resist this alteration! Indeed, resisting the alteration is ruled out unless the AI can conclude with high confidence (and not just in expectation) that such resistance is harmless. Since we assumed the information is reliable, and the alteration is beneficial, the AI cannot reach such a conclusion.

For example, consider an HDTL agent getting upgraded to "Hippocratic CIRL" (assuming some sophisticated model of the relationship between human behavior and human preferences). In order to resist the modification, the agent would need a resistance strategy that (i) doesn't deviate too much from the human baseline and (ii) ends with the user submitting a favorable report. Such a strategy is quite unlikely to exist.

Soares, Tallinn, and Yudkowsky discuss AGI cognition

The total absence of obvious output of this kind from the rest of the "AI safety" field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors.

I am very confused by this comment. Everything discussed upwards of it seems to me like relatively mundane AI safety stuff? For example here I wrote about why generalization failures will be persistent, and ofc distribution shifts are widely discussed and the connection between distribution shifts and daemons / inner misalignment is also fairly well known.

Soares, Tallinn, and Yudkowsky discuss AGI cognition

Yudkowsky: 5. "Hiding" of thoughts, in a system with significant transparency, via an imaginative and reflective consequentialist that has deduced the existence of the transparency system and models how humans will react to thoughts exposed by the transparency system.

Tallinn: 5. hmm.. here i'm running into trouble (type mismatch error) again. i can imagine this in abstract (and perhaps incorrectly/anthropomorphisingly!), but would - at this stage - fail to code up anything like a gridworlds example. more research needed (TM) i guess :)

Yudkowsky: 5 - you're plausibly understanding this correctly, then, this is legit a lot harder to spec a gridworld example for (relative to my own present state of knowledge).

This can be modeled using what I dubbed Turing reinforcement learning (TRL). That is, you have a "core" RL agent connected to the environment and to an "envelope" external computer (i.e. a module that allows running programs and contains most of the computing power) that it regards as part of the environment.

Such a system learns how to program the envelope to make a more powerful agent, and this more powerful agent learns how to reprogram the envelope part of itself even better. So, the transparency system would catch the core having obviously malign thoughts, but the core will instead end up having thoughts like "I should run such-and-such program on the envelope for such-and-such abstract reasons" while the envelope contains opaque code that keeps optimizing itself to become more opaque, and all the direct plotting is inside the computations performed by the opaque code.

Vanessa Kosoy's Shortform

There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.

We want algorithms that learn domain D while gaining as little knowledge as possible about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space $X$, label space $Y$, distribution $\mu \in \Delta(X \times Y)$ and loss function $L: Y \times Y \to \mathbb{R}$. Similarly, domain E is represented by instance space $Z$, label space $W$, distribution $\nu \in \Delta(Z \times W)$ and loss function $M: W \times W \to \mathbb{R}$. The distributions $\mu, \nu$ are initially unknown and we assume some prior over them: $\zeta \in \Delta(\Delta(X \times Y) \times \Delta(Z \times W))$. The prior involves some correlation between D and E, hence learning about D tends to acquire information about E as well.

A learning algorithm $A$ for D is $A: (X \times Y)^* \to Y^X$ (receives a data sample and produces a label prediction function). A learning algorithm $B$ for E has access to knowledge generated by $A$: $B: Y^X \times (Z \times W)^* \to W^Z$. We can now consider zero-sum games of the following form: you choose an algorithm $A$, the adversary looks at $A$ and chooses an algorithm $B$, your payoff decreases with your expected loss $E_L$ and increases with the adversary's expected loss $E_M$ (e.g. it is given by $-E_L + \alpha E_M$ for some parameter $\alpha > 0$). The expected losses are given by

$$E_L := \mathbb{E}_{(\mu,\nu)\sim\zeta,\ S\sim\mu^n,\ (x,y)\sim\mu}\left[L\big(y, A(S)(x)\big)\right]$$

$$E_M := \mathbb{E}_{(\mu,\nu)\sim\zeta,\ S\sim\mu^n,\ T\sim\nu^m,\ (z,w)\sim\nu}\left[M\big(w, B(A(S),T)(z)\big)\right]$$

Here $n, m$ are the sample sizes. The interesting case is $n \gg m$ or even $m = 0$.

Here's a very simple example. Suppose that $Y = \{0,1\}$, $L(y,y') = 1_{y \neq y'}$ and $\zeta$ is s.t. (i) each $x \in X$ is assigned a persistent label sampled uniformly at random from $Y$ independently of other instances (ii) information about the labels doesn't help with E but information about the distribution on $X$ does help with E. When we care only about $E_L$ the best we can do is memorize the samples, i.e. set $A(S)(x)$ to $y$ if $(x,y) \in S$ and set it to $0$ otherwise. However, this would aid the adversary. Instead, we can set $A(S)(x)$ to $y$ if $(x,y) \in S$ and set it to a coinflip otherwise. Now our loss is somewhat worse (but, for discrete $X$ it still goes to $0$ as $n$ goes to $\infty$) but the adversary gains no information from us!

It is also possible to ignore any knowledge we have about E and just try designing $A$ which simultaneously minimizes the mutual information between $S$ and $A(S)$ and minimizes $E_L$. Going to an even higher level of abstraction, this is similar to the following problem:

Let $(U, V, \mathcal{E})$ be a bipartite graph ($U$ are the left vertices, $V$ are the right vertices, $\mathcal{E} \subseteq U \times V$ are the edges) and $\xi$ a distribution on $U$. Find $f: U \to V$ s.t. (i) for any $u \in U$, $(u, f(u)) \in \mathcal{E}$ and (ii) if we sample $u$ from $\xi$ then the mutual information between $u$ and $f(u)$ is minimal. That is, we are minimizing the following:

$$I(u; f(u)) = -\sum_{v \in V} \xi\big(f^{-1}(v)\big) \ln \xi\big(f^{-1}(v)\big)$$

It would be interesting to understand the computational complexity of this problem (and/or of relaxations when we're allowed to approximate).

Finally, it is interesting to also impose computational complexity constraints on our $A$ (but perhaps not on $B$: obfuscating the learned representation means the knowledge about E is inaccessible from outside but might be still exploitable by the AI itself), in which case we would split it into a representation space $\mathcal{R}$, a training algorithm $A_{\text{train}}: (X \times Y)^* \to \mathcal{R}$ and a prediction algorithm $A_{\text{predict}}: \mathcal{R} \times X \to Y$ (both of which have to lie in some low complexity class e.g. $\mathsf{P}$), whereas the signature of $B$ becomes $B: \mathcal{R} \times (Z \times W)^* \to W^Z$.
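To make the simple memorize-versus-coinflip example above concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the original formalism: instances are the integers `0..num_instances-1`, the function names are made up, and the "default to 0" learner is just one way to write the pure memorizer. The point it shows is that the default-to-0 learner marks every predicted `1` as a point that was in the sample (leaking information about the instance distribution), while the coinflip learner outputs a table that looks like uniformly random labels everywhere.

```python
import random

def sample_domain(weights, labels, n, rng):
    """Draw n labeled examples; weights define the (hidden) instance distribution."""
    xs = rng.choices(range(len(weights)), weights=weights, k=n)
    return [(x, labels[x]) for x in xs]

def memorize_default_zero(sample, num_instances):
    """Predict the memorized label on seen instances and 0 everywhere else."""
    seen = dict(sample)
    return [seen.get(x, 0) for x in range(num_instances)]

def memorize_coinflip(sample, num_instances, rng):
    """Predict the memorized label on seen instances and a persistent coinflip elsewhere."""
    seen = dict(sample)
    return [seen[x] if x in seen else rng.randint(0, 1)
            for x in range(num_instances)]

if __name__ == "__main__":
    rng = random.Random(0)
    num_instances = 20
    # Persistent uniformly random labels, as in condition (i) of the example.
    labels = [rng.randint(0, 1) for _ in range(num_instances)]
    # A skewed instance distribution: this is what the adversary would like to learn.
    weights = [10 if x < 5 else 1 for x in range(num_instances)]
    sample = sample_domain(weights, labels, 8, rng)
    print("default-zero:", memorize_default_zero(sample, num_instances))
    print("coinflip:    ", memorize_coinflip(sample, num_instances, rng))
```

Both learners agree with the data on every seen instance, so they differ only in what their outputs reveal about which instances were sampled.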

Vanessa Kosoy's Shortform

I don't understand what you're saying here.

For debate, goodharting means producing an answer which can be defended successfully in front of the judge, even in the face of an opponent pointing out all the flaws, but which is nevertheless bad. My assumption here is: it's harder to produce such an answer than producing a genuinely good (and defensible) answer. If this assumption holds, then there is a range of quantilization parameters which yields good answers.

For the question of "what is a good plan to solve AI risk", the assumption seems solid enough since we're not worried about coming across such deceptive plans on our own, and it's hard to imagine humans producing one even on purpose. To the extent our search for plans relies mostly on our ability to evaluate arguments and find counterarguments, it seems like the difference between the former and the latter is not great anyway. This argument is especially strong if we use human debaters as the baseline distribution, although in this case we are vulnerable to the same competitiveness problem as amplified imitation, namely that reliably predicting rich outputs might be infeasible.

For the question of "should we continue changing the quantilization parameter", the assumption still holds because the debater arguing to stop at the given point can win by presenting a plan to solve AI risk which is superior to continuing to change the parameter.

Vanessa Kosoy's Shortform

Epistemic status: most elements are not new, but the synthesis seems useful.

Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).

Arguably the biggest concern with naive debate[1] is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterarguments" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quantilization postulates that the true utility function is the difference between the proxy utility function and an unknown adversarial cost function with bounded expected value w.r.t. some base distribution. For debate, we can imagine the cost function to measure the extent to which a particular utterance is Lovecraftian. It also protects us from Cartesian daemons. There are several candidates for the base distribution:

  • Uniform
  • Some sort of bounded simplicity prior. This amounts to postulating that a Lovecraftian utterance is unlikely to be produced by a randomized algorithm of low description and computational complexity.
  • The distribution of utterances produced by human debaters, acquired via imitation learning (this requires hardening[2] against daemons)

The issue with this is that it might be hard to choose the quantilization parameter. An error in one direction produces a system which is too weak for pivotal acts. An error in the other direction produces an unsafe system. However, it seems likely that there is some range of parameters which is okay: arguably, the optimization power needed to produce a natural language description of a pivotal act[3] is not as big as the optimization power required for Lovecraftian utterances. Indeed, it seems unlikely we could design Lovecraftian utterances ourselves in the foreseeable future, so if solving alignment is not easier then it's hopeless anyway.
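As a rough illustration of what the quantilization parameter controls, here is a minimal Monte Carlo sketch of a quantilizer in Python. The names `quantilize`, `base_sampler` and `proxy_utility` are hypothetical, and approximating the top quantile from a finite batch of samples is an assumption of this sketch, not something specified above. With `q = 1` the output is just a draw from the base distribution (maximally safe, no optimization); as `q` shrinks, more optimization pressure is applied to the proxy.

```python
import random

def quantilize(base_sampler, proxy_utility, q, num_samples, rng):
    """Approximate q-quantilizer: sample from the base distribution,
    keep the top q fraction by proxy utility, and pick one uniformly."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    candidates = [base_sampler(rng) for _ in range(num_samples)]
    candidates.sort(key=proxy_utility, reverse=True)
    keep = max(1, int(q * num_samples))  # always keep at least one candidate
    return rng.choice(candidates[:keep])

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins: "utterances" are integers, proxy utility is the value itself.
    draw = lambda r: r.randint(0, 100)
    score = lambda u: u
    print(quantilize(draw, score, q=1.0, num_samples=100, rng=rng))
    print(quantilize(draw, score, q=0.05, num_samples=100, rng=rng))
```

In the debate setting, `base_sampler` would correspond to one of the candidate base distributions listed above and `proxy_utility` to the judge's verdict.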

To address this, I propose the following "autocalibration" procedure. Let $\alpha_1, \alpha_2, \ldots$ be a sequence of quantilization parameters, starting from extremely aggressive (definitely safe) and then becoming gradually less aggressive (stronger optimization, weaker safety) s.t. we can be reasonably sure it doesn't skip over the target range. For each $i$, we consider an $\alpha_i$-quantilized debate about the question "is replacing $\alpha_i$-quantilized debate by $\alpha_{i+1}$-quantilized debate a good idea?". We can then imagine executing this sequence of debates until the point where the answer switches from "yes" to "no".
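The control flow of the autocalibration procedure can be sketched as a simple loop. In this Python sketch, `endorses_step` is a hypothetical stand-in for the outcome of a debate run at the current parameter about whether to move to the next one (or, later, for a predictor of that outcome); nothing about how such a debate is actually run is modeled here.

```python
from typing import Callable, Sequence

def autocalibrate(params: Sequence[float],
                  endorses_step: Callable[[float, float], bool]) -> float:
    """Walk the parameter sequence from safest to least safe, stopping
    at the first parameter whose own debate rejects the next step."""
    if not params:
        raise ValueError("need at least one quantilization parameter")
    current = params[0]
    for nxt in params[1:]:
        # Debate quantilized at `current` judges the move to `nxt`.
        if not endorses_step(current, nxt):
            break
        current = nxt
    return current

if __name__ == "__main__":
    # Toy judge: any step is endorsed as long as the next parameter stays >= 0.25.
    schedule = [1.0, 0.5, 0.25, 0.125]
    print(autocalibrate(schedule, lambda cur, nxt: nxt >= 0.25))
```

The key design property is that each step is judged only by a debate at a parameter that has already been endorsed, so the procedure never relies on the verdict of a less safe debate than the one currently trusted.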

Now, this sequence might be impractically time consuming. To address this, we can introduce a prediction system which predicts the outcomes of debates without having to actually run them. Notice that this is more viable than e.g. amplifying human imitation via prediction, because in this case we are predicting a low-dimensional signal whereas with imitation we have a high-dimensional signal with a much lower-dimensional but unknown subspace of relevant features. Ofc this system has to be hardened[2] against daemons. Once the optimal quantilization parameter has been determined, we can use it for a debate about the pivotal act.

There are various further nuances that should be considered:

  • Various ways to train the debaters can be considered, e.g. starting by generating good models of human judges[4] by passive observation and fine-tuning via actual debate.
  • The quantilized game is not exactly zero-sum so we need to rule out superrational cooperation, either by designing the debaters appropriately or via a unitary AI that is designed to search for Nash equilibria.
  • Instead of a discrete sequence, we can imagine continuous fine-tuning of the parameter i.e. finding the most aggressive quantilization parameter which is self-endorsing.
  • Various ways to operationalize and structure the debates might be considered e.g. the 4 player system.

Finally, we also need to deal with non-Cartesian daemons. For this, let's consider a "Turing" setting, i.e. a machine learning "core" connected to an external computer "envelope". We assume that the core is not powerful enough to host a non-Cartesian daemon all by itself. We then protect ourselves by applying quantilization to the envelope actions as well. We might also be able to design a more detailed prior about dangerous programs than about Lovecraftian utterances. This requires autocalibrating at least two parameters, which seems doable: for example, increase both parameters together until you hit the Pareto frontier, then optimize along the frontier. A weakness in this approach is that, for very aggressive quantilization the prediction system will become too weak to be useful, but perhaps this can be compensated by the more detailed prior.

  1. Without anything like factored cognition, which I am skeptical about.

  2. To deal with Cartesian daemons we (i) only make predictions about the counterfactual in which the prediction wasn't made (ii) only produce predictions when they pass a confidence threshold (iii) use a physicalist prior (to be defined in a future post). Non-Cartesian daemons are addressed below.

  3. Including instructions for designing a different aligned AI.

  4. This entire approach is the polar opposite of MIRI's "avoid human models" strategy, nevertheless it seems viable.

Ngo and Yudkowsky on AI capability gains

As an aside, I think of FDT as being roughly in the same category: well-defined in Newcomb's problem and with exact duplicates, but reliant on vague intuitions to generalise to anything else.

FDT was made rigorous by infra-Bayesianism, at least in the pseudocausal case.

Vanessa Kosoy's Shortform

The idea comes from this comment of Eliezer.

Class II or higher systems might admit an attack vector by daemons that infer the universe from the agent's source code. That is, we can imagine a malign hypothesis that makes a treacherous turn after observing enough past actions to infer information about the system's own source code and infer the physical universe from that. (For example, in a TRL setting it can match the actions to the output of a particular program for envelope.) Such daemons are not as powerful as malign simulation hypotheses, since their prior probability is not especially large (compared to the true hypothesis), but might still be non-negligible. Moreover, it is not clear whether the source code can realistically have enough information to enable an attack, but the opposite is not entirely obvious.

To account for this I propose to designate as class I those systems which don't admit this attack vector. For the potential sense, it means that either (i) the system's design is too simple to enable inferring much about the physical universe, or (ii) there is no access to past actions (including opponent actions for self-play) or (iii) the label space is small, which means an attack requires making many distinct errors, and such errors are penalized quickly. And ofc it requires no direct access to the source code.

We can maybe imagine an attack vector even for class I systems, if most metacosmologically plausible universes are sufficiently similar, but this is not very likely. Nevertheless, we can reserve the label class 0 for systems that explicitly rule out even such attacks.

Ngo and Yudkowsky on AI capability gains

Comment after reading section 5.3:

Eliezer: What's an example of a novel prediction made by the notion of probability?

Richard: Most applications of the central limit theorem.

Eliezer: Then I should get to claim every kind of optimization algorithm which used expected utility, as a successful advance prediction of expected utility?


Richard: These seem better than nothing, but still fairly unsatisfying, insofar as I think they are related to more shallow properties of the theory.

This exchange makes me wonder whether Richard would accept the successes of reinforcement learning as "predictions" of the kind he is looking for? Because RL is essentially the straightforward engineering implementation of "expected utility theory".

Ngo and Yudkowsky on alignment difficulty

Comment after reading section 3:

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yudkowsky and I seem to agree that "do a pivotal act directly" is not something productive for us to work on, but "do alignment research" is something productive for us to work on. Therefore, there exists some range of AI capabilities which allow for superhuman alignment research but not for pivotal acts. Maybe this range is so narrow that in practice AI capability will cross it very quickly, or maybe not.

Moreover, I believe that there are trade-offs between safety and capability. This not only seems plausible, but actually shows up in many approaches to safety (quantilization, confidence thresholds / consensus algorithms, homomorphic encryption...). Therefore, it's not safe to assume that any level of capability sufficient to pose risk (i.e. for a negative pivotal act) is also sufficient for a positive pivotal act.

Yudkowsky seems to claim that aligning an AI that does further alignment research is just too hard, and instead we should be designing AIs that are only competent in a narrow domain (e.g. competent at designing nanosystems but not at manipulating humans). Now, this does seem like an interesting class of alignment strategies, but it's not the only class.

One class of alignment strategies (which in particular Christiano wrote a lot about) compatible with bootstrapping is "amplified imitation of users" (e.g. IDA but I don't want to focus on IDA too much because of certain specifics I am skeptical about). This is potentially vulnerable to attack from counterfactuals plus the usual malign simulation hypotheses, but is not obviously doomed. There is also a potential issue with capability: maybe predicting is too hard if you don't know which features are important to predict and which aren't.

Another class of alignment strategies (which in particular Russell often promotes) compatible with bootstrapping is "learn what the user wants and find a plan to achieve it" (e.g. IRL/CIRL etc). This is hard because it requires formalizing "what the user wants" but might be tractable via something along the lines of the AIT definition of intelligence. Making it safe probably requires imposing something like the Hippocratic principle, which, if you think through the implications, pulls it in the direction of the "superimitation" class. But, this might avoid superimitation's capability issues.

It could be that "restricted cognition" will turn out to be superior to both superimitation and value learning, but it seems far from a slam dunk at this point.
