Vojtech Kovarik

My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering "consequentialist reasoning").


Formalising Catastrophic Goodhart


Quick reaction:

  • I didn't want to use the ">1 billion people" formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up controlling the future in the end.
  • I didn't want to use "existential risk", because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
  • And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/whatever takes over and does its thing. Which probably includes getting rid of literally everyone, modulo some important (but probably not decision-relevant?) questions about anthropics and negotiating with aliens.

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?

(My current take is that for the majority of readers, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)

I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).

That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn't so much to figure out whether the particular arguments go through or not, but to ask which properties your model must have if you want to be able to evaluate those arguments rigorously.

(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)

It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?

My reaction to this is: Actually, current LLMs do care about our preferences, and about their guardrails. It was never about getting some AI to care about our preferences. It is about getting powerful AIs to robustly care about our preferences, where "robustly" includes things like (i) not caring about other things as well (e.g., prediction accuracy), (ii) generalising correctly (e.g., not just maximising human approval), and (iii) not breaking down when we increase the amount of optimisation pressure a lot (e.g., will it still work once we hook it into future-AutoGPT-that-actually-works and have it run for a long time?).

Some examples of what would cause me to update are: If we could make LLMs not jailbreakable without relying on additional filters on input or output.

Nitpicky comment / edit request: The circle inversion figure was quite confusing to me. Perhaps add a note to it saying that solid green maps onto solid blue, red maps onto itself, and dotted green maps onto dotted blue. (Rather than colours mapping to each other, which is what I intuitively expected.)

Fun example: The evolution of offensive words seems relevant here. I.e., we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgangers, and mark them as offensive as well.

E.g. Living in large groups such that it’s hard for a predator to focus on any particular individual; a zebra’s stripes.

Off-topic, but: Does anybody have a reference for this, or a better example? This is the first time I have heard this theory about zebras.

Two points that seem relevant here:

  1. To what extent are "things like LLMs" and "things like AutoGPT" very different creatures, with the latter sometimes behaving like a unitary agent?
  2. Assuming that the distinction in (1) matters, how often do we expect to see AutoGPT-like things?

(At the moment, both of these questions seem open.)

This made me think of "lawyer-speak", and other jargons.

More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)

I would like to point out one aspect of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect of adversarial vulnerability in widespread-automation worlds.

Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).

But ultimately, I think none of these fits perfectly. So a longer, self-contained description is something like:

  • Consider the world where we automate more and more things using AI systems that have vulnerable components. Perhaps those vulnerabilities primarily come from narrow-purpose neural networks and foundation models. But some might also come from insecure software design, software bugs, and humans in the loop.
  • And suppose some parts of the economy/society will be designed more securely (some banks, intelligence services, planes, hopefully nukes)...while others just have glaring security holes.
  • A naive expectation would be that a security hole gets fixed if and only if there is somebody who would be able to exploit it. This is overly optimistic, but note that even this implies the existence of many vulnerabilities that would require a stronger-than-existing level of capability to exploit. More realistically, the actual bar for leaving a security hole unfixed will be "there might be many people who can exploit this, but it is not worth their opportunity cost". And then we will also fail to fix all the holes that we are unaware of, or where the exploitation goes undetected.
    These potential vulnerabilities leave a lot of space for actual exploitation when the stakes get higher, or we get a sudden jump in some area of capabilities, or when many coordinated exploits become more profitable than what a naive extrapolation would suggest.
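
As a toy illustration of the two fixing rules above, here is a minimal sketch. Everything in it is made up for illustration: the `Hole` and `Attacker` classes, the example numbers, and the `stakes` multiplier are assumptions, not a model of any real system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Hole:
    name: str
    skill_needed: float  # capability required to exploit the hole
    payoff: float        # value an attacker gains by exploiting it
    known: bool = True   # whether defenders are aware of the hole


@dataclass(frozen=True)
class Attacker:
    skill: float
    opportunity_cost: float  # what the attacker could earn elsewhere


def exploitable(hole, attacker):
    return attacker.skill >= hole.skill_needed


def worth_it(hole, attacker):
    return exploitable(hole, attacker) and hole.payoff > attacker.opportunity_cost


def fixed_naive(hole, attackers):
    # Naive rule: fixed iff somebody would be able to exploit it.
    return any(exploitable(hole, a) for a in attackers)


def fixed_realistic(hole, attackers):
    # Realistic rule: fixed only if defenders know about the hole
    # and exploiting it is worth some capable attacker's time.
    return hole.known and any(worth_it(hole, a) for a in attackers)


def open_holes(holes, attackers, rule):
    return [h.name for h in holes if not rule(h, attackers)]


def exposed_after(holes, old_attackers, new_attackers, stakes=1.0):
    # Holes left open under the realistic rule that become worth
    # exploiting once capabilities or stakes change.
    still_open = [h for h in holes if not fixed_realistic(h, old_attackers)]
    scaled = [replace(h, payoff=h.payoff * stakes) for h in still_open]
    return [h.name for h in scaled if any(worth_it(h, a) for a in new_attackers)]


holes = [
    Hole("bank_api", skill_needed=2, payoff=100),
    Hole("iot_camera", skill_needed=1, payoff=3),
    Hole("model_jailbreak", skill_needed=5, payoff=50),
    Hole("unknown_bug", skill_needed=1, payoff=80, known=False),
]
attackers = [Attacker(skill=3, opportunity_cost=10)]

print(open_holes(holes, attackers, fixed_naive))
print(open_holes(holes, attackers, fixed_realistic))
print(exposed_after(holes, attackers, [Attacker(skill=6, opportunity_cost=10)]))
print(exposed_after(holes, attackers, attackers, stakes=5.0))
```

In this toy setup, the naive rule leaves open only the hole that nobody can currently exploit, while the realistic rule also leaves open the low-payoff hole and the unknown one. A jump in attacker skill, or a rise in stakes, then turns some of those leftover holes into ones worth exploiting.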

There are several potential threats that have particularly interesting interactions with this setting:

  • (A) Alignment scheme failure: An alignment scheme that would otherwise work fails due to vulnerabilities in the AI company training it. This seems the closest to what this post describes?
  • (B) Easier AI takeover: Somebody builds a misaligned AI that would normally be sub-catastrophic, but all of these vulnerabilities allow it to take over.
  • (C) Capitalism gone wrong: The vulnerabilities regularly get exploited, in ways that either go undetected or cause negative externalities that nobody relevant has incentives to fix. And this destroys a large portion of the total value.
  • (D) Malicious actors: Bad actors use the vulnerabilities to cause damage. (And this makes B and C worse.)
  • (E) Great-power war: The vulnerabilities get exploited during a great-power war. (And this makes B and C worse.)

Connection to Cases 1-3: All of this seems closely related to your distinction between adversarial robustness getting solved before transformative AI, after TAI, or never. However, I would argue that TAI is not necessarily the relevant cutoff point here. Indeed, for Alignment failure (A) and Easier takeover (B), the relevant moment is "the first time we get an AI capable of forming a singleton". This might happen tomorrow, by the time we have automated 25% of economically-relevant tasks, half a year into having automated 100% of tasks, or possibly never. And for the remaining threat models (C, D, E), perhaps there are no single cutoff points, and instead the stakes and implications change gradually?

Implications: Personally, I am the most concerned about misaligned AI (A and B) and Capitalism gone wrong (C). However, perhaps risks from malicious actors and nation-state adversaries (D, E) are more salient and less controversial, while pointing towards the same issues? So perhaps advancing the agenda outlined in the post can be best done through focusing on these? [I would be curious to know your thoughts.]
