Vojtech Kovarik

My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering about "consequentionalist reasoning").

Wiki Contributions


Fun example: The evolution of offensive words seems relevant here. IE, we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgangers, and mark them as offensive as well.

E.g. Living in large groups such that it’s hard for a predator to focus on any particular individual; a zebra’s stripes.

Off-topic, but: Does anybody have a reference for this, or a better example? This is the first time I have heard this theory about zebras.

Two points that seem relevant here:

  1. To what extent are "things like LLMs" and "things like AutoGPT" very different creatures, with the latter sometimes behaving like a unitary agent?
  2. Assuming that the distinction in (1) matters, how often do we expect to see AutoGPT-like things?

(At the moment, both of these questions seem open.)

This made me think of "lawyer-speak", and other jargons.

More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)

I would like to point out one aspects of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect on adversarial vulnerability on widespread-automation worlds.

Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).

But ultimately, I think none of these fits perfectly. So a longer, self-contained description is something like:

  • Consider the world where we automate more and more things using AI systems that have vulnerable components. Perhaps those vulnerabilities primarily come from narrow-purpose neural networks and foundation models. But some might also come from insecure software design, software bugs, and humans in the loop.
  • And suppose some parts of the economy/society will be designed more securely (some banks, intelligence services, planes, hopefully nukes)...while others just have glaring security holes.
  • A naive expectation would be that a security hole gets fixed if and only if there is somebody who would be able to exploit it. This is overly optimistic, but note that even this implies the existence of many vulnerabilities that would require stronger-than-existing level of capability to exploit. More realistically, the actual bar for fixing security holes will be "there might be many people who can exploit this, but it is not worth their opportunity cost". And then we will also not-fix all the holes that we are unaware of, or where the exploitation goes undetected.
    These potential vulnerabilities leave a lot of space for actual exploitation when the stakes get higher, or we get a sudden jump in some area of capabilities, or when many coordinated exploits become more profitable than what a naive extrapolation would suggest.

There are several potential threats that have particularly interesting interactions with this setting:

  • (A) Alignment scheme failure: An alignment scheme that would otherwise work fails due to vulnerabilities in the AI company training it. This seems the closest to what this post describes?
  • (B) Easier AI takeover: Somebody builds a misaligned AI that would normally be sub-catastrophic, but all of these vulnerabilities allow it to take over.
  • (C) Capitalism gone wrong: The vulnerabilities regularly get exploited, in ways that either go undetected or cause negative externalities that nobody relevant has incentives to fix. And this destroys a large portion of the total value.
  • (D) Malicious actors: Bad actors use the vulnerabilities to cause damage. (And this makes B and C worse.)
  • (E) Great-power war: The vulnerabilities get exploited during a great-power war. (And this makes B and C worse.)

Connection to Cases 1-3: All of this seems very related to how you distinguish between adversarial robustness gets solved before tranformative AI/after TAI/never. However, I would argue that TAI is not necessarily the relevant cutoff point here. Indeed, for Alignment failure (A) and Easier takeover (B), the relevant moment is "the first time we get an AI capable of forming a singleton". This might happen tomorrow, by the time we have automated 25% of economically-relevant tasks, half a year into having automated 100% of tasks, or possibly never. And for the remaining threat models (C,D,E), perhaps there are no single cutoff points, and instead the stakes and implications change gradually?

Implications: Personally, I am the most concerned about misaligned AI (A and B) and Capitalism gone wrong (C). However, perhaps risks from malicious actors and nation-state adversaries (D, E) are more salient and less controversial, while pointing towards the same issues? So perhaps advancing the agenda outlined in the post can be best done through focusing on these? [I would be curious to know your thoughts.]

An idea for increasing the impact of this research: Mitigating the "goalpost moving" effect for "but surely a bit more progress on capabilities will solve this".

I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds --- but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough "coordination point" that would allow us to take needle-shifting actions.

I also suspect there are ways of making this go better. I am not quite sure what they are, but here are some ideas: Making and publishing surveys. Operationalizing all of this better, in particular with respect to the "how much does this actually matter?" aspect. Formulating some memorable "hypothesis" that makes it easier to refer to this in conversations and papers (cf "orthogonality thesis"). Perhaps making some proponents of "the opposing view" make some testable predictions, ideally some that can be tested with systems whose failures won't be catastrophic yet?

For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes.

Regarding the predictions, I want to make the following quibble: According to the definition above, one way of "solving" adversarial robustness is to make sure that nobody tries to catastrophically exploit the system in the first place. (In particular, exploitable AI that takes over the world is no longer exploitable.)

So, a lot with this definition rests on how do you distinguish between "cannot be exploited" and "will not be exploited".

And on reflection, I think that for some people, this is close to being a crux regarding the importance of all this research.

Yup, this is a very good illustration of the "talking past each other" that I think is happening with this line of research. (I mean with adversarial attacks on NNs in general, not just with Go in particular.) Let me try to hint at the two views that seem relevant here.

1) Hinting at the "curiosity at best" view: I agree that if you hotfix this one vulnerability, then it is possible we will never encounter another vulnerability in current Go systems. But this is because there aren't many incentives to go look for those vulnerabilities. (And it might even be that if Adam Gleave didn't focus his PhD on this general class of failures, we would never have encountered even this vulnerability.)

However, whether additional vulnerabilities exist seems like an entirely different question. Sure, there will only be finitely many vulnerabilities. But how confident are we that this cyclic-groups one is the last one? For example, I suspect that you might not be willing to give 1:1000 odds on whether we would encounter new vulnerabilities if we somehow spent 50 researcher-years on this.

But I expect that you might say that this does not matter, because vulnerabilities in Go do not matter much, and we can just keep hotfixing them as they come up?

2) And the other view seems to be something like: Yes, Go does not matter. But we were only using Go (and image classifiers, and virtual-environment football) to illustrate a general point, that these failures are an inherent part of deep learning systems. And for many applications, that is fine. But there will be applications where it is very much not fine (eg, aligning strong AIs, cyber-security, economy in the presence of malicious actors).

And at this point, some people might disagree and claim something like "this will go away with enough training". This seems fair, but I think that if you hold this view, you should make some testable predictions (and ideally ones that we can test prior to having superintelligent AI).

And, finally, I think that if you had this argument with people in 2015, many of them would have made predictions such as "these exploits work for image classifiers, but they won't work for multiagent RL". Or "this won't work for vastly superhuman Go".

Does this make sense? Assuming you still think this is just an academic curiosity, do you have some testable predictions for when/which systems will no longer have vulnerabilities like this? (Pref. something that takes fewer than 50 researcher years to test :D.)

My reaction to this is something like:

Academically, I find these results really impressive. But, uhm, I am not sure how much impact they will have? As in: it seems very unsurprising[1] that something like this is possible for Go. And, also unsurprisingly, something like this might be possible for anything that involves neural networks --- at least in some cases, and we don't have a good theory for when yes/no. But also, people seem to not care. So perhaps we should be asking something else? Like, why is that people don't care? Suppose you managed to demonstrate failures like this in settings X, Y, and Z --- would this change anything? And also, when do these failures actually matter? [Not saying they don't, just that we should think about it.]

To elaborate:

  • If you understand neural networks (and how Go algorithms use them), it should be obvious that these algorithms might in principle have various vulnerabilities. You might become more confident about this once you learn about adversarial examples for image classifiers or hear arguments like "feed-forward networks can't represent recursively-defined concepts". But in a sense, the possibility of vulnerabilities should seem likely to you just based on the fact that neural networks (unlike some other methods) come with no relevant worst-case performance guarantees. (And to be clear, I believe all of this indeed was obvious to the authors since AlphaGo came out.)
  • So if your application is safety-critical, security mindset dictates that you should not use an approach like this. (Though Go and many other domains aren't safety-critical, hence my question "when does this matter".)
  • Viewed from this perspective, the value added by the paper is not "Superhuman Go AIs have vulnerabilities" but "Remember those obviously-possible vulnerabilities? Yep, it is as we said, it is not too hard to find them".
  • Also, I (sadly) expect that reactions to this paper (and similar results) will mostly fall into one of the following two camps: (1) Well, duh! This was obvious. (2) [For people without the security mindset:] Well, probably you just missed this one thing with circular groups; hotfix that, and then there will be no more vulnerabilities. I would be hoping for reaction such as (3) [Oh, ok! So failures like this are probably possible for all neural networks. And no safety-critical system should rely on neural networks not having vulnerabilities, got it.] However, I mostly expect that anybody who doesn't already believe (1) and (3) will just react as (2).
  • And this motivates my point about "asking something else". EG, how do people who don't already believe (3) think about these things, and which arguments would they find persuasive? Is it efficient to just demonstrate as many of these failures as possible? Or are some failures more useful than others, or does this perhaps not help at all? Would it help with "signpost moving" if we first made some people commit to specific predictions (eg, "I believe scale will solve the general problem of robustness, and in particular I think AlphaZero has no such vulnerabilities").
  1. ^

    At least I remember thinking this when AlphaZero came out. (We did a small project in 2018 where we found a way to exploit AlphaZero in the tiny connect-four game, so this isn't just misremembering / hindsight bias.)

Another piece of related work: Simon Zhuang, Dylan Hadfield-Mennel: Consequences of Misaligned AI.
The authors assume a model where the state of the world is characterized by multiple "features". There are two key assumptions: (1) our utility is (strictly) increasing in each feature, so -- by definition -- features are things we care about (I imagine money, QUALYs, chocolate). (2) We have a limited budget, and any increase in any of the features always has a non-zero cost. The paper shows that: (A) if you are only allowed to tell your optimiser about a strict subset of the features, all of the non-specified features get thrown under the buss. (B) However, if you can optimise things gradually, then you can alternate which features you focus on, and somehow things will end up being pretty okay.


Personal note: Because of the assumption (2), I find the result (A) extremely unsurprising, and perhaps misleading. Yes, it is true that at the Pareto-frontier of resource allocation, there is no space for positive-sum interactions (ie, getting better on some axis must hurt us on some other axis). But the assumption (2) instead claims that positive-sum interactions are literally never possible. This is clearly untrue in the real-world, about things we care about.

That said, I find the result (B) quite interesting, and I don't mean to hate on the paper :-).

Load More