(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
My reaction to this: Actually, current LLMs do care about our preferences... (read more)
Nitpicky comment / edit request: The circle inversion figure was quite confusing to me. Perhaps add a note to it saying that solid green maps onto solid blue, red maps onto itself, and dotted green maps onto dotted blue. (Rather than colours mapping to each other, which is what I intuitively expected.)
Fun example: The evolution of offensive words seems relevant here. I.e., we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgangers, and mark them as offensive as well.
E.g. Living in large groups such that it’s hard for a predator to focus on any particular individual; a zebra’s stripes.
Off-topic, but: Does anybody have a reference for this, or a better example? This is the first time I have heard this theory about zebras.
Two points that seem relevant here:
(At the moment, both of these questions seem open.)
This made me think of "lawyer-speak", and other jargons.
More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)
I would like to point out one aspect of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect of adversarial vulnerability in widespread-automation worlds.
Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).
But ultimately, I think none of these fits perfectly. So a longer, self-contained description is somethi... (read more)
An idea for increasing the impact of this research: Mitigating the "goalpost moving" effect for "but surely a bit more progress on capabilities will solve this".
I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds --- but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough "coordination point" that would allow us to take ne... (read more)
For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes.
Regarding the predictions, I want to make the following quibble: According to the definition above, one way of "solving" adversarial robustness is to make sure that nobody tries to catastrophically exploit the system in the first place. (In particular, exploitable AI that takes over the world is no longer exploitable.)
So, a lot with this definition rests on how you distinguish between "cannot be ... (read more)
Yup, this is a very good illustration of the "talking past each other" that I think is happening with this line of research. (I mean with adversarial attacks on NNs in general, not just with Go in particular.) Let me try to hint at the two views that seem relevant here.
1) Hinting at the "curiosity at best" view: I agree that if you hotfix this one vulnerability, then it is possible we will never encounter another vulnerability in current Go systems. But this is because there aren't many incentives to go look for those vulnerabilities. (And it might even be... (read more)
My reaction to this is something like:
Academically, I find these results really impressive. But, uhm, I am not sure how much impact they will have? As in: it seems very unsurprising that something like this is possible for Go. And, also unsurprisingly, something like this might be possible for anything that involves neural networks --- at least in some cases, and we don't have a good theory for when yes/no. But also, people seem to not care. So perhaps we should be asking something else? Like, why is it that people don't care? Suppose you managed to d... (read more)
When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool's errand. Around half the people told me they thought it was extremely unlikely I'd find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I'd done a survey (even an informal one) before conducting this research to get a better sense of people'... (read more)
Another piece of related work: Simon Zhuang, Dylan Hadfield-Menell: Consequences of Misaligned AI. The authors assume a model where the state of the world is characterized by multiple "features". There are two key assumptions: (1) our utility is (strictly) increasing in each feature, so -- by definition -- features are things we care about (I imagine money, QALYs, chocolate). (2) We have a limited budget, and any increase in any of the features always has a non-zero cost. The paper shows that: (A) if you are only allowed to tell your optimiser about a stri... (read more)
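To make the flavour of this concrete, here is a minimal numerical sketch in Python (my own illustrative numbers and functional forms, not the paper's model): a proxy optimiser that only sees a strict subset of the features reallocates a fixed resource pool towards them, drives the rest to zero, and can thereby push the true utility below its starting value.

```python
import numpy as np

# Toy version of the "features + limited budget" setup; illustrative only.
n_features = 4                    # things we actually care about
told_about = [0, 1]               # the strict subset the optimiser is told about
total_resource = 12.0             # limited budget: features compete for resources
initial = np.full(n_features, total_resource / n_features)

def true_utility(x):
    # Strictly increasing in every feature (assumption (1)); log is just an example.
    return np.log(1 + x).sum()

# The proxy optimiser maximises only the visible features, so it reallocates
# the whole resource pool to them and drives the omitted features to zero.
optimised = np.zeros(n_features)
optimised[told_about] = total_resource / len(told_about)

print("initial state:", initial, "-> true utility", round(true_utility(initial), 3))
print("proxy optimum:", optimised, "-> true utility", round(true_utility(optimised), 3))
# With these numbers the proxy-optimal state is strictly worse under the true
# utility, matching the qualitative conclusion about omitted features.
```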
Incidentally, I am trying to come up with a "better" model for "this stuff", one that would have predictive power over reality. (As opposed to starting out with a clear bottom line.) No solutions yet, but I do have some thoughts. If other people are also actively working on this, I would be happy to talk.
I might be interpreting things wrong, but it seems to me that the paper is doing things the wrong way around. That is, (it seems to me that) the paper sets out to prove that Goodhart's law is an issue and picks a setting where this will be the case --- as opposed to picking a setting, then investigating whether/when Goodhart's law is an issue.
By this, I don't mean to say that the paper is bad; it is good. I merely mean to say that we should view it as a nice metaphor that formalises some intuitions about Goodhart's law, rather than as a model that is "cau... (read more)
My key point: I strongly agree with (my perception of) the intuitions behind this post, but I wouldn't agree with the framing. Or with the title. So I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.
To illustrate with an example: Suppose I want to use a metal rod for a particular thingy in a car. And there is some safety standard for this, and the metal rod meets this standard. And now suppose you instead have that same metal rod, except the safety standard does not exist. I expect most... (read more)
Adding the missing baseline scenario:
There are a bunch of open-source versions of AutoGPT that have been explicitly tasked with destroying the world (for the lolz). One day, an employee of one of the leading companies plugs the latest predictive model into this (to serve as an argument on Twitter for the position that AI risks are overblown). And then we all die.
This post seems related to an exaggerated version of what I believe: Humans are so far from "agents trying to maximize utility" that to understand how to align AI to humans, we should first understand what it means to align AI to finite-state machines. (Doesn't mean it's sufficient to understand the latter. Just that it's a prerequisite.)
As I wrote, going all the way to finite-state machines seems exaggerated, even as a starting point. However, it does seem to me that starting somewhere on that end of the agent<>rock spectrum is the better way to go about understanding human values :-). (At least given what we already know.)
Flagging confusion / potential disagreement:
I think only predicting humans is neither sufficient nor necessary for the results to be aligned / helpful / not doom. Insufficient because if misaligned AGI is already in control, or likely going to be in control later, predicting arbitrary existing humans seems unsafe. [Edit: I think this is very non-obvious and needs further supporting arguments.] Not necessary because it should be fine to predict any known-to-be-safe process. (As long as you do this in a well-founded manner / not predicting itself.)
Not that I expect it to make much difference, but: Maybe it would be good if texts like these didn't make it into the training data of future models.
Reiterating two points people already pointed out, since they still aren't fixed after a month. Please, actually fix them, I think it is important. (Reasoning: I am somewhat on the fence about how much weight to assign to the simulator theory, and I expect so are others. But as a mathematician, I would feel embarrassed to show this post to others and admit that I take it seriously when it contains such egregious errors. No offense meant to the authors, just trying to point at this as an impact-limiting factor.)
Proposition 1: This is false, and the proof is wrong. F... (read more)
Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.
But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)
> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.
Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:
To get a more realistic assumption, perhaps we could want to talk about (speedup) "how much are AI assi... (read more)
(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudos for that! The critical comments shouldn't be viewed as negative judgement on the people involved :-).)
My ~2-hour reaction to the challenge:
(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan? Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their "alignment plan" makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making thi... (read more)
To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn't really safe.
As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about "officially approved actions" very loud. And run a secondary thread of thinking about "non-approved thoughts" in the background --- "quieter" and in a more fragmented manner, and killing it off every time it made it into the inne... (read more)
One way in which the world seems brittle / having free energy AI could use to gain advantage:
We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we don't seem to be on track to have this solved in the next 20 years. As a result, I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)
Re sharp left turn: Maybe I misunderstand the "sharp left turn" term, but I thought this just means a sudden extreme gain in capabilities? If I am correct, then I expect you might get a "sharp left turn" with a simulator during training --- e.g., a user fine-tunes it on one additional dataset, and suddenly FOOOM. (Say, suddenly it can simulate agents that propose takeover plans that would actually work, when previously they failed at this with identical prompting.)
One implication I see is that it if the simulator architecture becomes frequently used, it might ... (read more)
Explanation for my strong downvote/disagreement: Sure, in the ideal world, this post would have much better scholarship.
In the actual world, there are tradeoffs between the number of posts and the quality of scholarship. The cost is both the time and the fact that doing literature review is a chore. If you demand good scholarship, people will write slower/less. With some posts this is a good thing. With this post, I would rather have atrocious scholarship and a 1% higher chance of the sequence having one more post in it. (Hypothetical example. I expect the real tradeoffs are less favourable.)
An attempted paraphrase, to hopefully-disentangle some claims:
Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) "outside of the Overton window, or something".
Critch, preceding post: Strategies involving non-Overton elements are not worth it
Critch, this post: there are pivotal outcomes you can achieve via a strategy with no non-Overton elements
Eliezer, this comment: the "AI immune system" example is not an example of a strategy with no non-Overton elements
Possible reading: Critch/the reader/Eliezer currently wouldn't be able to name a strategy to... (read more)
(Not very sure I understood your description right, but here is my take:)
However, I share the general sentiment behind your post --- I also don't understand... (read more)
Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.
I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you
My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X.
Thank you for pointing that out. In all these cases, I actually know that you "stated X", so this is not an impression I wanted to create. I added a note at the beginning of the document to hopefully clarify this.
Perfect, that is indeed the difference. I agree with all of what you write here.
In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option -- once SPI is in use -- the counterfactual world where there is only demand armed just seems so different. Wouldn't history need to go very differently? Perhaps it wouldn't even be clear what "you" is in that world?)
But I agree that with SDA-AGIs,... (read more)
That is -- I think* -- a correct way to parse it. But I don't think it is false... uhm, that makes me genuinely confused. Let me try to re-rephrase, see if it uncovers the crux :-).
You are in a world where most (1 - ε) of the bandits demand unarmed when paired with a caravan committed to [responding to demand unarmed the same as it responds to demand armed] (and they demand armed against caravans without such commitment). The bandit population (i.e., their strategies) either remains the same (for simplicity) or the strategies that led to more profit increase in r... (read more)
I agree with (1) and (2), in the same way that I would agree that "one boxing will work in some settings and fail to work in others" and "whether you should one box depends on the larger context it appears in". I'd find it weird to call this an "objection" to one boxing though.
I agree that (1)+(2) isn't significant enough to qualify as "an objection". I think that (3)+(4)+(my interpretation of it? or something?) further make me believe something like (2') below. And that seems like an objection to me.
(2') Whether or not it works-as-intended depends on... (read more)
I think I agree with your claims about committing, AI designs, and self-modifying into a two-boxer being stupid. But I think we are using a different framing, or there is some misunderstanding about what my claim is. Let me try to rephrase it:
(1) I am not claiming that SPI will never work as intended (i.e., get adopted, not change players' strategies, not change players' "meta strategies"). Rather, I am saying it will work in some settings and fail to work in others.
(2) Whether SPI works-as-intended depends on the larger context it appears in. (Some example... (read more)
Humans and human institutions can't easily make credible commitments.
That seems right. (Perhaps with the exception of legal contracts, unless one of the parties is powerful enough to make the contract difficult to enforce.) And even when individual people in an institution have powerful commitment mechanisms, this is not the same as the institution being able to credibly commit. For example, suppose you have a head of state that threatens suicidal war unless X happens, and they are stubborn enough to follow through on it. Then if X happens, you might get a coup instead, thus avoiding the war.
It seems like your main objection arises because you view SPI as an agreement between the two players.
I would say that my main objection is that if you know that you will encounter SPI in situation X, you have an incentive to alter the policy that you will be using in X. Which might cause other agents to behave differently, possibly in ways that lead to the threat being carried out (which is precisely the thing that SPI aimed to avoid).
In the bandit case, suppose the caravan credibly commits to treating nerf guns identically to regular runs. And suppose t... (read more)
Thanks for pointing this out :-). Indeed, my original formulation is false; I agree with the "more likely to work if we formalise it" formulation.
the tl;dr version of the full report:
The surrogate goals (SG) idea proposes that an agent might adopt a new seemingly meaningless goal (such as preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm or really hating being shot by a water gun) to prevent the realization of threats against some goals they actually value (such as staying alive) [TB1, TB2]. If they can commit to treating threats to this goal as seriously as threats to their actual goals, the hope is that the new goal gets threatened instead. In particular, the purp... (read more)
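To make the mechanism concrete, a toy bit of arithmetic with made-up numbers (mine, not the report's): suppose a carried-out threat against the real goal costs the target 10, the target commits to reacting to the surrogate threat (the water gun) exactly as it would to the real one, and either threat would end up being carried out with the same probability p.

```latex
% Toy surrogate-goal arithmetic; illustrative numbers, not from the report.
% Both threats extract the same concessions, because the target's committed
% response is identical; they differ only in the damage done if carried out.
\[
\mathbb{E}[\text{damage} \mid \text{real threat}] = 10\,p,
\qquad
\mathbb{E}[\text{damage} \mid \text{water-gun threat}] \approx 0 .
\]
```

So if the threatener is at least indifferent between the two (equally effective) threats and switches to the surrogate, the expected harm from negotiations breaking down goes down, which is the intended benefit.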
As a concrete tool for discussing AI risk scenarios, I wonder if it doesn't have too many parameters? Like you have to specify the map (at least locally), how far we can see, what will the impact be of this and that research...
I agree that the model does have quite a few parameters. I think you can get some value out of it already by being aware of what the different parameters are and, in case of a disagreement, identifying the one you disagree about the most.
> If we currently have some AI system x, we can ask which systems are reachable from x
Thank you for the comment. As for the axes, the y-axis always denotes the desirability of the given value-system (except for Figure 1). And you are exactly right with the x-axis --- that is a multidimensional space of value-systems that we put on a line, because drawing this in 3D (well, (multi+1)-D :-) ) would be a mess. I will see if I can make it somewhat clearer in the post.
Much of the concern about AI systems is when they lack support for these kind of interventions, whether it be because they are too fast, too complex, or can outsmart the would-be intervening human trying to correct what they see as an error.
All of these possible causes for the lack of support are valid. I would like to add one more: when the humans that could provide this kind of support don't care about providing it or have incentives against providing it. For example, I could report a bug in some system, but this would cost me time and only benefit people I don't know, so I will happily ignore it :-).
As a game theorist, I completely endorse the proposed terminology. Just don't tell other game theorists... Sometimes, things get even worse when some people use the term "general sum games" to refer to games that are not constant-sum.
I like to imagine different games on a scale between completely adversarial and completely cooperative. With things in the middle being called "mixed-motive games".
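For concreteness, here are three toy 2x2 games along that scale, with made-up payoffs written as (row player, column player):

```latex
% Illustrative payoff matrices, one per point on the adversarial--cooperative scale.

% Completely adversarial (constant-sum), e.g. matching pennies:
\[
\begin{pmatrix} (1,-1) & (-1,1) \\ (-1,1) & (1,-1) \end{pmatrix}
\]

% Mixed-motive (general-sum), e.g. a Prisoner's-Dilemma-style game:
\[
\begin{pmatrix} (3,3) & (0,4) \\ (4,0) & (1,1) \end{pmatrix}
\]

% Completely cooperative (identical payoffs), e.g. a coordination game:
\[
\begin{pmatrix} (2,2) & (0,0) \\ (0,0) & (1,1) \end{pmatrix}
\]
```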
I am usually reasonably good at translating from math to non-abstract intuitive examples... but I didn't have much success here. Do you have an "in English, for simpletons" example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like "there are many hidden variables mediating the interactions between observables" :D.)
Otherwise, my current abstract interpretation of this is something like: "There are detailed models, and those might vary a lot. And then there are very abstract models, which will be ... (read more)
Even if you keep the argumentation phase asymmetric, you might want to make the answering phase simultaneous or at least allow the second AI to give the same answer as the first AI (which can mean a draw by default).
This doesn't make for a very good training signal, but might have better equilibria.
I haven't yet thought about this in much detail, but here is what I have:
I will assume you can avoid getting "hacked" while overseeing the debate. If you don't assume that, then it might be important whether you can differentiate between arguments that are vs aren't relevant to the question at hand. (I suppose that it is much harder to get hacked when strictly sticking to a specific subject-matter topic. And harder yet if you are, e.g., restricted to answering in math proofs, which might be sufficient for some types of questions.)
As for the features of saf... (read more)
I agree with what Paul and Donald are saying, but the post was trying to make a different point.
Among various things needed to "make debate work", I see three separate sub-problems:
(A) Ensuring that "agents use words to get a human to select them as the winner; and that this is their only terminal goal" is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human's head to explode and their body falls on the reward button, this doesn't count.)
(B) Having already accomplished (A), ensu
if you have 2 AI's that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want
That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you - hating me - believe it is poison, we will happily coordinate towards me eating it.
I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don't see how t
I think I understood the first three paragraphs. The AI "ramming a button to the human" clearly is a problem and an important one at that. However, I would say it is one that you already need to address in any single-agent scenario --- by preventing the AI from doing this (boxing), ensuring it doesn't want to do it (???), or by using AI that is incapable of doing it (weak ML system). As a result, I view this issue (even in this two-agent case) as orthogonal to debate. In the post, this is one of the things that hides under the phrase "assume, for the sake