All of VojtaKovarik's Comments + Replies

(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)

It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?

My reaction to this is that: Actually, current LLMs do care about our preferences... (read more)

Nitpicky comment / edit request: The circle inversion figure was quite confusing to me. Perhaps add a note to it saying that solid green maps onto solid blue, red maps onto itself, and dotted green maps onto dotted blue. (Rather than colours mapping to each other, which is what I intuitively expected.)

Fun example: The evolution of offensive words seems relevant here. IE, we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgangers, and mark them as offensive as well.

E.g. Living in large groups such that it’s hard for a predator to focus on any particular individual; a zebra’s stripes.

Off-topic, but: Does anybody have a reference for this, or a better example? This is the first time I have heard this theory about zebras.

2 Lee Sharkey 4mo
Here is a reference that supports the claim using simulations. But I think you're right to flag it - other references don't really support it as the main reason for stripes.

Two points that seem relevant here:

  1. To what extent are "things like LLMs" and "things like AutoGPT" very different creatures, with the latter sometimes behaving like a unitary agent?
  2. Assuming that the distinction in (1) matters, how often do we expect to see AutoGPT-like things?

(At the moment, both of these questions seem open.)

This made me think of "lawyer-speak", and other jargons.

More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)

I would like to point out one aspect of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect of adversarial vulnerability on widespread-automation worlds.

Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).

But ultimately, I think none of these fits perfectly. So a longer, self-contained description is somethi... (read more)

An idea for increasing the impact of this research: Mitigating the "goalpost moving" effect for "but surely a bit more progress on capabilities will solve this".

I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds --- but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough "coordination point" that would allow us to take ne... (read more)

For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes.

Regarding the predictions, I want to make the following quibble: According to the definition above, one way of "solving" adversarial robustness is to make sure that nobody tries to catastrophically exploit the system in the first place. (In particular, exploitable AI that takes over the world is no longer exploitable.)

So, a lot with this definition rests on how you distinguish between "cannot be ... (read more)

Yup, this is a very good illustration of the "talking past each other" that I think is happening with this line of research. (I mean with adversarial attacks on NNs in general, not just with Go in particular.) Let me try to hint at the two views that seem relevant here.

1) Hinting at the "curiosity at best" view: I agree that if you hotfix this one vulnerability, then it is possible we will never encounter another vulnerability in current Go systems. But this is because there aren't many incentives to go look for those vulnerabilities. (And it might even be... (read more)

My reaction to this is something like:

Academically, I find these results really impressive. But, uhm, I am not sure how much impact they will have? As in: it seems very unsurprising[1] that something like this is possible for Go. And, also unsurprisingly, something like this might be possible for anything that involves neural networks --- at least in some cases, and we don't have a good theory for when yes/no. But also, people seem to not care. So perhaps we should be asking something else? Like, why is it that people don't care? Suppose you managed to d... (read more)

When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool's errand. Around half the people told me they thought it was extremely unlikely I'd find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I'd done a survey (even an informal one) before conducting this research to get a better sense of people'... (read more)

I actually do expect this to happen, and importantly I think this result is basically of academic interest, primarily because it is probably known why this adversarial attack can happen at all, and it's the large scale cycles of a game board. This is almost certainly going to be solved, due to new training, so I find it a curiosity at best.

Another piece of related work: Simon Zhuang, Dylan Hadfield-Menell: Consequences of Misaligned AI.
The authors assume a model where the state of the world is characterized by multiple "features". There are two key assumptions: (1) our utility is (strictly) increasing in each feature, so -- by definition -- features are things we care about (I imagine money, QALYs, chocolate). (2) We have a limited budget, and any increase in any of the features always has a non-zero cost. The paper shows that: (A) if you are only allowed to tell your optimiser about a stri... (read more)
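To make the setup above concrete, here is a toy sketch of it. It is not the paper's model or proof. The log utility, the unit cost per feature, the minimum level `floor`, and all the names are assumptions made up for illustration. It only shows what happens in this particular toy when an optimiser is told about a strict subset of the features and spends the whole budget on them.

```python
import math


def allocate(n_features, cared_about, budget, floor):
    """Spend the budget on the features the optimiser is told about.

    Assumptions (not from the paper): every unit of any feature costs 1,
    every feature must stay at or above `floor`, and the optimiser maximises
    the sum of log(feature) over `cared_about`. For that objective the
    optimum is an equal split among the cared-about features, with every
    other feature left at `floor`.
    """
    ignored = n_features - len(cared_about)
    share = (budget - ignored * floor) / len(cared_about)
    return [share if i in cared_about else floor for i in range(n_features)]


def true_utility(state):
    """Strictly increasing in every feature (assumption 1 in the comment)."""
    return sum(math.log(s) for s in state)


n, budget, floor = 4, 8.0, 0.1
full_state = allocate(n, {0, 1, 2, 3}, budget, floor)  # optimiser told about everything
proxy_state = allocate(n, {0, 1}, budget, floor)       # optimiser told about a strict subset

print("told about all features:", full_state, true_utility(full_state))
print("told about a subset:    ", proxy_state, true_utility(proxy_state))
```

In this toy, the features left out of the proxy end up at the minimum level, and the true utility is lower than under the full objective, even though both allocations use the same budget.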

Incidentally, I am trying to come up with a "better" model for "this stuff", one that would have predictive power over reality. (As opposed to starting out with a clear bottom line.) No solutions yet, but I do have some thoughts. If other people are also actively working on this, I would be happy to talk.

I might be interpreting things wrong, but it seems to me that the paper is doing things the wrong way around. That is, (it seems to me that) the paper sets out to prove that Goodhart's law is an issue and picks a setting where this will be the case --- as opposed to picking a setting, then investigating whether/when Goodhart's law is an issue.

By this, I don't mean to say that the paper is bad; it is good. I merely mean to say that we should view it as a nice metaphor that formalises some intuitions about Goodhart's law, rather than as a model that is "cau... (read more)


My key point: I strongly agree with (my perception of) the intuitions behind this post, but I wouldn't agree with the framing. Or with the title. So I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.

To illustrate on an example: Suppose I want to use a metal rod for a particular thingy in a car. And there is some safety standard for this, and the metal rod meets this standard.[1] And now suppose you instead have that same metal rod, except the safety standard does not exist. I expect most... (read more)

4 David Manheim 7mo
That seems great, I'd be very happy for someone to write this up more clearly. My key point was about people's claims and confidence about safety, and yes, clearly that was communicated less well than I hoped.

Adding the missing baseline scenario: There is a bunch of open-source versions of AutoGPT that have been explicitly tasked with destroying the world (for the lolz). One day, an employee of one of the leading companies plugs the latest predictive model into this (to serve as an argument on twitter for the position that AI risks are overblown). And then we all die.

This post seems related to an exaggerated version of what I believe: Humans are so far from "agents trying to maximize utility" that to understand how to align AI to humans, we should first understand what it means to align AI to finite-state machines. (Doesn't mean it's sufficient to understand the latter. Just that it's a prerequisite.)

As I wrote, going all the way to finite-state machines seems exaggerated, even as a starting point. However, it does seem to me that starting somewhere on that end of the agent<>rock spectrum is the better way to go about understanding human values :-). (At least given what we already know.)

Flagging confusion / potential disagreement: I think only predicting humans is neither sufficient nor necessary for the results to be aligned / helpful / not doom. Insufficient because if misaligned AGI is already in control, or likely going to be in control later, predicting arbitrary existing humans seems unsafe. [Edit: I think this is very non-obvious and needs further supporting arguments.] Not necessary because it should be fine to predict any known-to-be-safe process. (As long as you do this in a well-founded manner / not predicting itself.)

Not that I expect it to make much difference, but: Maybe it would be good if texts like these didn't make it into the training data of future models.

Reiterating two points people already pointed out, since they still aren't fixed after a month. Please, actually fix them, I think it is important. (Reasoning: I am somewhat on the fence on how much weight to assign to the simulator theory, I expect so are others. But as a mathematician, I would feel embarrassed to show this post to others and admit that I take it seriously, when it contains such egregious errors. No offense meant to the authors, just trying to point at this as an impact-limiting factor.)

Proposition 1: This is false, and the proof is wrong. F... (read more)

Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.

But suppose they only work kind-of-poorly - and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)

> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.

Either I misunderstand this or it seems incorrect. 

Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:

To get a more realistic assumption, perhaps we could want to talk about (speedup) "how much are AI assi... (read more)

Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is i, differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally I don't believe in this plan. I don't know how to link to the specific comment, but here somewhere. Also: Your pessimism about iii still seems a bit off to me. I agree that if you were coordinating well between all the actors then yeah you could just hold off on AI assistants. But the actual decision the OpenAI alignment team is facing could be more like "use LLMs to help with alignment research or get left behind when ML research gets automated". If facing such choices I might produce a plan like theirs, but notably I would be much more pessimistic about it. When the universe limits you to one option, you shouldn't expect it to be particularly good. The option "everybody agrees to not build AI assistants and we can do alignment research first" is maybe not on the table, or at least it probably doesn't feel like it is to the alignment team at OpenAI.

(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudos for that! The critical comments shouldn't be viewed as negative judgement on the people involved :-).)

My ~2-hour reaction to the challenge:[1]

(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their "alignment plan" makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making thi... (read more)

Either I misunderstand this or it seems incorrect.  It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the rate of Alignment : Capabilities work by some amount is sufficient. The use of AI assistants for alignment : capabilities doesn't have to track with the current rate of Alignment : Capabilities work. For instance, if the AI labs with the biggest lead are safety conscious, I expect the ratio of alignment : capabilities research they produce to be much higher (compared to now) right before AGI. See here.

To the extent that we are using the existence of humans with inner monologue as evidence for the suitability of this approach to AGI, I think it also serves as evidence for why that approach isn't really safe.

As an example: If I try, I can control my inner monologue to some extent. For example, I could make my inner monologue about "officially approved actions" very loud. And run a secondary thread of thinking about "non-approved thoughts" in the background --- "quieter" and in a more fragmented manner, and killing it off every time it made it into the inne... (read more)

One way in which the world seems brittle / having free energy AI could use to gain advantage: We haven't figured out good communication practices for the digital age. We don't have good collective epistemics. And we don't seem to be on track to have this solved in the next 20 years. As a result I expect that with enough compute and understanding of network science, and perhaps a couple more things, you could sabotage the whole civilization. ("Enough" is meant to stand for "a lot, but within reach of an early AGI". Heck, if Google somehow spent the next 5 years just on that, I would give them fair odds.)

Re sharp left turn: Maybe I misunderstand the "sharp left turn" term, but I thought this just means a sudden extreme gain in capabilities? If I am correct, then I expect you might get "sharp left turn" with a simulator during training --- eg, a user fine-tunes it on one additional dataset, and suddenly FOOOM. (Say, suddenly it can simulate agents that propose takeover plans that would actually work, when previously they failed at this with identical prompting.)

One implication I see is that if the simulator architecture becomes frequently used, it might ... (read more)

2 Victoria Krakovna 1y
I agree that a sudden gain in capabilities can make a simulated agent undergo a sharp left turn (coming up with more effective takeover plans is a great example). My original question was about whether the simulator itself could undergo a sharp left turn. My current understanding is that a pure simulator would not become misaligned if its capabilities suddenly increase because it remains myopic, so we only have to worry about a sharp left turn for simulated agents rather than the simulator itself. Of course, in practice, language models are often fine-tuned with RL, which creates agentic incentives on the simulator level as well.  You make a good point about the difficulty of identifying dangerous models if the danger is triggered by very specific prompts. I think this may go both ways though, by making it difficult for a simulated agent to execute a chain of dangerous behaviors, which could be interrupted by certain inputs from the user. 

Explanation for my strong downvote/disagreement:
Sure, in the ideal world, this post would have much better scholarship.

In the actual world, there are tradeoffs between the number of posts and the quality of scholarship. The cost is both the time and the fact that doing literature review is a chore. If you demand good scholarship, people will write slower/less. With some posts this is a good thing. With this post, I would rather have atrocious scholarship and 1% higher chance of the sequence having one more post in it. (Hypothetical example. I expect the real tradeoffs are less favourable.)

An attempted paraphrase, to hopefully-disentangle some claims:

Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) "outside of the Overton window, or something"[1].

Critch, preceding post: Strategies involving non-Overton elements are not worth it

Critch, this post: there are pivotal outcomes you can achieve via a strategy with no non-Overton elements

Eliezer, this comment: the "AI immune system" example is not an example of a strategy with no non-Overton elements

Possible reading: Critch/the reader/Eliezer currently wouldn't be able to name a strategy to... (read more)

(Not very sure I understood your description right, but here is my take:)

  • I think your proposal is not explaining some crucial steps, which are in fact hard. In particular, I understood it as "you have AI which can give you blueprints for nano sized machines". But I think we already have some blueprints, this isn't an issue. How we assemble them is an issue.
  • I expect that there will be more issues like this that you would find if you tried writing the plan in more detail.

However, I share the general sentiment behind your post --- I also don't understand... (read more)

1 Rob Bensinger 2y
I think it would be!
Personally, the author believes that SPI might “add up to normality” --- that it will be a sort of reformulation of existing (informal) approaches used by humans, with similar benefits and limitations.
I'm a bit confused by this claim. To me it's a bit unclear what you mean by "adding up to normality". (E.g.: Are you claiming that A) humans in current-day strategic interactions shouldn't change their behavior in response to learning about SPIs (because 1) they are already using them or 2) doing things that are somehow equivalent to them)? Or are you
... (read more)
My other complaint is that in some places you state some claim X in a way that (to me) suggests that you think that Tobi Baumann or Vince and I (or whoever else is talking/writing about surrogate goals/SPIs) have suggested that X is false, when really Tobi, Vince and I are very much aware of X and have (although perhaps to an insufficient extent) stated X.

Thank you for pointing that out. In all these cases, I actually know that you "stated X", so this is not an impression I wanted to create. I added a note at the beginning of the document to hopefully clarify this.

Perfect, that is indeed the difference. I agree with all of what you write here.

In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option -- once SPI is in use -- the counterfactual world where there is only demand armed just seems so different. Wouldn't history need to go very differently? Perhaps it wouldn't even be clear what "you" is in that world?)

But I agree that with SDA-AGIs,... (read more)

3 Rohin Shah 2y
Yeah, I agree with all of that.

That is -- I think* -- a correct way to parse it. But I don't think it is false... uhm, that makes me genuinely confused. Let me try to re-rephrase, see if it uncovers the crux :-).

You are in a world where most (1 - ε) of the bandits demand unarmed when paired with a caravan committed to [responding to demand unarmed the same as it responds to demand armed] (and they demand armed against caravans without such commitment). The bandit population (ie, their strategies) either remains the same (for simplicity) or the strategies that led to more profit increase in r... (read more)

5 Rohin Shah 2y
Maybe the difference is between "respond" and "treat". It seems like you are saying that the caravan commits to "[responding to demand unarmed the same as it responds to demand armed]" but doesn't commit to "[and I will choose my probability of resisting based on the equilibrium value in the world where you always demand armed if you demand at all]". In contrast, I include that as part of the commitment. When I say "[treating demand unarmed the same as it treats demand armed]" I mean "you first determine the correct policy to follow if there was only demand armed, then you follow that policy even if the bandits demand unarmed"; you cannot then later "instruct it to always resist" under this formulation (otherwise you have changed the policy and are not using SPI-as-I'm-thinking-of-it). I think the bandits should only accept commitments of the second type as a reason to demand unarmed, and not accept commitments of the first type, precisely because with commitments of the first type the bandits will be worse off if they demand unarmed than if they had demanded armed. That's why I'm primarily thinking of the second type.
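The distinction between the two commitment types can be sketched with a toy payoff model. This is only one reading of the exchange above, under made-up assumptions: the payoff numbers, the resist probability in the armed-only world, and the names `Payoffs` and `expected_payoffs` are all hypothetical, and resisting an unarmed demand is assumed to cost neither side anything.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Payoffs:
    """Hypothetical numbers, chosen only for illustration."""
    loot: float = 10.0               # moves from caravan to bandit if the caravan gives in
    bandit_fight_cost: float = 4.0   # bandit's loss when an armed demand is resisted
    caravan_fight_cost: float = 6.0  # caravan's loss when an armed demand is resisted


def expected_payoffs(demand: str, p_resist: float, pay: Payoffs) -> Tuple[float, float]:
    """Return expected (bandit, caravan) payoffs for one encounter.

    Assumption: only resisting an ARMED demand leads to a costly fight.
    """
    if demand not in ("armed", "unarmed"):
        raise ValueError(f"unknown demand type: {demand}")
    p_give_in = 1.0 - p_resist
    bandit = p_give_in * pay.loot
    caravan = -p_give_in * pay.loot
    if demand == "armed":
        bandit -= p_resist * pay.bandit_fight_cost
        caravan -= p_resist * pay.caravan_fight_cost
    return bandit, caravan


pay = Payoffs()
p_armed_only = 0.5  # assumed resist probability in the world with only "demand armed"

# Baseline: bandits demand armed, caravan uses the armed-only policy.
baseline = expected_payoffs("armed", p_armed_only, pay)

# Second type of commitment: keep the armed-only policy even against unarmed demands.
second_type = expected_payoffs("unarmed", p_armed_only, pay)

# First type: nothing fixes the resist probability, so the caravan can switch to
# "always resist" once demands are unarmed (resisting is then free for it).
first_type = expected_payoffs("unarmed", 1.0, pay)

print("baseline    (bandit, caravan):", baseline)
print("second type (bandit, caravan):", second_type)
print("first type  (bandit, caravan):", first_type)
```

With these numbers, the second type leaves both sides better off than the baseline, so bandits have a reason to demand unarmed. Under the first type, the bandit does worse by demanding unarmed than in the baseline, which is the reason given above for bandits not to accept that kind of commitment.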
I agree with (1) and (2), in the same way that I would agree that "one boxing will work in some settings and fail to work in others" and "whether you should one box depends on the larger context it appears in". I'd find it weird to call this an "objection" to one boxing though.

I agree that (1)+(2) isn't significant enough to qualify as "an objection". I think that (3)+(4)+(my interpretation of it? or something?) further make me believe something like (2') below. And that seems like an objection to me.

(2') Whether or not it works-as-intended depends on... (read more)

3 Rohin Shah 2y
Can you explain this? I'm currently parsing it as "If most of the bandits would demand unarmed conditional on the caravan having committed to treating demand unarmed as they would have treated demand armed, then you want to make the caravan always choose resist", but that seems false so I suspect I have misunderstood you.

I think I agree with your claims about committing, AI designs, and self-modifying into two-boxer being stupid. But I think we are using a different framing, or there is some misunderstanding about what my claim is. Let me try to rephrase it:

(1) I am not claiming that SPI will never work as intended (ie, get adopted, don't change players' strategies, don't change players' "meta strategies"). Rather, I am saying it will work in some settings and fail to work in others.

(2) Whether SPI works-as-intended depends on the larger context it appears in. (Some example... (read more)

3 Rohin Shah 2y
I agree with (1) and (2), in the same way that I would agree that "one boxing will work in some settings and fail to work in others" and "whether you should one box depends on the larger context it appears in". I'd find it weird to call this an "objection" to one boxing though. I don't really agree with the emphasis on formalization in (3), but I agree it makes sense to consider specific real-world situations. (4) is helpful, I wasn't thinking of this situation. (I was thinking about cases in which you have a single agent AGI with a DSA that is tasked with pursuing humanity's values, which then may have to acausally coordinate with other civilizations in the multiverse, since that's the situation in which I've seen surrogate goals proposed.) I think a weird aspect of the situation in your "evolutionary dynamics" scenario is that the bandits cannot distinguish between caravans that use SPI and ones that don't. I agree that if this is the case then you will often not see an advantage from SPI, because the bandits will need to take a strategy that is robust across whether or not the caravan is using SPI or not (and this is why you get outcompeted by those who use more aggressive AIs). However, I think that in the situation you describe in (4) it is totally possible for you to build a reputation as "that one institution (caravan) that uses SPI", allowing potential "bandits" to threaten you with "Nerf guns", while threatening other institutions with "real guns". In that case, the bandits' best response is to use Nerf guns against you and real guns against other institutions, and you will outcompete the other institutions. (Assuming the bandits still threaten everyone at a close-to-equal rate. You probably don't want to make it too easy for them to threaten you.) Overall I think if your point is "applying SPI in real-world scenarios is tricky" I am in full agreement, but I wouldn't call this an "objection".
Humans and human institutions can't easily make credible commitments.

That seems right. (Perhaps with the exception of legal contracts, unless one of the parties is powerful enough to make the contract difficult to enforce.) And even when individual people in an institution have powerful commitment mechanisms, this is not the same as the institution being able to credibly commit. For example, suppose you have a head of a state that threatens suicidal war unless X happens, and they are stubborn enough to follow through on it. Then if X happens, you might get a coup instead, thus avoiding the war.

It seems like your main objection arises because you view SPI as an agreement between the two players.

I would say that my main objection is that if you know that you will encounter SPI in situation X, you have an incentive to alter the policy that you will be using in X. Which might cause other agents to behave differently, possibly in ways that lead to the threat being carried out (which is precisely the thing that SPI aimed to avoid).

In the bandit case, suppose the caravan credibly commits to treating nerf guns identically to regular guns. And suppose t... (read more)

5 Rohin Shah 2y
The whole point is to commit not to do this (and then actually not do it)! It seems like you are treating "me" (the caravan) as unable to make commitments and always taking local improvements; I don't know why you would assume this. Yes, there are some AI designs which would self-modify to resist more (e.g. CDT with a certain logical time); seems like you should conclude those AI designs are bad. If you were going to use SPI you'd build AIs that don't do this self-modification once they've decided to use SPI. Do you also have this objection to one-boxing in Newcomb's problem? (Since once the boxes are filled you have an incentive to self-modify into a two-boxer?)

Thanks for pointing this out :-). Indeed, my original formulation is false; I agree with the "more likely to work if we formalise it" formulation.

5 Caspar Oesterheld 2y
Not very important, but: Despite having spent a lot of time on formalizing SPIs, I have some sympathy for a view like the following: > Yeah, surrogate goals / SPIs are great. But if we want AI to implement them, we should mainly work on solving foundational issues in decision and game theory with an aim toward AI. If we do this, then AI will implement SPIs (or something even better) regardless of how well we understand them. And if we don't solve these issues, then it's hopeless to add SPIs manually. Furthermore, believing that surrogate goals / SPIs work (or, rather, make a big difference for bargaining outcomes) shouldn't change our behavior much (for the reasons discussed in Vojta's post). On this view, it doesn't help substantially to understand / analyze SPIs formally. But I think there are sufficiently many gaps in this argument to make the analysis worthwhile. For example, I think it's plausible that the effective use of SPIs hinges on subtle aspects of the design of an agent that we might not think much about if we don't understand SPIs sufficiently well.

the tl;dr version of the full report:

The surrogate goals (SG) idea proposes that an agent might adopt a new seemingly meaningless goal (such as preventing the existence of a sphere of platinum with a diameter of exactly 42.82cm or really hating being shot by a water gun) to prevent the realization of threats against some goals they actually value (such as staying alive) [TB1, TB2]. If they can commit to treating threats to this goal as seriously as threats to their actual goals, the hope is that the new goal gets threatened instead. In particular, the purp... (read more)

As a concrete tool for discussing AI risk scenarios, I wonder if it doesn't have too many parameters? Like you have to specify the map (at least locally), how far we can see, what will the impact be of this and that research...

I agree that the model does have quite a few parameters. I think you can get some value out of it already by being aware of what the different parameters are and, in case of a disagreement, identifying the one you disagree about the most.

>If we currently have some AI system x, we can ask which systems are reachable from x
... (read more)

Thank you for the comment. As for the axes, the y-axis always denotes the desirability of the given value-system (except for Figure 1). And you are exactly right with the x-axis --- that is a multidimensional space of value-systems that we put on a line, because drawing this in 3D (well, (multi+1)-D :-) ) would be a mess. I will see if I can make it somewhat clearer in the post.

Much of the concern about AI systems is when they lack support for these kind of interventions, whether it be because they are too fast, too complex, or can outsmart the would-be intervening human trying to correct what they see as an error.

All of these possible causes for the lack of support are valid. I would like to add one more: when the humans that could provide this kind of support don't care about providing it or have incentives against providing it. For example, I could report a bug in some system, but this would cost me time and only benefit people I don't know, so I will happily ignore it :-).

As a game theorist, I completely endorse the proposed terminology. Just don't tell other game theorists... Sometimes, things get even worse when some people use the term "general sum games" to refer to games that are not constant-sum.

I like to imagine different games on a scale between completely adversarial and completely cooperative, with things in the middle being called "mixed-motive games".
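To make the scale a bit more concrete, here is a minimal sketch for two-player games given as payoff matrices. The function name `classify_game` and the three labels are my own illustrative choices, and the test is deliberately crude: it only checks whether the payoffs sum to a constant (completely adversarial) or differ by a constant (completely cooperative), and ignores rescaling of utilities.

```python
def classify_game(payoffs_a, payoffs_b, tol=1e-9):
    """Place a two-player game on the adversarial-to-cooperative scale.

    payoffs_a[i][j] and payoffs_b[i][j] are the payoffs to players A and B
    when A plays row i and B plays column j. Labels are illustrative only.
    """
    cells = [
        (a, b)
        for row_a, row_b in zip(payoffs_a, payoffs_b)
        for a, b in zip(row_a, row_b)
    ]
    sums = [a + b for a, b in cells]
    diffs = [a - b for a, b in cells]

    # Constant sum: whatever one player gains, the other loses.
    if max(sums) - min(sums) <= tol:
        return "completely adversarial (constant-sum)"
    # Constant difference: payoffs are identical up to an additive shift.
    if max(diffs) - min(diffs) <= tol:
        return "completely cooperative (common-payoff)"
    # Everything else sits somewhere in the middle.
    return "mixed-motive"


matching_pennies = ([[1, -1], [-1, 1]], [[-1, 1], [1, -1]])
coordination = ([[1, 0], [0, 1]], [[1, 0], [0, 1]])
prisoners_dilemma = ([[3, 0], [5, 1]], [[3, 5], [0, 1]])

for name, (a, b) in [
    ("matching pennies", matching_pennies),
    ("coordination", coordination),
    ("prisoner's dilemma", prisoners_dilemma),
]:
    print(name, "->", classify_game(a, b))
```

Under these assumptions, matching pennies lands at the adversarial end, a pure coordination game at the cooperative end, and the prisoner's dilemma in the middle.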

I am usually reasonably good at translating from math to non-abstract intuitive examples... but I didn't have much success here. Do you have an "in English, for simpletons" example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like "there are many hidden variables mediating the interactions between observables" :D.)

Otherwise, my current abstract interpretation of this is something like: "There are detailed models, and those might vary a lot. And then there are very abstract models, which will be ... (read more)

I recommend skipping to the next post. This post was kind of a stub, the next one explains the same idea better.

Even if you keep the argumentation phase asymmetric, you might want to make the answering phase simultaneous or at least allow the second AI to give the same answer as the first AI (which can mean a draw by default).

This doesn't make for a very good training signal, but might have better equilibria.

2 Rohin Shah 3y
Responded to this in my reply to Abram's comment.

I haven't yet thought about this in much detail, but here is what I have:

I will assume you can avoid getting "hacked" while overseeing the debate. If you don't assume that, then it might be important whether you can differentiate between arguments that are vs aren't relevant to the question at hand. (I suppose that it is much harder to get hacked when strictly sticking to a specific subject-matter topic. And harder yet if you are, e.g., restricted to answering in math proofs, which might be sufficient for some types of questions.)

As for the features of saf... (read more)

I agree with what Paul and Donald are saying, but the post was trying to make a different point.

Among various things needed to "make debate work", I see three separate sub-problems:

(A) Ensuring that "agents use words to get a human to select them as the winner; and that this is their only terminal goal" is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human's head to explode and their body falls on the reward button, this doesn't count.)

(B) Having already accomplished (A), ensu... (read more)

if you have 2 AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want

That is a good point, and this can indeed happen. If I believe something is a piece of chocolate while you - hating me - believe it is poison, we will happily coordinate towards me eating it. I was assuming that the AIs are copies of each other, which would eliminate most of these cases. (The remaining cases would be when the two AIs somehow diverge during the debate. I totally don't see how t... (read more)
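The chocolate/poison case can be written out as a small expected-utility calculation. This is only a sketch under made-up numbers: the utilities (+1 for chocolate, -10 for poison, 0 for not eating) and the function names are hypothetical and chosen purely for illustration.

```python
def eu_of_eating(p_poison, u_chocolate=1.0, u_poison=-10.0):
    """My expected utility of eating, given a belief that it is poison.

    Not eating is normalised to utility 0. All numbers are hypothetical.
    """
    return (1 - p_poison) * u_chocolate + p_poison * u_poison


def both_prefer_eating(my_p_poison, your_p_poison):
    """True if both of us prefer that I eat, each judging by our own beliefs."""
    mine = eu_of_eating(my_p_poison)
    # You have exactly the opposite utility function, so you negate my
    # utility, but you evaluate it under your own probabilities.
    yours = -eu_of_eating(your_p_poison)
    return mine > 0 and yours > 0


# I am almost sure it is chocolate, you are almost sure it is poison.
print(both_prefer_eating(my_p_poison=0.01, your_p_poison=0.99))

# With shared beliefs, our expected utilities are exact negations of each
# other, so we can never both strictly prefer the same action.
print(any(both_prefer_eating(p, p) for p in (0.0, 0.05, 0.5, 1.0)))
```

With diverging beliefs, both agents end up favouring the same action despite opposite utilities; with identical beliefs (as for copies of the same AI), this cannot happen in this toy setup.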

I think I understood the first three paragraphs. The AI "ramming a button to the human" clearly is a problem and an important one at that. However, I would say it is one that you already need to address in any single-agent scenario --- by preventing the AI from doing this (boxing), ensuring it doesn't want to do it (???), or by using AI that is incapable of doing it (weak ML system). As a result, I view this issue (even in this two-agent case) as orthogonal to debate. In the post, this is one of the things that hides under the phrase "assume, for the sake... (read more)
2 Donald Hobson 3y
The point of the last paragraph was that if you have 2 AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don't want. If the AI is in a perfect box, then no human hears its debate. If it's a sufficiently weak ML system, it won't do much of anything. For the ??? AI that doesn't want to get out, that would depend on how that worked. There might or might not be some system consisting of fairly weak ML and a fairly weak box that is safe and still useful. It might be possible to use debate safely, but it would be with agents carefully designed to be safe in a debate, not arbitrary optimisers. Also, the debaters had better be comparably smart.