Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.

Rohin Shah's Comments

Wireheading and discontinuity

Unfortunately, discontinuities are common in any real system, e.g. much of robotics is in figuring out how to deal with contact forces (e.g. when picking up objects) because of the discontinuities that arise. (A term to look for is "hybrid systems".)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

I don't think "soft/slow takeoff" has a canonical meaning -- some people (e.g. Paul) interpret it as not having discontinuities, while others interpret it as capabilities increasing slowly past human intelligence over (say) centuries (e.g. Superintelligence). If I say "slow takeoff" I don't know which one the listener is going to hear it as. (And if I had to guess, I'd expect they think about the centuries-long version, which is usually not the one I mean.)

In contrast, I think "AI risk" has a much more canonical meaning, in that if I say "AI risk" I expect most listeners to interpret it as accidental risk caused by the AI system optimizing for goals that are not our own.

(Perhaps an important point is that I'm trying to communicate to a much wider audience than the people who read all the Alignment Forum posts and comments. I'd feel more okay about "slow takeoff" if I was just speaking to people who have read many of the posts already arguing about takeoff speeds.)

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
I ask because you're one of the most prolific participants here but don't fall into one of the existing "camps" on AI risk for whom I already have good models.

Seems right, I think my opinions fall closest to Paul's, though it's also hard for me to tell what Paul's opinions are. I think this older thread is a relatively good summary of the considerations I tend to think about, though I'd place different emphases now. (Sadly I don't have the time to write a proper post about what I think about AI strategy -- it's a pretty big topic.)

The current situation seems to be that we have two good (relatively clear) terms "technical accidental AI risk" and "AI-caused x-risk" and the dispute is over what plain "AI risk" should be shorthand for. Does that seem fair?

Yes, though I would frame it as "the ~5 people reading these comments have two clear terms, while everyone else uses a confusing mishmash of terms". The hard part is in getting everyone else to use the terms. I am generally skeptical of deciding on definitions and getting everyone else to use them, and usually try to use terms the way other people use terms.

In other words I don't think this is strong evidence that all 4 people would endorse defining "AI risk" as "technical accidental AI risk". It also seems notable that I've been using "AI risk" in a broad sense for a while and no one has objected to that usage until now.

Agreed with this, but see above about trying to conform with the way terms are used, rather than defining terms and trying to drag everyone else along.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
It seems worth clarifying that you're only optimistic about certain types of AI safety problems.

Tbc, I'm optimistic about all the types of AI safety problems that people have proposed, including the philosophical ones. When I said "all else equal those seem more likely to me", I meant that if all the other facts about the matter are the same, but one risk affects only future people and not current people, that risk would seem more likely to me because people would care less about it. But I am optimistic about the actual risks that you and others argue for.

That said, over the last week I have become less optimistic specifically about overcoming race dynamics, mostly from talking to people at FHI / GovAI. I'm not sure how much to update though. (Still broadly optimistic.)

it seems that when you wrote the title of this newsletter "Why AI risk might be solved without additional intervention from longtermists" you must have meant "Why some forms of AI risk ...", or perhaps certain forms of AI risk just didn't come to your mind at that time.

It's notable that AI Impacts asked for people who were skeptical of AI risk (or something along those lines), and to my eye it looks like all four of the people in the newsletter independently interpreted that as accidental technical AI risk in which the AI is adversarially optimizing against you (or at least that's what they argued against). This seems like pretty strong evidence that when people hear "AI risk" they now think of technical accidental AI risk, regardless of what the historical definition may have been. That is certainly my default assumption when someone (other than you) says "AI risk".

I would certainly support having clearer definitions and terminology if we could all agree on them.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
To the extent that we expect strong warning shots and ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researchers/longtermists to work on / advocate for safety problems beyond the standard of "AGI is not trying to deceive us or work against us" (because that standard will likely be reached anyway). Do you agree?


Some types of AI x-risk don't affect everyone though (e.g., ones that reduce the long term value of the universe or multiverse without killing everyone in the near term).

Agreed, all else equal those seem more likely to me.

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
What do you expect the ML community to do at that point?

It depends a lot on the particular warning shot that we get. But on the strong versions of warning shots, where there's common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)

This depends on other background factors, e.g. how much the various actors think they are value-aligned vs. in zero-sum competition. I currently think the ML community thinks they are mostly but not fully value-aligned, and they will influence companies and governments in that direction. (I also want more longtermists to be trying to build more common knowledge of how much humans are value aligned, to make this more likely.)

I worry about a parallel with the "energy community"

The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone. (Also, maybe we haven't gotten clear and obvious warning shots?)

(compared to which, the disasters that will have occurred by then may well seem tolerable by comparison), and given probable disagreements between different experts about how serious the future risks are

I agree that my story requires common knowledge of the risk of building AGI, in the sense that you need people to predict "running this code might lead to all humans dying", and not "running this code might lead to <warning shot effect>". You also need relative agreement on the risks.

I think this is pretty achievable. Most of the ML community already agrees that building an AGI is high-risk if not done with some argument for safety. The thing people tend to disagree on is when we will get AGI and how much we should work on safety before then.

Appendix: how a subagent could get powerful

Flo's summary for the Alignment Newsletter:

This post argues that regularizing an agent's impact by <@attainable utility@>(@Towards a New Impact Measure@) can fail when the agent is able to construct subagents. Attainable utility regularization uses auxiliary rewards and penalizes the agent for changing its ability to get high expected auxiliary rewards, in order to restrict the agent's power-seeking. More specifically, the penalty for an action is the absolute difference in expected cumulative auxiliary reward between the agent either doing the action or nothing for one time step and then optimizing for the auxiliary reward.
This can be circumvented in some cases: If the auxiliary reward does not benefit from two agents instead of one optimizing it, the agent can just build a copy of itself that does not have the penalty, as doing this does not change the agent's ability to get a high auxiliary reward. For more general auxiliary rewards, an agent could build another more powerful agent, as long as the powerful agent commits to balancing out the ensuing changes in the original agent's attainable auxiliary rewards.
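Written out, the penalty described above would be (this is my transcription of the verbal description into notation; the symbols are not taken from the post itself):

```latex
\text{Penalty}(s, a) \;=\; \sum_{i=1}^{n} \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|
```

where $\varnothing$ is the one-step no-op action and $Q_{R_i}(s, a)$ is the expected cumulative auxiliary reward $R_i$ obtained by taking action $a$ in state $s$ and then optimizing for $R_i$.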

Flo's opinion:

I am confused about how much the commitment to balance out the original agent's attainable utility would constrain the powerful subagent. Also, in the presence of subagents, it seems plausible that attainable utility mostly depends on the agent's ability to produce subagents of different generality with different goals: If a subagent that optimizes for a single auxiliary reward was easier to build than a more general one, building a general powerful agent could considerably decrease attainable utility for all auxiliary rewards, such that the high penalty rules out this action.
Writeup: Progress on AI Safety via Debate

Planned summary for the Alignment Newsletter:

This post reports on work done on creating a <@debate@>(@AI safety via debate@) setup that works well with human players. In the game, one player is honest (i.e. arguing for the correct answer) and one is malicious (i.e. arguing for some worse answer), and they play a debate in some format, after which a judge must decide which player won the debate. They are using Thinking Physics questions for these debates, because they involve questions with clear answers that are confusing to most people (the judges) but easy for some experts (the players).
Early freeform text debates did not work very well, even with smart, motivated judges. The malicious player could deflect on questions they didn't want to answer, e.g. by claiming that the question was ambiguous and redirecting attention by asking new questions. In addition, when the malicious player got to go first and give an incorrect "framework" for finding the answer, and then made individually true claims to "fill in" the framework, it was hard for the honest player to rebut it. So, they moved to a framework without such asymmetries: both players gave a claim (simultaneously), both gave constructive arguments, and both rebutted the other's arguments. In addition, part of the appeal of debate is that the agents can "zoom in" on the particular disagreement they have, and the judge need only analyze the smallest disagreement in order to declare an overall winner.
This suggests the following setup: players simultaneously provide an answer supported with subclaims. Then, after looking at the other player's answer and subclaims, they can provide objections (perhaps on particular subclaims, or on the overall structure of the argument). Players can concede or challenge objections left by the other player; the debate then recurses on one of the challenged objections (currently, the malicious player chooses which claim to recurse on, in order to be conservative). There is a depth limit to the recursion; once it is reached (or if there are no challenged objections), the judge is given a transcript of the debate and must declare a winner.
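The setup just described can be sketched as a toy control loop (all of the interfaces here are invented for illustration; the actual game is played by human debaters and judges):

```python
def run_debate(question, answer_fn, object_fn, pick_fn, judge_fn, depth_limit=3):
    """Toy version of the recursive debate structure.

    answer_fn(player, question) -> (answer, subclaims)
    object_fn(player, opponent_subclaims) -> list of objections
    pick_fn(all_objections) -> objection to recurse on, or None (all conceded)
    judge_fn(transcript) -> "honest" or "malicious"
    """
    def other(p):
        return "malicious" if p == "honest" else "honest"

    transcript = []
    for _ in range(depth_limit):
        # Both players simultaneously give an answer supported by subclaims.
        round_ = {p: answer_fn(p, question) for p in ("honest", "malicious")}
        # Each player may object to the other's subclaims.
        objections = {p: object_fn(p, round_[other(p)][1]) for p in round_}
        transcript.append((question, round_, objections))
        # The malicious player picks which challenged objection to recurse on.
        question = pick_fn(objections["honest"] + objections["malicious"])
        if question is None:  # no challenged objections: go straight to judging
            break
    return judge_fn(transcript)
```

The depth limit and the "no challenged objections" early exit mirror the two ways the real debate terminates before the judge sees the transcript.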
A new problem arises: the malicious player can exploit ambiguity to their advantage. For example, suppose the malicious player is arguing that there are 5 primes between 0 and 10, and makes the subclaims "there are 3 primes between 0 and 5" and "there are 2 primes between 5 and 10". If challenged on the first, they will say "the three primes are 2, 3, and 5", if challenged on the second, they will say "the two primes are 5 and 7". While in this simple case the honest player could register a top-level objection claiming that the word "between" is ambiguous, and ask for clarification, in general it would be too hard to eliminate all ambiguity.
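The double-counting in this example can be made concrete. In the sketch below (the function and its interface are my own illustration, not from the post), each malicious subclaim silently reads "between" inclusively, so the shared endpoint 5 gets counted twice:

```python
def primes_between(lo, hi, inclusive=True):
    """Count primes in [lo, hi] if inclusive, else in the open interval (lo, hi)."""
    def is_prime(n):
        return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    candidates = range(lo, hi + 1) if inclusive else range(lo + 1, hi)
    return sum(is_prime(n) for n in candidates)

# Inclusive subclaims: primes_between(0, 5) counts {2, 3, 5} and
# primes_between(5, 10) counts {5, 7}, summing to 5 -- but the honest
# total is primes_between(0, 10) == 4, because 5 was counted twice.
```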
To mitigate this, the team added a method of _cross examination_, in which each player is allowed to ask an earlier version of the other agent a question, and quote its answer as part of the debate transcript. (The post talks about how this can be approximated with humans, but I won't go into the details.) In the complexity theory analogy, this boosts debate from PSPACE to NEXP. Intuitively, cross examination allows players to force each other to precommit to specific claims and arguments, because if they don't stay consistent in their claims and arguments, then their opponent can demonstrate the inconsistency. Now, if the malicious player tries to make an argument that depends on interpreting an ambiguous concept in two different ways, then the honest player can cross examine and ask which of the two interpretations they mean. If they are inconsistent, that can be demonstrated to the judge; if they consistently answer one way, then the honest player can challenge the part of the argument that depends on the other interpretation.
They then identify several open concerns with debate, of which they highlight the long computation problem. This is a problem when you no longer assume that the debaters have optimal play: in this case, the malicious player could create a complicated argument that neither debater understands well, that supports the malicious case but that the honest player doesn't know how to refute.

Planned opinion:

I enjoyed this a lot: the problems found were crisp and the solutions had good arguments that they actually solved the identified problem. Reading through the actual examples and arguments made me more optimistic about debate in general, mostly from a felt sense that the actual concrete results were getting closer to matching the theoretical ideal, and that there actually could be reasonable solutions to "messy" problems like ambiguity.
The full post has formal explanations and actual examples, which I highly recommend.
Writeup: Progress on AI Safety via Debate
We also rely on the property that, because the honest and dishonest debaters are copies of each other, they know everything the other knows.

I don't see why being copies of each other implies that they know everything the other knows: they could (rationally) spend their computation on understanding the details of their position, rather than their opponent's position.

For example, if I was playing against a copy of myself, where we were both given separate puzzles and had to solve them faster than the other, I would spend all my time focusing on my puzzle, and wouldn't know the things that my copy knows about his puzzle (even if we have the same information, i.e. both of us can see both puzzles and even each other's inner thought process).

Preface to EAF's Research Agenda on Cooperation, Conflict, and TAI
It seems to me that at least some parts of this research agenda are relevant for some special cases of "the failure mode of an amoral AI system that doesn't care about you".

I still wouldn't recommend working on those parts, because they seem decidedly less impactful than other options. But as written it does sound like I'm claiming that the agenda is totally useless for anything besides s-risks, which I certainly don't believe. I've changed that second paragraph to:

However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other technical safety research to be more impactful, because other approaches can more directly target the failure mode of an amoral AI system that doesn't care about you, which seems both more likely and more amenable to technical safety approaches (to me at least). I could imagine work on this agenda being quite important for _strategy_ research, though I am far from an expert here.