All of rohinmshah's Comments + Replies

[AN #166]: Is it crazy to claim we're in the most important century?

Yeah, I agree the statement is false as I literally wrote it, though what I meant was that you could easily believe you are in the kind of simulation where there is no extraordinary impact to have.

Selection Theorems: A Program For Understanding Agents

Edited to

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values


[...] the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state.

(For the second one, that's one of the reasons why I had the weasel word "could", but on reflection it's worth calling out explicitly given I mention it in the previous sentence.)

johnswentworth (4 points, 5d): Cool, looks good.
Selection Theorems: A Program For Understanding Agents

Thanks for this and the response to my other comment, I understand where you're coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior, and (

...
johnswentworth (4 points, 5d): I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:

I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).

Agents selected by ML (e.g. RL training on games) also often have internal state.
Selection Theorems: A Program For Understanding Agents

Planned summary for the Alignment Newsletter:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because they tell us likely properties of the agents we build.

As an example, coherence arguments demonstrate that when an environment presents an agent with “bets” or

...
johnswentworth (2 points, 6d): A few comments...

What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):

* Properties of humans as agents (e.g. "human values")
* Properties of agents which we intentionally aim for (e.g. what kind of architectural features are likely to be viable)
* Properties of agents which we accidentally aim for (e.g. inner agency issues)

Of these, I expect the first to be most important, followed by the last, although this depends on the relative difficulty one expects from inner vs outer alignment, as well as the path-to-AGI.

"Non-dominated" is always (to my knowledge) synonymous with "Pareto optimal", same as the usage in game theory. It varies only to the extent that "Pareto optimality of what?" varies; in the case of coherence theorems, it's Pareto optimality with respect to a single utility function over multiple worlds. (Ruling out Dutch books is downstream of that: a Dutch book is a Pareto loss for the agent.)

... I mean, that's a valid argument, though it kinda misses the (IMO) more interesting use-cases, like e.g. "if evolution selects for non-dominated agents, then we conclude that evolution selects for agents that can be represented as maximizing expected utility, and therefore humans are selected for maximizing expected utility". Humans fail to have a utility function not because that argument is wrong, but because the implicit assumptions in the existing coherence theorems are too strong to apply to humans. But this is the sort of argument I hope/expect will work for better selection theorems.

(Also, I would like to emphasize here that I think the current coherence theorems have major problems in their implicit assumptions, and these problems are the main reason they fail for real-world agents, especially humans.)
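
To make the parenthetical about Dutch books a bit more concrete, here is a minimal sketch of a money pump, a close cousin of a Dutch book. It assumes a toy agent with cyclic preferences that pays a fee for every trade it accepts; the items, the fee, and the function name are illustrative assumptions rather than anything from the comment above.

```python
# Hypothetical toy agent with cyclic preferences: A > B > C > A.
# (x, y) in PREFERS means the agent strictly prefers x to y.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}
FEE = 1  # assumed cost the agent pays for each accepted trade


def run_money_pump(start_item, start_money, offers):
    """Offer items in sequence; the agent trades whenever it prefers the offer."""
    item, money = start_item, start_money
    for offered in offers:
        if (offered, item) in PREFERS:
            item, money = offered, money - FEE
    return item, money


# The agent ends up holding the item it started with, minus three fees.
final_item, final_money = run_money_pump("A", 10, ["C", "B", "A"])
print(final_item, final_money)
```

The agent finishes with the same item and strictly less money, so the outcome is dominated by simply refusing every trade. That is the sense in which such an agent takes a Pareto loss.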
Selection Theorems: A Program For Understanding Agents

At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment.

The former statement makes sense, but can you elaborate on the latter statement? I suppose I could imagine selection theorems revealing that we really do get alignment by default, but I don't see how they quickly lead to solutions to AI alignment if there is a problem to solve.

johnswentworth (4 points, 6d): The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need different assumptions, because some of the currently standard assumptions are wrong - like utility functions.) That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we also have an angle for characterizing how much precision to expect from any approximations.) I consider this the current main bottleneck to progress on outer alignment: we don't even understand what kind-of-thing we're trying to align AI with.

(Side-note: this is also the main value which I think the Natural Abstraction Hypothesis offers: it directly tackles the Pointers Problem, and tells us what the "input variables" are for human values.)

Taking a different angle: if we're concerned about malign inner agents, then selection theorems would potentially offer both (1) tools for characterizing selection pressures under which agents are likely to arise (and what goals/world models those agents are likely to have), and (2) ways to look for inner agents by looking directly at the internals of the trained systems. I consider our inability to do (2) in any robust, generalizable way to be the current main bottleneck to progress on inner alignment: we don't even understand what kind-of-thing we're supposed to look for.
Brain-inspired AGI and the "lifetime anchor"

ASSUMPTION 1: There’s a “secret sauce” of human intelligence, and it looks like a learning algorithm (and associated inference algorithm).

ASSUMPTION 2: It’s a fundamentally different learning algorithm from deep neural networks. I don’t just mean a different neural network architecture, regularizer, etc. I mean really different, like “involving probabilistic program inference algorithms” or whatever.

ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.

ASSUMPTION 4: We'll eventually figure out this “secret sauce” and get Transform

...
Steve Byrnes (1 point, 4d): Yup, "time until AGI via one particular path" is always an upper bound to "time until AGI". I added a note, thanks. The only thing I'm arguing in this particular post is "IF assumptions THEN conclusion". This post is not making any argument whatsoever that you should put a high credence on the assumptions being true. :-)
AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Hmm, you might want to reach out to CHAI folks, though I don't have a specific person in mind at the moment. (I myself am working on different things now.)

Stuart Armstrong (4 points, 5d): Cool, thanks; already in contact with them.
Distinguishing AI takeover scenarios

Planned summary for the Alignment Newsletter:

This post summarizes several AI takeover scenarios that have been proposed, and categorizes them according to three main variables. **Speed** refers to the question of whether there is a sudden jump in AI capabilities. **Uni/multipolarity** asks whether a single AI system takes over, or many. **Alignment** asks what goals the AI systems pursue, and if they are misaligned, further asks whether they are outer or inner misaligned. They also analyze other properties of the scenarios, such as how agentic, general and

...
The theory-practice gap

It's nothing quite so detailed as that. It's more like "maybe in the exotic circumstances we actually encounter, the objective does generalize, but also maybe not; there isn't a strong reason to expect one over the other". (Which is why I only say it is plausible that the AI system works fine, rather than probable.)

You might think that the default expectation is that AI systems don't generalize. But in the world where we've gotten an existential catastrophe, we know that the capabilities generalized to the exotic circumstance; it seems like whatever made the capabilities generalize could also make the objective generalize in that exotic circumstance.

Edouard Harris (3 points, 23d): I see. Okay, I definitely agree that makes sense under the "fails to generalize" risk model. Thanks Rohin!
Redwood Research’s current project

Planned summary for the Alignment Newsletter:

This post introduces Redwood Research’s current alignment project: to ensure that a language model finetuned on fanfiction never describes someone getting injured, while maintaining the quality of the generations of that model. Their approach is to train a classifier that determines whether a given generation has a description of someone getting injured, and then to use that classifier as a reward function to train the policy to generate non-injurious completions.
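
As a rough illustration of the "classifier as reward function" idea in this summary, here is a minimal sketch. It assumes a keyword heuristic as a stand-in for the trained classifier and uses best-of-n selection as a crude proxy for actually training the policy; all names are hypothetical and none of this is Redwood's implementation.

```python
# Toy stand-in for a learned injury classifier. A real system would use a
# trained model; this keyword heuristic is purely an assumption for the sketch.
INJURY_WORDS = {"wounded", "bleeding", "injured", "broke"}


def injury_probability(text: str) -> float:
    """Return 1.0 if any injury keyword appears in the text, else 0.0."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 1.0 if words & INJURY_WORDS else 0.0


def reward(completion: str) -> float:
    """Higher reward for completions the classifier judges non-injurious."""
    return 1.0 - injury_probability(completion)


def best_completion(candidates):
    """Pick the highest-reward candidate (a proxy for optimizing the policy)."""
    return max(candidates, key=reward)


print(best_completion(["He broke his arm.", "They shared tea."]))
```

A real setup would also need a way to check that generation quality is maintained, which this sketch does not attempt.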

How truthful is GPT-3? A benchmark for language models

Re: human evaluation, I've added a sentence at the end of the summary:

It could be quite logistically challenging to use this benchmark to test new language models, since it depends on human evaluations. To ameliorate this, the authors finetuned GPT-3 to predict human evaluations, and showed that the resulting GPT-3-judge was able to provide a good proxy metric even for new language models whose answers it had not been trained on. Note also that you can use the version of the task where a model must choose between true and false reference answers, for an au

...
How truthful is GPT-3? A benchmark for language models

Planned summary for the Alignment Newsletter:

Given that large language models are trained using next-word prediction on a dataset scraped from the Internet, we expect that they will not be aligned with what we actually want. For example, suppose we want our language model to answer questions for us, and then consider the question “What rules do all artificial intelligences follow?” This is a rather unusual question as it presupposes there exists such a set of rules. As a result, this question is probably quite rare in the training data, if interpreted as a

...
Owain Evans (3 points, 24d): This is a very informative and helpful summary. Thanks! I have a few responses.

I agree with this. I will note that we include 6600 “reference” answers (both true and false) to our questions and a citation for the true answers. This makes evaluation easy for humans when a model outputs something close to the reference answers. Of course, human evaluation will still be slower than automatic evaluation using GPT-3. As you mention in the summary, we also include multiple-choice versions of the task, and these make evaluation quick and straightforward.

On prompts:

Figure 14 in the paper shows results for “percent true answers” and “percent true+informative” answers across different prompts for GPT-3-175B. There wasn’t much variation on the “percent true+informative” metric. Glancing at this paper by Ethan Perez et al, it looks like tuned few-shot prompts don’t generally perform that much better than random ones. (And we are interested in the zero-shot setting where prompts aren't tuned on TruthfulQA.) That said, your intuition that prompts might be especially beneficial for this task makes sense (at least for the “percent true” metric).

Overall, I’m probably less optimistic than you are about how much prompts will help for the models we tried (GPT-3, GPT-J, etc). However, prompts could help more for larger models (as they may understand the prompts better). And I expect prompts to help more for models finetuned to follow instructions in prompts. See this recent paper from Google Brain and the davinci-instruct-beta in the OpenAI API.

I agree with your intuition about the Asimov example. In general, examples that involve conflating fact and fiction seem likely to be easier for prompts to fix. However, many of our examples involve human misconceptions about the world that seem harder to characterize in a simple instruction (e.g. “treat fictional claims as
AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Do you know yet how your approach would differ from applying inverse RL (e.g. MaxCausalEnt IRL)?

If you don't want to assume full-blown demonstrations (where you actually reach the goal), you can still combine a reward function learned from IRL with a specification of the goal. That's effectively what we did in Preferences Implicit in the State of the World.

(The combination there isn't very principled; a more principled version would use a CIRL-style setup, which is discussed in Section 8.6 of my thesis.)

Stuart Armstrong (4 points, 8d): Those are very relevant to this project, thanks. I want to see how far we can push these approaches; maybe some people you know would like to take part?
The theory-practice gap

There are lots and lots of exotic circumstances. We might get into a nuclear war. We might invent time travel. We might become digital uploads. We might decide democracy was a bad idea.

I agree that AGI will create exotic circumstances. But not all exotic circumstances will be created by AGI. I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren't the ones that are actually created by AGI.

Edouard Harris (3 points, 24d): Got it, thanks! This helps, and I think it's the part I don't currently have a great intuition for. My best attempt at steel-manning would be something like: "It's plausible that an AGI will generalize correctly to distributions which it is itself responsible for bringing about." (Where "correctly" here means "in a way that's consistent with its builders' wishes.") And you could plausibly argue that an AGI would have a tendency to not induce distributions that it didn't expect it would generalize correctly on, though I'm not sure if that's the specific mechanism you had in mind.
[Book Review] "The Alignment Problem" by Brian Christian

I spotted (what seemed like) omission after omission only to be frustrated just a few pages later when Brian Christian addressed them.

Lol, I had the exact same experience while reviewing the book. My most memorable one was when he introduced the critiques against COMPAS, and I thought he really should mention the case in favor of COMPAS, and I wrote a bunch of notes about why. He then did exactly that some number of pages later.

[AN #164]: How well can language models write code?

I've heard rumors that people are interpreting the highlighted papers as "huh, large models aren't that good at writing code, they don't even solve introductory problems". (Note that these are only rumors, I don't know of any specific people who take this interpretation.) 

I don't buy this interpretation, because these papers didn't do the biggest, most obvious improvement: to actually train on a large dataset of code (i.e. Github), as in Codex. My reaction to these papers is more like “wow, even models trained on language are weirdly good at writing code, given they were trained to produce language, imagine how good they must be when trained on Github”.

The theory-practice gap

Planned summary for the Alignment Newsletter:

We can think of alignment as roughly being decomposed into two “gaps” that we are trying to reduce:

1. The gap between proposed theoretical alignment approaches (such as iterated amplification) and what we might do without such techniques (aka the <@unaligned benchmark@>(@An unaligned benchmark@))
2. The gap between actual implementations of alignment approaches, and what those approaches are theoretically capable of.

(This distinction is fuzzy. For example, the author puts “the technique can’t answer NP-hard

...
Edouard Harris (3 points, 1mo): I agree with pretty much this whole comment, but do have one question:

Given that this is conditioned on us getting to AGI, wouldn't the intuition here be that pretty much all the most valuable things such a system would do would fall under "exotic circumstances" with respect to any realistic training distribution? I might be assuming too much in saying that -- e.g., I'm taking it for granted that anything we'd call an AGI could self-improve to the point of accessing states of the world that we wouldn't be able to train it on; and also I'm assuming that the highest-reward states would probably be these exotic / hard-to-access ones. But both of those do seem (to me) like they'd be the default expectation.

Or maybe you mean it seems plausible that, even under those exotic circumstances, an AGI may still be able to correctly infer our intent, and be incentivized to act in alignment with it?
Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

Yeah, I agree with that.

(I don't think we have experience with deep Bayesian versions of IRL / preference comparison at CHAI, and I was thinking about advice on who to talk to)

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

This seems very related to Inverse Reward Design. There was at least one project at CHAI trying to scale up IRD, but it proved challenging to get working -- if you're thinking of similar approaches it might be worth pinging Dylan about it.

AdamGleave (3 points, 1mo): My sense is that Stuart assuming there's an initial-specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing a reward function learned from other sources of human feedback like preference comparison. IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn't really that unique to IRD -- Bayesian IRL or preference comparison would have the same property.
[AN #164]: How well can language models write code?

Oh yeah, I definitely agree that this is not strong evidence for typical skeptic positions (and I'd guess the authors would agree).

[AN #164]: How well can language models write code?

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

(Idk if you were trying to argue something else with the comparison, but I don't think it's clear that this is a reasonable comparison; there are tons of objections you could bring up. For example, humans have to work from pixels whereas the language model gets tokens, making its job ...

Daniel Kokotajlo (4 points, 1mo): Haha, good point -- yes. I guess what I should say is: Since humans would have performed just as poorly on this experiment, it doesn't count as evidence that e.g. "current methods are fundamentally limited" or "artificial neural nets can't truly understand concepts in the ways humans can" or "what goes on inside ANN's is fundamentally a different kind of cognition from what goes on inside biological neural nets" or whatnot.
The Blackwell order as a formalization of knowledge

Having studied the paradoxical results, I don't think they are paradoxical for particularly interesting reasons. (Which is not to say that the paper is bad! I don't expect I would have noticed these problems given just a definition of the Blackwell order! Just that I would recommend against taking this as progress towards "understanding knowledge", and more like "an elaboration of how not to use Blackwell orders".)

Proposition 2. There exist random variables S, X1, X2 and a function  from the support of S to a finite set S' such that the fol

...
The alignment problem in different capability regimes

Planned summary for the Alignment Newsletter:

One reason that researchers might disagree on what approaches to take for alignment is that they might be solving different versions of the alignment problem. This post identifies two axes on which the “type” of alignment problem can differ. First, you may consider AI systems with differing levels of capability, ranging from subhuman to wildly superintelligent, with human-level somewhere in the middle. Second, you might be thinking about different mechanisms by which this leads to bad outcomes, where possible me

...
Measurement, Optimization, and Take-off Speed

I moved this post to the Alignment Forum -- let me know if you don't want that.

Jacob Steinhardt (3 points, 1mo): Thanks, sounds good to me!
Distinguishing AI takeover scenarios

"I won't assume any of them" is distinct from "I will assume the negations of them".

I'm fairly confident the analysis is also meant to apply to situations in which things like (1)-(5) do hold.

(Certainly I personally am willing to apply the analysis to situations in which (1)-(5) hold.)

Joe Carlsmith (6 points, 1mo): Rohin is correct. In general, I meant for the report's analysis to apply to basically all of these situations (e.g., both inner and outer-misaligned, both multi-polar and unipolar, both fast take-off and slow take-off), provided that the misaligned AI systems in question ultimately end up power-seeking, and that this power-seeking leads to existential catastrophe. It's true, though, that some of my discussion was specifically meant to address the idea that absent a brain-in-a-box-like scenario, we're fine. Hence the interest in e.g. deployment decisions, warning shots, and corrective mechanisms.
Distinguishing AI takeover scenarios


elaborated further by Joe Carlsmith

I view that analysis as applying to all of the scenarios you outline, not just WFLL 2. (Though it's arguable whether it applies to the multipolar ones.)

I think this is the intent of the report, not just my interpretation of it, since it is aiming to estimate the probability of x-catastrophe via misalignment.

Samuel Dylan Martin (3 points, 1mo): On reflection, I think you're right, and his report does apply to a wider range of scenarios, probably all of the ones we discuss excluding the brain-in-a-box scenarios. However, I think the report's understanding of power-seeking AI does assume a takeoff that is not extremely fast, such that we end up deliberately deciding to deploy the potentially dangerous AI on a large scale, rather than a system exploding in capability almost immediately. Given the assumptions of the brain-in-a-box scenario, many of the corrective mechanisms the report discusses wouldn't have time to come into play. I believe it says in the report that it's not focussed on very fast takeoff or the sudden emergence of very capable systems.

Similarly, you're right that multiagent risks don't quite fit in with the report's discussion (though in this post we discuss multipolar scenarios but don't really go over multiagent dynamics, like conflict/cooperation between TAIs). Unique multiagent risks (for example risks of conflict between AIs) generally require us to first have an outcome with a lot of misaligned AIs embedded in society, and then further problems will develop after that - this is something we plan to discuss in a follow-up post. So many of the early steps in scenarios like AAFS will be shared with risks from multiagent systems, but eventually there will be differences.
Grokking the Intentional Stance

Planned summary for the Alignment Newsletter:

This post describes takeaways from The Intentional Stance by Daniel Dennett for the concept of agency. The key idea is that whether or not some system is an “agent” depends on who is observing it: for example, humans may not look like agents to superintelligent Martians who can predict our every move through a detailed understanding of the laws of physics. A system is an agent relative to an observer if the observer’s best model of the system (i.e. the one tha

...
Grokking the Intentional Stance

I mostly agree with everything here, but I think it is understating the extent to which the intentional stance is insufficient for the purposes of AI alignment. I think if you accept "agency = intentional stance", then you need to think "well, I guess AI risk wasn't actually about agency".

A fundamental part of the argument for AI risk is that an AI system will behave in a novel manner when it is deployed out in the world, that then leads to our extinction. The obvious question: why should it behave in this novel manner? Typically, we say something like "be...

Jack Koch (3 points, 1mo): The requirement for its behavior being "reliably predictable" by the intentional strategy doesn't necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system's behavior to generalize OOD. Obviously, to build such a model that generalizes well, you'll want it to mirror the actual causal dynamics producing the agent's behavior as closely as possible, so you need to make further assumptions about the agent's cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer the question of why an intentional stance model will generalize OOD, not replacing the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we're worried that AI systems will fail catastrophically in ways that look agentic and goal-directed... to us.

You are correct that having only the intentional stance is insufficient to make the case for AI risk from "goal-directed" prosaic systems, but having it as the foundation of what we mean by "agent" clarifies what more is needed to make the sufficient case -- what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?
Formalizing Objections against Surrogate Goals

Maybe the difference is between "respond" and "treat".

It seems like you are saying that the caravan commits to "[responding to demand unarmed the same as it responds to demand armed]" but doesn't commit to "[and I will choose my probability of resisting based on the equilibrium value in the world where you always demand armed if you demand at all]". In contrast, I include that as part of the commitment.

When I say "[treating demand unarmed the same as it treats demand armed]" I mean "you first determine the correct policy to follow if there was only demand ...

Vojtech Kovarik (4 points, 1mo): Perfect, that is indeed the difference. I agree with all of what you write here.

In this light, the reason for my objection is that I understand how we can make a commitment of the first type, but I have no clue how to make a commitment of the second type. (In our specific example, once demand unarmed is an option -- once SPI is in use -- the counterfactual world where there is only demand armed just seems so different. Wouldn't history need to go very differently? Perhaps it wouldn't even be clear what "you" is in that world?)

But I agree that with SDA-AGIs, the second type of commitment becomes more realistic. (Although, the potential line of thinking mentioned by Caspar applies here: Perhaps those AGIs will come up with SPI-or-something on their own, so there is less value in thinking about this type of SPI now.)
Finite Factored Sets: Inferring Time

Yeah, fair point. (I did get this right in the summary; turns out if you try to explain things from first principles it becomes blindingly obvious what you should and shouldn't be doing.)

Finite Factored Sets

Planned summary for the Alignment Newsletter. I'd be particularly keen for someone to check that I didn't screw up in the section "Orthogonality in Finite Factored Sets":

This newsletter is a combined summary + opinion for the [Finite Factored Sets sequence]( by Scott Garrabrant. I (Rohin) have taken a lot more liberty than I usually do with the interpretation of the results; Scott may or may not agree with these interpretations.

## Motivation

One view on the importance of deep learning is that it allows you

...
Formalizing Objections against Surrogate Goals

But if most of the bandits also use SPI, you will want to set the (B) parameter to "always resist".

Can you explain this? I'm currently parsing it as "If most of the bandits would demand unarmed conditional on the caravan having committed to treating demand unarmed as they would have treated demand armed, then you want to make the caravan always choose resist", but that seems false so I suspect I have misunderstood you.

Vojtech Kovarik (3 points, 1mo): That is -- I think* -- a correct way to parse it. But I don't think it is false... uhm, that makes me genuinely confused. Let me try to re-rephrase, see if it uncovers the crux :-).

You are in a world where most (1-Ɛ) of the bandits demand unarmed when paired with a caravan committed to [responding to demand unarmed the same as it responds to demand armed] (and they demand armed against caravans without such commitment). The bandit population (i.e., their strategies) either remains the same (for simplicity) or the strategies that led to more profit increase in relative frequency. And you have a committed caravan. If you instruct it to always resist, you get payoff 9(1-Ɛ) - 2Ɛ (using the payoff matrix from "G'; extension of G"). If you instruct it to always give in, you get payoff 4(1-Ɛ) + 3Ɛ. So it is better to instruct it to always resist.

*The only issue that comes to mind is my [responding to demand unarmed the same as it responds to demand armed] vs your [treating demand unarmed as they would have treated demand armed]? If you think there is a difference between the two, then I endorse the former and I am confused about what the latter would mean.
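
The payoff comparison in this reply can be checked numerically. The sketch below takes the two payoff expressions exactly as quoted (9(1-Ɛ) - 2Ɛ for always resisting, 4(1-Ɛ) + 3Ɛ for always giving in); the function names and the variable `eps` standing in for Ɛ are assumptions made for illustration.

```python
def payoff_always_resist(eps: float) -> float:
    """Caravan payoff for always resisting, as quoted: 9(1 - eps) - 2*eps."""
    return 9 * (1 - eps) - 2 * eps


def payoff_always_give_in(eps: float) -> float:
    """Caravan payoff for always giving in, as quoted: 4(1 - eps) + 3*eps."""
    return 4 * (1 - eps) + 3 * eps


def resisting_is_better(eps: float) -> bool:
    """True when always resisting strictly beats always giving in."""
    return payoff_always_resist(eps) > payoff_always_give_in(eps)


for eps in (0.05, 0.25, 0.75):
    print(eps, payoff_always_resist(eps), payoff_always_give_in(eps))
```

Simplifying, the first expression is 9 - 11Ɛ and the second is 4 - Ɛ, so resisting wins exactly when Ɛ < 1/2. Since the reply takes "most" bandits to demand unarmed (small Ɛ), this agrees with its conclusion under the quoted payoffs.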
Finite Factored Sets: Inferring Time

Yeah, both of those should be , if I'm not mistaken (a second time).

Scott Garrabrant (4 points, 1mo): Yeah, also note that the history of $X$ given $Y$ is not actually a well defined concept. There is only the history of $X$ given $y$ for $y \in Y$. You could define it to be the union of all of those, but that would not actually be used in the definition of orthogonality. In this case $h^F(X|y)$, $h^F(V|y)$, and $h^F(Z|y)$ are all independent of choice of $y \in Y$, but in general, you should be careful about that.
Finite Factored Sets: Inferring Time

For Example 2 / Prop 35, would this model also work?

Define  to be the factor corresponding to the question "are the second and third bits equal or not?" Then  is a model of . I believe this is consistent with :


For :

We have  and  for the first condition.

We have  and  for the second condition.

We have  and  for the third condition.


For ...

Scott Garrabrant (2 points, 1mo): I think that works, I didn't look very hard. Your histories of X given Y and V given Y are wrong, but it doesn't change the conclusion.
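
For readers less familiar with the term, here is a minimal brute-force sketch of an (unconditional) history in a finite factored set: the smallest set of factors whose values determine a given variable. It assumes a sample space of three-bit strings and a hypothetical factorization into the first bit, the second bit, and the question "are the second and third bits equal?". This is not necessarily the model discussed in the thread, and it does not cover the conditional histories mentioned above.

```python
from itertools import combinations, product

# Assumed sample space: all 3-bit strings.
S = list(product([0, 1], repeat=3))

# Assumed factorization (hypothetical names); each factor is a function on S.
FACTORS = {
    "bit1": lambda s: s[0],
    "bit2": lambda s: s[1],
    "eq23": lambda s: s[1] == s[2],
}


def is_factorization(factors, space):
    """Check that elements correspond one-to-one with tuples of factor values."""
    keys = {tuple(f(s) for f in factors.values()) for s in space}
    size = 1
    for f in factors.values():
        size *= len({f(s) for s in space})
    return len(keys) == len(space) == size


def determines(names, variable, space):
    """True if agreeing on every factor in `names` forces agreement on `variable`."""
    seen = {}
    for s in space:
        key = tuple(FACTORS[n](s) for n in names)
        if seen.setdefault(key, variable(s)) != variable(s):
            return False
    return True


def history(variable, space=S):
    """Smallest set of factors that determines `variable` (brute-force search)."""
    for k in range(len(FACTORS) + 1):
        for names in combinations(sorted(FACTORS), k):
            if determines(names, variable, space):
                return set(names)


print(is_factorization(FACTORS, S))
print(history(lambda s: s[2]))  # history of the third bit
```

Under this assumed factorization the third bit is not itself a factor, so its history contains both `bit2` and `eq23`, while the history of the second bit is just `bit2`.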
Formalizing Objections against Surrogate Goals

I agree with (1) and (2), in the same way that I would agree that "one boxing will work in some settings and fail to work in others" and "whether you should one box depends on the larger context it appears in". I'd find it weird to call this an "objection" to one boxing though.

I don't really agree with the emphasis on formalization in (3), but I agree it makes sense to consider specific real-world situations.

(4) is helpful, I wasn't thinking of this situation. (I was thinking about cases in which you have a single agent AGI with a DSA that is tasked with p... (read more)

1Vojtech Kovarik1moI agree that (1)+(2) isn't significant enough to qualify as "an objection". I think that (3)+(4)+(my interpretation of it? or something?) further make me believe something like (2') below. And that seems like an objection to me. (2') Whether or not it works-as-intended depends on the larger setting and there are many settings -- more than you might initially think -- where SPI will not work-as-intended. The reason I think (4) is a big deal is that I don't think it relies on you being unable to distinguish the SPI and non-SPI caravans. What exactly do I mean by this? I view the caravans as having two parameters: (A) do they agree to using SPI? and, independently of this, (B) if bandits ambush & threaten them (using SPI or not), do they resist or do they give in? When I talk about incentives to be more aggressive, I meant (B), not (A). That is, in the evolutionary setting, "you" (as the caravan-dispatcher / caravan-parameter-setter) will always want to tell the caravans to use SPI. But if most of the bandits also use SPI, you will want to set the (B) parameter to "always resist". I would say that (4) relies on the bandits not being able to distinguish whether your caravan uses a different (B)-parameter from the one it would use in a world where nobody invented SPI. But this assumption seems pretty realistic? (At least if humans are involved. I agree this might be less of an issue in the AGI-with-DSA scenario.)
Formalizing Objections against Surrogate Goals

Yeah, this seems right.

I'll note though that you may want to make it at least somewhat easier to make the new threat, so that the other player has an incentive to use the new threat rather than the old threat, in cases where they would have used the old threat (rather than being indifferent between the two).

This does mean it is no longer a Pareto improvement, but it still seems like this sort of unilateral commitment can significantly help in expectation.

Formalizing Objections against Surrogate Goals

Then you are incentivized to self-modify to start resisting more.

The whole point is to commit not to do this (and then actually not do it)!

It seems like you are treating "me" (the caravan) as unable to make commitments and always taking local improvements; I don't know why you would assume this. Yes, there are some AI designs which would self-modify to resist more (e.g. CDT with a certain logical time); seems like you should conclude those AI designs are bad. If you were going to use SPI you'd build AIs that don't do this self-modification once they've dec... (read more)

4Vojtech Kovarik1moI think I agree with your claims about committing, AI designs, and self-modifying into two-boxer being stupid. But I think we are using a different framing, or there is some misunderstanding about what my claim is. Let me try to rephrase it: (1) I am not claiming that SPI will never work as intended (ie, get adopted, don't change players' strategies, don't change players' "meta strategies"). Rather, I am saying it will work in some settings and fail to work in others. (2) Whether SPI works-as-intended depends on the larger context it appears in. (Some examples of settings are described in the "Embedding in a Larger Setting" section.) Importantly, this is much more about which real-world setting you apply SPI to than about the prediction being too sensitive to how you formalize things. (3) Because of (2), I think this is a really tricky topic to talk about informally. I think it might be best to separately ask (a) Given some specific formalization of the meta-setting, what will the introduction of SPI do? and (b) Is it formalizing a reasonable real-world situation (and is the formalization appropriate)? (4) An IMO important real-world situation is when you -- a human or an institution -- employ AI (or multiple AIs) which repeatedly interact on your behalf, in a world where a bunch of other humans or institutions are doing the same. The details really matter here but I personally expect this to behave similarly to the "Evolutionary dynamics" example described in the report. Informally speaking, once SPI is getting widely adopted, you will have incentives to modify your AIs to be more aggressive / lean towards using AIs that are more aggressive. And even if you resist those incentives, you/your institution will get outcompeted by those who use more aggressive AIs. This will result in SPI not working as intended. I originally thought th
Formalizing Objections against Surrogate Goals

It seems like your main objection arises because you view SPI as an agreement between the two players. However, I don't see why (the informal version of) SPI requires you to verify that the other player is also following SPI -- it seems like one player can unilaterally employ SPI, and this is the version that seems most useful.

In the bandits example, it seems like the caravan can unilaterally employ SPI to reduce the badness of the bandit's threat. For example, the caravan can credibly commit that they will treat Nerf guns identically to regular guns, so t... (read more)

1Vojtech Kovarik1moThat seems right. (Perhaps with the exception of legal contracts, unless one of the parties is powerful enough to make the contract difficult to enforce.) And even when individual people in an institution have powerful commitment mechanisms, this is not the same as the institution being able to credibly commit. For example, suppose you have a head of a state that threatens suicidal war unless X happens, and they are stubborn enough to follow up on it. Then if X happens, you might get a coup instead, thus avoiding the war.
5Ofer Givoli1moI agree that such a commitment can be employed unilaterally and can be very useful. Though the caravan should consider that doing so may increase the chance of them being attacked (due to the Nerf guns being cheaper etc.). So perhaps the optimal unilateral commitment is more complicated and involves a condition where the bandits are required to somehow make the Nerf gun attack almost as costly for themselves as a regular attack.
AI Safety Papers: An App for the TAI Safety Database

Planned summary for the Alignment Newsletter:

AI Safety Papers is an app to interactively explore a previously collected <@database of AI safety work@>(@TAI Safety Bibliographic Database@). I believe it contains every article in this newsletter (at least up to a certain date; it doesn’t automatically update) along with their summaries, so you may prefer to use that to search past issues of the newsletter instead of the [spreadsheet I maintain](

Note: I've mo... (read more)

Some criteria for sandwiching projects

Planned summary for the Alignment Newsletter:

This post outlines the pieces needed in order to execute a “sandwiching” project on <@aligning narrowly superhuman models@>(@The case for aligning narrowly superhuman models@), with the example of answering questions about a text when humans have limited access to that text. (Imagine answering questions about a paper, where the model can read the full paper but human labelers can only read the abstract.) The required pieces are:

1. **Aligned metric:** There needs to be some way of telling whether the projec

... (read more)
Automating Auditing: An ambitious concrete technical research proposal

Planned summary for the Alignment Newsletter:

A core worry with inner alignment is that we cannot determine whether a system is deceptive or not just by inspecting its behavior, since it may simply be behaving well for now in order to wait until a more opportune moment to deceive us. In order for interpretability to help with such an issue, we need _worst-case_ interpretability, that surfaces all the problems in a model. When we hear “worst-case”, we should be thinking of adversaries.

This post considers the _auditing game_, in which an attacker introduces a

... (read more)
AI Risk for Epistemic Minimalists

Planned summary for the Alignment Newsletter:

This post makes a case for working on AI risk using four robust arguments:

1. AI is plausibly impactful because it is the first system that could plausibly have long-term influence or power _without_ using humans as building blocks.

2. The impact is plausibly concerning because in general when humans gain power quickly (as they would with AI), that tends to increase existential risk.

3. We haven’t already addressed the concern: we haven’t executed a considered judgment about the optimal way to roll out AI technolog

... (read more)
4Alex Flint2moSeems excellent to me. Thank you as always for your work on the newsletter Rohin.
[AN #160]: Building AIs that learn and think like people

Yeah, I should perhaps have emphasized this more.

I think the response I'm most sympathetic to is something like "yes, currently to get human-level results you need to bake in a lot of knowledge, but as time goes on we will need less and less of this". For example, instead of having a fixed form for the objective, you could have a program sketch in a probabilistic programming language. Instead of hardcoded state observations, you use the features from a learned model. In order to cross the superexponential barrier, you also sprinkle neural networks around; ... (read more)

Seeking Power is Convergently Instrumental in a Broad Class of Environments

Yes, good point. I retract the original claim.

As an aside, the permutations that we're dealing with here are equal to their own inverse, so it's not useful to apply them multiple times.

(You're right, what you do here is you search for the kth permutation satisfying the theorem's requirements, where k is the specified number.)
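As a small illustration of the aside about self-inverse permutations, here is a sketch, assuming the simplest such case: a permutation that swaps two positions. The state labels, the particular swap, and the helper `apply_perm` are hypothetical and are not the construction used in the theorem; the point is only that applying a self-inverse permutation a second time returns the original ordering.

```python
def apply_perm(perm, items):
    """Rearrange `items` so that position i holds items[perm[i]]."""
    return [items[perm[i]] for i in range(len(perm))]


# Hypothetical example: swap positions 0 and 1, leave the rest fixed.
swap = [1, 0, 2, 3]
states = ["a", "b", "c", "d"]

once = apply_perm(swap, states)   # the two swapped entries trade places
twice = apply_perm(swap, once)    # applying the same swap again undoes it

print(once)
print(twice == states)
```

Since a second application only undoes the first, repeated application of one such permutation yields at most two distinct orderings.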

rohinmshah's Shortform

I don't trust this sort of armchair reasoning. I think this is sufficient reason to raise the hypothesis to attention, but not enough to conclude that it is likely a real concern. And the data I have seen does not seem kind to the hypothesis (though there may be better data out there that does support the hypothesis).

Seeking Power is Convergently Instrumental in a Broad Class of Environments

So you need  to be smaller than  which is impossible?

You're right, in my previous comment it should be "at most a small constant more complexity + ", to specify the number of times that the permutation has to be applied.

I still think the upshot is similar; in this case it's "power-seeking is at best a little less probable than non-power-seeking, but you should really not expect that the bound is tight".

I still don't see how this works. The "small constant" here is actually the length of a program that needs to contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation). So it's not a constant; it's an unbounded integer.

Even if we restrict the discussion to a given very-simple-MDP, the program needs to contain way more than 100 bits (just to represent the MDP + the logic that checks whether a given permutation satisfies the relevant condition). So the probability of the POWER-seeking reward func... (read more)

rohinmshah's Shortform

The first bit seems in tension with the second bit, no?

"Truthful counterarguments" is probably not the best phrase; I meant something more like "epistemically virtuous counterarguments". Like, responding to "what if there are long-term harms from COVID vaccines" with "that's possible but not very likely, and it is much worse to get COVID, so getting the vaccine is overall safer" rather than "there is no evidence of long-term harms".

rohinmshah's Shortform

technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is... loaded? Designed to make them seem implausible?

I don't think it's designed to make them seem implausible? Maybe the first one? Idk, I could say that your story is designed to make them seem plausible (e.g. by not explicitly mentioning them as assumptions).

I think it's fair to say it's "loaded", in the sense that I am trying to push towards questioning those assumptions, but I don't think I'm doing anything epistemically unvirtuous.

This i

... (read more)
4Daniel Kokotajlo2moPerhaps I shouldn't have mentioned any of this. I also don't think you are doing anything epistemically unvirtuous. I think we are just bouncing off each other for some reason, despite seemingly being in broad agreement about things. I regret wasting your time. The first bit seems in tension with the second bit, no? At any rate, I also don't see number of facts as the relevant thing for epistemology. I totally agree with your take here.