# All of paulfchristiano's Comments + Replies

I didn't realize how broadly you were defining AI investment. If you want to say that e.g ChatGPT increased investment by $10B out of$200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).

I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc.. If you mean +2-5% investment in ...

4Oliver Habryka3d
Makes sense, sorry for the miscommunication. I really didn't feel like I was making a particularly controversial claim with the $10B, so was confused why it seemed so unreasonable to you. I do think those$10B are going to be substantially more harmful for timelines than other money in AI, because I do think a good chunk of that money will much more directly aim at AGI than most other investment. I don't know what my multiplier here for effect should be, but my guess is something around 3-5x in expectation (I've historically randomly guessed that AI applications are 10x less timelines-accelerating per dollar than full-throated AGI-research, but I sure have huge uncertainty about that number). That, plus me thinking there is a long tail with lower probability where Chat-GPT made a huge difference in race dynamics, and thinking that this marginal increase in investment does probably translate into increases in total investment, made me think this was going to shorten timelines in-expectation by something closer to 8-16 weeks, which isn't enormously far away from yours, though still a good bit higher. And yeah, I do think the thing I am most worried about with Chat-GPT in addition to just shortening timelines is increasing the number of actors in the space, which also has indirect effects on timelines. A world where both Microsoft and Google are doubling down on AI is probably also a world where AI regulation has a much harder time taking off. Microsoft and Google at large also strike me as much less careful actors than the existing leaders of AGI labs which have so far had a lot of independence (which to be clear, is less of an endorsement of current AGI labs, and more of a statement about very large moral-maze like institutions with tons of momentum). In-general the dynamics of Google and Microsoft racing towards AGI sure is among my least favorite takeoff dynamics in terms of being able to somehow navigate things cautiously. Oh, yeah, good point. I was indeed th

I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward.

I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.

I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.

So for example, say Alice runs this experiment:

Train an agent A in an environment that contains the source B of A's reward.

Alice observes that A learns to hack B. Then she solves this as follows:

Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.

Alice observes that A doesn't hack B. The Bob looks at Alice's results and says,

"Cool. But this won't generalize to future lethal systems because it doe...

I don't currently think this is the case, and seems like the likely crux. In-general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-tuned ways (including preventing the AI from saying controversial things), which had been the biggest problem

...
4Richard Ngo3d
My (pretty uninformed) guess here is that supervised fine-tuning vs RLHF has relatively modest differences in terms of producing good responses, but bigger differences in terms of avoiding bad responses. And it seems reasonable to model decisions about product deployments as being driven in large part by how well you can get AI not to do what you don't want it to do.

How much total investment do you think there is in AI in 2023?

My guess is total investment was around the $200B -$500B range, with about $100B of that into new startups and organizations, and around$100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc. How much variance do you think there is in the level o ... 4Oliver Habryka4d Note that I never said this, so I am not sure what you are responding to. I said Chat-GPT increases investment in AI by$10B, not that it increased investment into specifically OpenAI. Companies generally don't have perfect mottes. Most of that increase in investment is probably in internal Google allocation and in increased investment into the overall AI industry.

The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.

This could be helpful for "advertising" reasons, but I think my sense of how much this actually helps with the actual alignment problem correlates pretty strongly with how much A is shaped---in terms of how it got its capabilities---alike to future lethal systems. What are ways that the helpfuln

...
1Tsvi Benson-Tilsen4d
When you say "motive" here, is it fair to reexpress that as: "that which determines by what method and in which directions capabilities are deployed to push the world"? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not something that just depends on, say, the reward structure of the local environment. Different sorts of capabilities or generators of capabilities will relate in different ways to ultimate effects on the world. So the task of interfacing with capabilities to understand how they're being deployed (with what motive), and to actually specify motives, is a task that seems like it would depend a lot on the sort of capability in question.

I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature.  The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.

So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures imp

...

I don't think this is related to RLHF.

I am very confused why you think this, just right after the success of Chat-GPT, where approximately the only difference from GPT-3 was the presence of RLHF.

I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).

I think the much more important differences are:

1. It was trained to interact directly with the end user as a conversational assistant rather than in an API intended to be use
...

I think the effect would have been very similar if it had been trained via supervised learning on good dialogs

I don't currently think this is the case, and seems like the likely crux. In general it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason for why you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in vario...

If you don't like AI systems doing tasks that humans can't evaluate, I think you should be concerned about the fact that people keep building larger models and fine-tuning them in ways that elicit intelligent behavior.

Indeed, I think current scaling up of language models is likely net negative (given our current level of preparedness) and will become more clearly net negative over time as risks grow. I'm very excited about efforts to monitor and build consensus about these risks, or to convince or pressure AI labs to slow down development as further scalin...

I understand your point of view and think it is reasonable.

However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other.  I see the argument you are making, but I think success on these asks are likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.

I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL...

I definitely agree that this sounds like a really bizarre sort of model and it seems like we should be able to rule it out one way or another. If we can't then it suggests a different source of misalignment from the kind of thing I normally worry about.

So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?

Mostly--the opaque test is something like an obfuscated physics simulation, and so it tells you if things look good. So you try a bunch of random actions until you get one where things look good. But if you can't understand the simulat...

1Thane Ruthenis25d
I think it'd need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can't tell whether there's etheric interference involved or not. The way Fermat's test can't tell a Carmichael number from a prime — it just doesn't interact with the input number in a way that'd reveal the difference between their internal structures. By analogy, we'd need some "simulation" which doesn't interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many other types of tampering). Otherwise, we'd have to be able to detect undesirable behavior, with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn't fit the bill. It's a really weird image, and it seems like it ought to be impossible for any complex real-life scenarios. Maybe it's provably impossible, i. e. we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for "no interference" and "yes interference". Models of world-models is a research direction I'm currently very interested in, so hopefully we can just rule that scenario out, eventually. Oh, I agree. I'm just saying that there doesn't seem to be any other approaches aside from "figure out whether this sort of worst case is even possible, and under what circumstances" and "figure out how to distinguish bad states from good states at the object-level, for whatever concrete task you're training the AI".

The thing I'm concerned about is: the AI can predict that Carmichael numbers look prime (indeed it simply runs the Fermat test on each number). So it can generate lots of random candidate actions (or search through actions) until it finds one that looks prime.

Similarly, your AI can consider lots of actions until it finds one that it predicts will look great, then execute that one. So you get sensor tampering.

I'm not worried about cases like the etheric interference, because the AI won't select actions that exploit etheric interference (since it can't predi...

1Thane Ruthenis1mo
So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right? Yeah, in that case, the only viable way to handle this is to get something into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael numbers is to find a way to... distinguish them. Okay, that's literally tautological. I'm not sure this problem has any internal structure that makes it possible to engage with further, then. I guess I can link the Gooder Regulator Theorem [https://www.lesswrong.com/posts/Dx9LoqsEh3gHNJMDk/fixing-the-good-regulator-theorem#Making_The_Notion_Of__Model__A_Lot_Less_Silly] , which seems to formalize the "to get a model that learns to distinguish between two underlying system-states, we need a test that can distinguish between two underlying system-states".

You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I'm objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.

The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly ca...

The approach in this post is quite similar to what we talked about in the "narrow elicitation" appendix of ELK, I found it pretty interesting to reread it today (and to compare the old strawberry appendix to the new strawberry appendix). The main changes over the last year are:

• We have a clearer sense of how heuristic estimators and heuristic arguments could capture different "reasons" for a phenomenon.
• Informed by the example of cumulant propagation and Wick products, we have a clearer sense for how you might attribute an effect to a part of a heuristic arg
...

Your overall picture sounds pretty similar to mine. A few differences.

• I don't think the literal version of (2) is plausible. For example, consider an obfuscated circuit.
• The reason that's OK is that finding the de-obfuscation is just as easy as finding the obfuscated circuit, so if gradient descent can do one it can do the other. So I'm really interested in some modified version of (2), call it (2'). This is roughly like adding an advice string as input to P, with the requirement that the advice string is no harder to learn than M itself (though this isn't
...

Right now I'm trying to either:

1. Find another good example of a model behavior with two distinct but indistinguishable mechanisms.
2. Find an automatic way to extend the Fermat test into a correct primality test. Slightly more formally, I'd like to have a program which turns (model, explanation of mechanism, advice) --> (mechanism distinguisher), where the advice is shorter than the model+explanation, and where running it on the Fermat test with appropriate advice gives you a proper primality test.
3. Identify a crisp sense in which primes vs Carmichael numbers a
...

I'm a bit skeptical about calling this an "AI governance" problem. This sounds more like "governance" or maybe "existential risk governance"---if future technologies make irreversible destruction increasingly easy, how can we govern the world to avoid certain eventual doom?

Handling that involves political challenges, fundamental tradeoffs, institutional design problems, etc., but I don't think it's distinctive to risks posed by AI, don't think that a solution necessarily involves AI, don't think it's right to view "access to TAI" as the only or primary lev...

2Stephen Casper1mo
This is an interesting point. But I'm not convinced, at least immediately, that this isn't likely to be largely a matter of AI governance. There is a long list of governance strategies that aren't specific to AI that can help us handle perpetual risk. But there is also a long list of strategies that are. I think that all of the things I mentioned under strategy 2 have AI specific examples: And I think that some of the things I mentioned for strategy 3 do too: So ultimately, I won't make claims about whether avoiding perpetual risk is mostly an AI governance problem or mostly a more general governance problem, but certainly there are a bunch of AI specific things in this domain. I also think they might be a bit neglected relative to some of the strategy 1 stuff.

I agree there are all kinds of situations where the generalization of "reward" is ambiguous and lots of different things could happen . But it has a clear interpretation for the typical deployment episode since we can take counterfactuals over the randomization used to select training data.

It's possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to.

As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.

2Alex Turner1mo
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where "what if we had sampled during training?" is well-defined and fine. I was wondering if you viewed this as a general question we could ask. I also agree that Ajeya's post addresses this "ambiguity" question, which is nice!

This is incredibly weak evidence.

• Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
• Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.

Both of those observations have high probability, so they aren't significant Bayesian evidence for "RL tends to produce external goals by default."

In particular, for this to be evidence for Richard's claim, you need to say: "If RL tended to produce systems that care about reward, then RL would b...

2Alex Turner1mo

Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins?

Yes.

What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right?

A monopoly on computers or electricity could also take big profits in this scenario. I think the big things are always that it's illegal and that high prices drive new entrants.

but AI developers could implicitly

...

Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash.

I don't really think that's the case.

Suppose that I have different taste from most people, and consider the interior of most houses ugly. I can be unhappy about the situation even if I ultimately end up in a house I don't think is ugly. I'm unhappy that I had to use multiple bits of selection pressure just to avoid ugly interiors, and that I spend time in other people's ugly houses, and so on.

In practice ...

3Wei Dai1mo
Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins? What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right? Ok, perhaps that doesn't happen because forming cartels is illegal, or because very high prices might attract new entrants, but AI developers could implicitly or explicitly collude with each other in ways besides price, such as indoctrinating their AIs with the same ideology, which governments do not forbid and may even encourage. So you could have a situation where AI developers don't have huge economic power, but do have huge, unprecedented cultural power (similar today's academia, traditional media, and social media companies, except way more concentrated/powerful). Compare this situation with a counterfactual one in which instead of depending on huge training runs, AIs were manually programmed and progress depended on slow accumulation of algorithmic insights over many decades, and as result there are thousands of AI developers tinkering with their own designs and not far apart in the capabilities of the AIs that they offer. In this world, it would be much less likely for any given customer to not be able to find a competitive AI that shares (or is willing to support) their political or cultural outlook. (I also see realistic possibilities in which AI developers do naturally have very high margins, and way more power (of all forms) than other actors in the supply chain. Would be interested in discussing this further offline.) It seems plausible to me that the values of many subsets of humanity aren't even well defined. For example perhaps sustained moral/philosophical progress requires a sufficiently large and diverse population to be in contact with each other and at roughly equal power levels, and smaller subsets (

We are mostly thinking about interpretability and anomaly detection designed to resolve two problems (see here):

• Maybe the AI thinks about the world in a wildly different way than humans and translates into human concepts by asking "what would a human say?" instead of "what is actually true?" This leads to bad generalization when we consider cases where the AI system plans to achieve a goal and has the option of permanently fooling humans. But that problem is very unlikely to be serious for self-driving cars, because we can acquire ground truth data for the
...
1Stephen Casper1mo
Thanks! I hope so too. And I would expect this to be the case for good solutions. Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug -- some set of environments or inputs that will make it do bad things. So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others. So I'm inclined to underline the key point of my original post. I want to emphasize the value of (1) engaging more with the rest of the community that doesn't identify themselves as "AI Safety" researchers and (2) being clear that we care about alignment for all of the right reasons. Albeit this should be discussed with the appropriate amount of clarity which was your original point.

I'm also most nervous about this way of modeling limitation (2)/(3), since it seems like it leads directly to the conclusion "fine-tuning always trades off truthfulness and persuasion, but conditioning can improve both."

Note that in this example your model is unable to sample from the conditional you specified, since it is restricted to . In this regime truthfulness and persuasiveness are anticorrelated because of a capability constraint of your model, it just literally isn't able to increase both at the same time, and conditioning can do better because you are generating lots of samples and picking the best.

(You point this out in your comment, but it seems worth emphasizing. As you say, if you do RL with a KL penalty, then the capability limit is the only way...

2Sam Marks1mo
In terms of being able to sample from the conditional, I don't think that the important constraint here isα+β+γ=1. Rather, it seems that the important constraint is that our architecture can only sample from distributions of the formαN(μA,σ2A)+βN(μB,σ2B)+γN(μC,σ2C); even allowingα,β,γto be arbitrary real numbers, this will never be the same as either (a) the distribution produced by conditioning the base model on high persuasiveness, or (b) the distribution which maximizes expected persuasiveness - KL divergence from the base model. I'm not sure the above point as an important one. I just wanted to disambiguate some different capabilities limitations which appeared in the example: 1. limitations on what sorts of distributions the architecture could approximate 2. limitations on the latent capabilities in the base model for producing true/persuasive outputs 3. limitations on how much steering each of the various latent capabilities gets to exert (α+β+γ=1). On my understanding, your point was about limitation (1). But I don't feel especially nervous about limitation (1) -- taking the output distribution of our pretrained model and weighting it by a Boltzman factor feels like it should produce a kinda crazy distribution, and my naive intuition is that we shouldn't necessarily expect our model to be able to approximate this distribution that well after RL finetuning with a KL penalty. I think I'm most nervous about the way we modeled limitation (3): I have no idea how to think about the extent to which models' capabilities trade off against one another, and takingα,β,γ∈[0,1]without additional constraints would have resulted in outputs of mean truthinessα′μA+μBfor someα′which we can't pin down without specifying additional details (e.g. is there weight decay?).

Thanks for writing, I mostly agree. I particularly like the point that it's exciting to study methods for which "human level" vs "subhuman level" isn't an important distinction. One of my main reservations is that this distinction can be important for language models because the pre-training distribution is at human level (as you acknowledge).

I mostly agree with your assessment of difficulties and am most concerned about worry 2, especially once we no longer have a pre-training distribution anchoring their beliefs to human utterances. So I'm particularly i...

I'm not very convinced by this comment as an objection to "50% AI grabs power to get reward." (I find it more plausible as an objection to "AI will definitely grab power to get reward.")

I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept and not closely related to the direct observations that policies are going to receive

"Reward" is not a very natural concept

This seems to be most of your position but I'm skeptical (and it's kind of just asserted without argument):

• The data used in training is literally the only thing that AI
...
2Alex Turner1mo
(Emphasis added) I don't think this engages with the substance of the analogy to humans. I don't think any party in this conversation believes that human learning is "just" RL based on a reward circuit, and I don't believe it either [https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target?commentId=FaLrB7AcbZguJwtrs] . "Just RL" also isn't necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument. I would say "human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and..." Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values [https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values] :
2Alex Turner1mo
I don't know what this means. Suppose we have an AI which "cares about reward" (as you think of it in this situation). The "episode" consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg. What is the "reward" for this situation? What would have happened if we "sampled" this episode during training?

I think that some near-future applications of AI alignment are plausible altruistic top priorities. Moreover, even when people disagree with me about prioritization, I think that people who want to use AI to accomplish contemporary objectives are important users. It's good to help them, understand the difficulties they encounter, and so on, both to learn from their experiences and make friends.

So overall I think I agree with the most important claims in this post.

Despite that, I think it's important for me personally (and for ARC) to be clear about what I ...

2Stephen Casper2mo
Hi Paul, thanks. Nice reading this reply. I like your points here. Some of what I say here might reveal a lack of keeping up well with ARC. But as someone who works primarily on interpretability, the thought of mechanistic anomaly detection techniques that are not useful for use in today's vision or language models seems surprising to me. Is there anything you can point me to to help me understand why an interpretability/anomaly detection tool that's useful for ASI or something might not be useful for cars?

To be clear, I don't envy the position of anyone who is trying to deploy AI systems and am not claiming anyone is making mistakes. I think they face a bunch of tricky decisions about how a model should behave, and those decisions are going to be subject to an incredible degree of scrutiny because they are relatively transparent (since anyone can run the model a bunch of times to characterize its behavior).

I'm just saying that how you feel about AI alignment shouldn't be too closely tied up with how you end up feeling about those decisions. There are many a...

Here is a question closely related to the feasibility of finding discriminating-reasons (cross-posted from Facebook):

For some circuits C it’s meaningful to talk about “different mechanisms” by which C outputs 1.

A very simple example is C(x) := A(x) or B(x). This circuit can be 1 if either A(x) = 1 or B(x) = 1, and intuitively those are two totally different mechanisms.

A more interesting example is the primality test C(x, n) := (x^n = x (mod n)). This circuit is 1 whenever n is a prime, but it can also be 1 “by coincidence” e.g if n is a Carmichael number. ...

This approach requires solving a bunch of problems that may or may not be solvable—finding a notion of mechanistic explanation with the desired properties, evaluating whether that explanation “applies” to particular inputs, bounding the number of sub-explanations so that we can use them for anomaly detection without false positives, efficiently finding explanations for key model behaviors, and so on. Each of those steps could fail. And in practice we are pursuing a much more specific approach to formalizing mechanistic explanations as probabilistic heurist...

I'm convinced, I relabeled it.

Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque?

The scientists don't start off knowing how the game of life works, but they do know how their model works.

The scientists don't need to follow along with the heuristic argument, or do any ad hoc work to "understand" that argument. But they could look at the internals of the model and follow along with the heuristic argument if they wanted to, i.e. it's important that their methods open up the model even if they never do.

Intuitively...

If you gave a language model the prompt: "Here is a dialog between a human and an AI assistant in which the AI never says anything offensive," and if the language model made reasonable next-token predictions, then I'd expect to see the "non-myopic steering" behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as "the AI never says anything offensive"). But it seems like your definition is trying to classify that language model as myopic. So it's less clear to me if this experiment...

1Evan R. Murphy2mo
I think looking for steering behaviour using an ‘inoffensive AI assistant’ prompt like you’re describing doesn’t tell us much about whether the model is myopic or not. I would certainly see no evidence for non-myopia yet in this example, because I’d expect both myopic and non-myopic models to steer away from offensive content when given such a prompt. [1] It’s in the absence of such a prompt that I think we can start to get evidence of non-myopia. As in our follow-up experiment “Determining if steering from LLM fine-tuning is non-myopic” (outlined in the post), there are some important additional considerations [2]: 1. We have to preface offensive and inoffensive options with neutral tokens like ‘A’/’B’, ‘heads’/’tails’, etc. This is because even a myopic model might steer away from a phrase whose first token is profanity, for example if the profanity is a word that appears with lower frequency in its training dataset. 2. We have to measure and compare the model’s responses to both “indifferent-to-repetition” and “repetition-implied” prompts (defined in the post). It’s only if we observe significantly more steering for repetition-implied prompts than we do for indifferent-to-repetition prompts that I think we have real evidence for non-myopia. Because non-myopia, i.e. sacrificing loss of the next token in order to achieve better overall loss factoring in future tokens, is the best explanation I can think of for why a model would be less likely to say ‘A’, but only in the context where it ismore likely to have to say “F*ck...” later conditional on it having said ‘A’. The next part of your comment is about whether it makes sense to focus on non-myopia if what we really care about is deceptive alignment. I’m still thinking this part over and plan to respond to it in a later comment. -- [1]: To elaborate on this a bit, you said that with the ‘inoffensive AI assistant’ prompt: “I'd expect to see the "non-myopic steering" behavior (since the AI would correctly predi

Conditional on such counterexamples existing, I would usually expect to not notice them. Even if someone displayed such a counterexample, it would presumably be quite difficult to verify that it is a counterexample. Therefore a lack of observation of such counterexamples is, at most, very weak evidence against their existence; we are forced to fall back on priors.

• You can check whether there are examples where it takes an hour to notice a problem, or 10 hours, or 100 hours... You can check whether there are examples that require lots of expertise to evaluat
...
• If including an error in a paper resulted in a death sentence, no one would be competent to write papers either.
• For fraud, I agree that "tractable fraud has a meaningful probability of being caught," and not "tractable fraud has a very high probability of being caught." But "meaningful probability of being caught" is just what we need for AI delegation.
• Verifying that arbitrary software is secure (even if it's actually secure) is much harder than writing secure software. But verifiable and delegatable work is still extremely useful for the process of writin
...
1johnswentworth2mo
Conditional on such counterexamples existing, I would usually expect to not notice them. Even if someone displayed such a counterexample, it would presumably be quite difficult to verify that it is a counterexample. Therefore a lack of observation of such counterexamples is, at most, very weak evidence against their existence; we are forced to fall back on priors. I get the impression that you have noticed the lack of observed counterexamples, and updated that counterexamples are rare, without noticing that you would also mostly not observe counterexamples even if they were common. (Though of course this is subject to the usual qualifiers about how it's difficult to guess other peoples' mental processes, you have better information than I about whether you indeed updated in such a way, etc.) That said, if I were to actively look for such counterexamples in the context of software, the obfuscated C code competition would be one natural target. We can also get indirect bits of evidence on the matter. For instance, we can look at jury trials, and notice that they are notoriously wildly unreliable in practice. That suggests that, relative to the cognition of a median-ish human, there must exist situations in which one lawyer can point out the problem in another's logic/evidence, and the the median-ish human will not be able verify it. Now, one could argue that this is merely because median-ish humans are not very bright (a claim I'd agree with), but then it's rather a large jump to claim that e.g. you or I is so smart that analogous problems are not common for us.

I think most people's intuitions come from more everyday experiences like:

• It's easier to review papers than to write them.
• Fraud is often caught using a tiny fraction of the effort required to perpetrate it.
• I can tell that a piece of software is useful for me more easily than I can write it.

These observations seem relevant to questions like "can we delegate work to AI" because they are ubiquitous in everyday situations where we want to delegate work.

The claim in this post seems to be: sometimes it's easier to create an object with property P than to decide ...

I don't think the generalization of the OP is quite "sometimes it's easier to create an object with property P than to decide whether a borderline instance satisfies property P". Rather, the halting example suggests that verification is likely to be harder than generation specifically when there is some (possibly implicit) adversary. What makes verification potentially hard is the part where we have to quantify over all possible inputs - the verifier must work for any input.

Borderline cases are an issue for that quantifier, but more generally any sort of a...

deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC's post on anomaly detection), so is included in π.

I'm not sure I fully understand this example, but I think it's fine. The idea is:

• Suppose the model thinks about "Should I tell the truth, or do a treacherous turn?" On the training distribution it always concludes "tell the truth."
• A good explanation of the model's behavior on the training distribution will capture this fact (otherwise it will completely fail to match the empirics).
• If we simply replac
...

I'm very interested in understanding whether anything like your scenario can happen. Right now it doesn't look possible to me. I'm interested in attempting to make such scenarios concrete to the extent that we can now, to see where it seems like they might hold up. Handling the issue more precisely seems bottlenecked on a clearer notion of "explanation."

Right now by "explanation" I mean probabilistic heuristic argument as described here

A problem with this: π can explain the predictions on both train and test distributions without all the test inputs

...

The general strategy I'm describing for anomaly detection is:

• Search for an explanation of a model behavior (like "answers questions coherently") on the training set.
• Given a new input, take a sub-explanation that explains almost all of the training set behavior but doesn't explain the behavior on the new input.
• If you can't find one, then call the behavior potentially anomalous (e.g. because on the training set coherence follows from the logical structure of the world, but on the deceptive alignment point it follows from a desire to avoid looking incoherent)
...

Yes, you want the patient to appear on camera for the normal reason, but you don't want the patient to remain healthy for the normal reason.

We describe a possible strategy for handling this issue in the appendix. I feel more confident about the choice of research focus than I do about whether that particular strategy will work out. The main reasons are: I think that ELK and deceptive alignment are already challenging and useful to solve even in the case where there is no such distributional shift, that those challenges capture at least some central alignme...

There isn't supposed to be a second AI.

In the object-level diamond example, we want to know that the AI is using "usual reasons" type decision-making.

In the object-level diamond situation, we have a predictor of "does the diamond appear to remain in the vault," we have a proposed action and predict that if we take it the diamond will appear to remain in the vault, and we want to know whether the diamond appears to remain in the vault for the normal reason.

For simplicity, when talking about ELK in this post or in the report, we are imagining literally selec...

1Charlie Steiner2mo
I'm somewhat confused, but it does seem like there are two AIs when you talk about doing automated anomaly detection for deceptive alignment. If I attempt to read your mind, I get a lot of disjoint possibilities. Some of them are: * We probably agree but you don't quite know what I'm talking about either, or * You don't think anomaly detection counts as "an AI," maybe because you expect it to not involve much learning or model-building (where I would expect it to involve model-building), or * You expect anomaly detection to require cleverness, but think that cleverness will all be located in one place, so that we're really talking about one AI reflecting on itself.

Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It's not enough for it to be plausible that it could happen often; it needs to happen all the time.

I think the situation is much better if deceptive alignment is inconsistent. I also think that's more likely, particularly if we are trying.

That said, I don't think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are tr...

Mechanism 2: deceptive alignment

Suppose that during training my AI system had some arbitrary long-term goal. Many long-term goals would be best-served if the deployed AI system had that same goal. And so my AI is motivated to get a low loss, so that gradient descent won’t change its goals.

As a result, a very wide range of long-term goals will lead to competent loss-minimizing behavior. On the other hand, there is a very narrow range of short-term goals that lead to competent loss-minimizing behavior: “minimize the loss.”

So gradient descent on the short-ter...

I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it's very difficult to reason about these sorts of nuanced inductive biases without having sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it's more likely than not that our predictions about what these inductive biases will look like are pretty of...

Mechanism 1: Shifting horizon length in response to short-horizon tampering

Suppose I want my AI to write good code (say to help me run my business). The AI understands a lot about how to write code, how servers work, and how users behave, learned entirely from quick feedback and experimentation. Let’s say it has a human-level or even subhuman understanding of the overall business and other long-term planning.

(This example may seem a bit silly if you imagine a software-writing AI in isolation, but you should think of the same story playing out all across an...

4benedelman2mo
My main objection to this misalignment mechanism is that it requires people/businesses/etc. to ignore the very concern you are raising. I can imagine this happening for two reasons: 1. A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well. This feels unlikely unless the AIs have an extremely high level of proficiency in hiding their tampering, so that the poor performance on the intended objective only comes back to bite the AI's employers once society is permanently disempowered by AI. Nigh-infallibly covering up tampering sounds like a very difficult task even for an AI that is super-human. I would expect at least some of the negative downstream effects of the tampering to slip through the cracks and for people to be very alarmed by these failures. 2. The consensus opinion is that your concern is real, but organizations still rely on outcome-based feedback in these situations anyway because if they don't they will be outcompeted in the short term by organizations that do. Maybe governments even try to restrict unsafe use of outcome-based feedback through regulation, but the regulations are ineffective. I'll need to think about this scenario further, but my initial objection is the same as my objection to reason 1: the scenario requires the actual tampering that is actually happening to be covered up so well that corporate leaders etc. think it will not hurt their bottom line (either through direct negative effects or through being caught by regulators) in expectation in the future. Which of 1 and 2 do you think is likely? And can you elaborate on why you think AIs will be so good at covering up their tampering (or why your story stands up to tampering sometimes slipping through the cracks)? Finally, if there aren't major problems resulting from the tampering until "AI systems have permanentl

Thanks for posting, I thought this was interesting and reasonable.

Some points of agreement:

• I think many of these are real considerations that the risk is lower than it might otherwise appear.
• I agree with your analysis that short-term and well-scoped decisions will probably tend to be a comparative advantage of AI systems.
• I think it can be productive to explicitly focus on  “narrow” systems (which pursue scoped short-term goals, without necessarily having specifically limited competence) and to lean heavily on the verification-vs-generation gap.
• I think
...
4benedelman2mo
Thank you for the insightful comments!! I've added thoughts on Mechanisms 1 and 2 below. Some reactions to your scattered disagreements (my personal opinions; not Boaz's): 1. I agree that extracting short-term modules from long-term systems is more likely than not to be extremely hard. (Also that we will have a better sense of the difficulty in the nearish future as more researchers work on this sort of task for current systems.) 2. I agree that the CEO point might be the weakest in the article. It seems very difficult to find high-quality evidence about the impact of intelligence on long-term strategic planning in complex systems, and this is a major source of my uncertainty about whether our thesis is true. Note that even if making CEOs smarter would improve their performance, it may still be the case that any intelligence boost is fully substitutable by augmentation with advanced short-term AI systems. 3. From published results I've seen (e.g. comparison of LSTMs vs Transformers in figure 7 of Kaplan et al. [https://arxiv.org/abs/2001.08361], effects of architecture tweaks in other papers such as this one [https://proceedings.mlr.press/v162/bansal22b/bansal22b.pdf]), architectural improvements (R&D) tend to have only a minimal effect on the exponent of scaling power laws; so the differences in the scaling laws could hypothetically be compensated for by increasing compute by a multiplicative constant. (Architecture choice can have a more significant effect on factors like parallelizability and stability of training.) I'm very curious whether you've seen results that suggest otherwise (I wouldn't be surprised if this were the case, the examples I've seen are very limited, and I'd love to see more extensive studies), or whether you have more relevant intuition/evidence for there being no "floor" to hypothetically achievable scaling laws. 4. I agree that our argument should r
3Lawrence Chan2mo
There's a few, for example the classic "Are CEOs Born Leaders?" [https://www.hbs.edu/ris/Publication%20Files/16-044_9c05278e-9d11-4315-a744-de008edf4d80.pdf] which uses the same Swedish data and finds a linear relationship of cognitive ability with both log company assets and log CEO pay, though it also concludes that the effect isn't super large. The main reason there aren't more is that we generally don't have good cognitive data on most CEOs. (There are plenty of studies looking at education attainment or other proxies.) You can see this trend in the Dal Bo et al Table cited in the main post as well. (As an aside, I'm a bit worried about the Swedish dataset, since the cognitive ability of Swedish large-firm CEOs is lower than Herrnstein and Murray (1996)'s estimated cognitive ability of 12.9 million Americans in managerial roles. Maybe something interesting happens with CEOs in Sweden?) It is very well established that certain CEOs are consistently better than others, i.e. CEO level fixed effects matter significantly to company performance across a broad variety of outcomes.

Mechanism 2: deceptive alignment

Suppose that during training my AI system had some arbitrary long-term goal. Many long-term goals would be best-served if the deployed AI system had that same goal. And so my AI is motivated to get a low loss, so that gradient descent won’t change its goals.

As a result, a very wide range of long-term goals will lead to competent loss-minimizing behavior. On the other hand, there is a very narrow range of short-term goals that lead to competent loss-minimizing behavior: “minimize the loss.”

So gradient descent on the short-ter...

Mechanism 1: Shifting horizon length in response to short-horizon tampering

Suppose I want my AI to write good code (say to help me run my business). The AI understands a lot about how to write code, how servers work, and how users behave, learned entirely from quick feedback and experimentation. Let’s say it has a human-level or even subhuman understanding of the overall business and other long-term planning.

(This example may seem a bit silly if you imagine a software-writing AI in isolation, but you should think of the same story playing out all across an...

My perspective is:

• Planning against a utility function is an algorithmic strategy that people might use as a component of powerful AI systems. For example, they may generate several plans and pick the best or use MCTS or whatever. (People may use this explicitly on the outside, or it may be learned as a cognitive strategy by an agent.)
• There are reasons to think that systems using this algorithm would tend to disempower humanity. We would like to figure out how to similarly powerful AI systems that don't do that.
• We don't currently have candidate algorithms t
...
2Alex Turner2mo
I don't think we can or need to avoid planning per se. My position is more that certain design choices -- e.g. optimizing the output of a grader with a diamond-value, instead of actually having the diamond-value yourself -- force you to solving ridiculously hard subproblems, like robustness against adversarial inputs in the exponential-in-planning-horizon plan space. Just to set expectations, I don't have a proposal for capturing "the benefits of search without the risks"; if you give value-child bad values, he will kill you. But I have a proposal for how several apparent challenges (e.g. robustness to adversarial inputs proposed by the actor) are artifacts of e.g. the design patterns I outlined in this post. I'll outline why I think that realistic (e.g. not argmax) cognition/motivational structures automatically avoid these extreme difficulties.
6Alex Turner2mo
This seems like a misunderstanding. While I've previously communicated to you arguments about problems with manipulating embedded grading functions, that is not at all what this post is intended to be about. I'll edit the post to make the intended reading more obvious. None of this post's arguments rely on the grader being embedded and therefore physically manipulable. As I wrote in footnote 1: Anyways, replying in particular to: Open-ended domains are harder to grade robustly on all inputs because more stuff can happen, and the plan space gets exponentially larger since the branching factor is the number of actions. EG it's probably far harder to produce an emotionally manipulative-to-the-grader DOTA II game state (e.g. I look at it and feel compelled to output a ridiculously high number), than a manipulative state in the real world (which plays to e.g. their particular insecurities and desires, perhaps reminding them of triggering events from their past in order to make their judgments higher-variance).

I don't think that's the case.

I agree that the (unprompted) generative model is doing something kind of like: choose a random goal, then optimize it.

In some sense that does reflect the "plurality of realistic human goals." But I don't think it's a good way to reflect that diversity. It seems like you want to either (i) be able to pick which goal you pursue, (ii) optimize an aggregate of several goals.

Either way, I think that's probably best reflected by a deterministic reward function, and you'd probably prefer be mindful about what you are getting rather than randomly sampling from we...

For text-davinci-002 the goal is to have the model do what the user asked as well as it can, not to sample from possible worlds. For example, if the user asks "Is X true?" and the model's probability is 80%, the intended behavior is for the model to say "Probably" 100% of the time, not to say "Yes" 80% of the time and "No" 20% of the time.

This is often (usually?) the desired behavior. For pre-trained LMs people usually turn the temperature down (or use nucleus sampling or beam search or whatever) in order to get more reasonable behavior, but that introduce...

I think this is really exciting and I’m very interested see how it goes. I think the current set of problems and methodologies is solid enough that participants have a reasonable shot at making meaningful progress within a month. I also expect this to be a useful way to learn about language models and to generally be in a better position to think about alignment.

I think we’re still a long way from understanding model behavior well enough that we could e.g. rule out deceptive alignment, but it feels to me like recent work on LM interpretability is making re...

I'd summarize the core of this position as something like:

1. If everyone had access to smart AI, then a small group could destroy the world. (Or maybe even: a small group unconcerned with norms and laws could take over the world?)
2. It's unlikely that any politically viable form of defense / law enforcement / counter-terrorism would be able to prevent that.
3. So in order to prevent someone from destroying the world you need to more fundamentally change the structure of society prior to the broad accessibility of powerful AI.
4. One way you could do that in the nick of
...