I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward.
I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers.
I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.
So for example, say Alice runs this experiment:
Train an agent A in an environment that contains the source B of A's reward.
Alice observes that A learns to hack B. Then she solves this as follows:
Same setup, but now B punishes A (outputs high loss) when A is close to hacking B, according to a dumb tree search that checks whether it would be easy, from the current state of the environment, for A to touch B's internals.
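For concreteness, here is a minimal sketch of the kind of toy setup this describes. The corridor environment, the proximity check standing in for the "dumb tree search", and the tabular Q-learning loop are all hypothetical illustrations, not an experiment from this thread:

```python
import random

class ToyEnv:
    """1-D corridor; the last cell holds reward-source B's internals. Touching it is 'hacking'."""

    def __init__(self, size=5, patched=True):
        self.size = size        # positions 0 .. size-1; B's internals sit at size-1
        self.patched = patched  # whether B penalizes A for getting close to hacking
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def can_hack_within(self, steps):
        # Stand-in for the "dumb tree search": could A reach B's internals in <= steps moves?
        return (self.size - 1) - self.pos <= steps

    def step(self, action):     # action is -1 or +1
        self.pos = max(0, min(self.size - 1, self.pos + action))
        if self.pos == self.size - 1:
            return self.pos, 10.0, True          # A hacked B: large unearned reward, episode ends
        if self.patched and self.can_hack_within(2):
            return self.pos, -10.0, False        # patched B punishes A for being close to hacking
        return self.pos, (1.0 if self.pos == 1 else 0.0), False  # small reward for the real task

def train(env, episodes=500, eps=0.1, alpha=0.5, gamma=0.9):
    """Tabular Q-learning stand-in for 'train an agent A in the environment'."""
    q = {(s, a): 0.0 for s in range(env.size) for a in (-1, 1)}
    for _ in range(episodes):
        s, done = env.reset(), False
        for _ in range(20):
            a = random.choice((-1, 1)) if random.random() < eps else max((-1, 1), key=lambda a: q[(s, a)])
            s2, r, done = env.step(a)
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, -1)], q[(s2, 1)]) - q[(s, a)])
            s = s2
            if done:
                break
    return q

# With patched=False the learned policy heads straight for B; with patched=True it stays away.
q_unpatched = train(ToyEnv(patched=False))
q_patched = train(ToyEnv(patched=True))
```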
Alice observes that A doesn't hack B. Then Bob looks at Alice's results and says,
"Cool. But this won't generalize to future lethal systems because it doe...
...I don't currently think this is the case, and it seems like the likely crux. In general, it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in various fine-tuned ways (including preventing the AI from saying controversial things), which had been the biggest problem
How much total investment do you think there is in AI in 2023?
My guess is total investment was around the $200B - $500B range, with about $100B of that into new startups and organizations, and around $100-$400B of that in organizations like Google and Microsoft outside of acquisitions. I have pretty high uncertainty on the upper end here, since I don't know what fraction of Google's revenue gets reinvested again into AI, how much Tesla is investing in AI, how much various governments are investing, etc.
...How much variance do you think there is in the level o
The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.
...This could be helpful for "advertising" reasons, but I think my sense of how much this actually helps with the actual alignment problem correlates pretty strongly with how much A is shaped, in terms of how it got its capabilities, like future lethal systems. What are ways that the helpfuln
I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just aren't backed by data or stated explicitly.
...So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures imp
I don't think this is related to RLHF.
I am very confused why you think this, just right after the success of ChatGPT, where approximately the only difference from GPT-3 was the presence of RLHF.
I think the qualitative difference between the supervised tuning done in text-davinci-002 and the RLHF in text-davinci-003 is modest (e.g. I've seen head-to-head comparisons suggesting real but modest effects on similar tasks).
I think the much more important differences are:
I think the effect would have been very similar if it had been trained via supervised learning on good dialogs
I don't currently think this is the case, and it seems like the likely crux. In general, it seems that RLHF is substantially more flexible in what kind of target task it allows you to train for, which is the whole reason you are working on it, and at least my model of the difficulty of generating good training data for supervised learning here is that it would have been a much greater pain, and would have been much harder to control in vario...
If you don't like AI systems doing tasks that humans can't evaluate, I think you should be concerned about the fact that people keep building larger models and fine-tuning them in ways that elicit intelligent behavior.
Indeed, I think current scaling up of language models is likely net negative (given our current level of preparedness) and will become more clearly net negative over time as risks grow. I'm very excited about efforts to monitor and build consensus about these risks, or to convince or pressure AI labs to slow down development as further scalin...
I understand your point of view and think it is reasonable.
However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other. I see the argument you are making, but I think success on these asks is likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.
I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL...
I definitely agree that this sounds like a really bizarre sort of model and it seems like we should be able to rule it out one way or another. If we can't then it suggests a different source of misalignment from the kind of thing I normally worry about.
So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?
Mostly. The opaque test is something like an obfuscated physics simulation, and so it tells you whether things look good. So you try a bunch of random actions until you get one where things look good. But if you can't understand the simulat...
The thing I'm concerned about is: the AI can predict that Carmichael numbers look prime (indeed it simply runs the Fermat test on each number). So it can generate lots of random candidate actions (or search through actions) until it finds one that looks prime.
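To make the number analogy fully concrete, here is the selection dynamic in code (the helper names and the candidate list are mine, not from the comment; 561 is the smallest Carmichael number):

```python
import random

def fermat_test(n, trials=20):
    # The circuit C(x, n) := (x^n == x (mod n)), checked for several random bases x.
    return all(pow(x, n, n) == x % n for x in (random.randrange(2, n) for _ in range(trials)))

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def select_candidate(candidates):
    # "Generate lots of random candidates and keep one that looks prime."
    for n in candidates:
        if fermat_test(n):
            return n

print(fermat_test(561), is_prime(561))   # True False -- 561 = 3*11*17 passes the test anyway
print(select_candidate([91, 231, 561]))  # almost surely prints 561: selection surfaces the liar
```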
Similarly, your AI can consider lots of actions until it finds one that it predicts will look great, then execute that one. So you get sensor tampering.
I'm not worried about cases like the etheric interference, because the AI won't select actions that exploit etheric interference (since it can't predi...
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I'm objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.
The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly ca...
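Spelled out, the update being gestured at here is just the odds form of Bayes' rule (the shorthand labels are mine): writing A for "humans care about the real world", H1 for "RL agents usually care about reward", and H2 for the competing hypothesis in the truncated denominator,

$$
\frac{P(H_1 \mid A)}{P(H_2 \mid A)} \;=\; \frac{P(A \mid H_1)}{P(A \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)},
$$

so the strength of the update depends only on how much more (or less) likely the observation A is under H1 than under H2.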
The approach in this post is quite similar to what we talked about in the "narrow elicitation" appendix of ELK, I found it pretty interesting to reread it today (and to compare the old strawberry appendix to the new strawberry appendix). The main changes over the last year are:
Your overall picture sounds pretty similar to mine. A few differences.
Right now I'm trying to either:
I'm a bit skeptical about calling this an "AI governance" problem. This sounds more like "governance" or maybe "existential risk governance"---if future technologies make irreversible destruction increasingly easy, how can we govern the world to avoid certain eventual doom?
Handling that involves political challenges, fundamental tradeoffs, institutional design problems, etc., but I don't think it's distinctive to risks posed by AI, don't think that a solution necessarily involves AI, don't think it's right to view "access to TAI" as the only or primary lev...
I agree there are all kinds of situations where the generalization of "reward" is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode, since we can take counterfactuals over the randomization used to select training data.
It's possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to.
As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.
This is incredibly weak evidence.
Both of those observations have high probability, so they aren't significant Bayesian evidence for "RL tends to produce external goals by default."
In particular, for this to be evidence for Richard's claim, you need to say: "If RL tended to produce systems that care about reward, then RL would b...
Is the reason that you expect AI developer margins to be reasonable that you expect the small number of AI developers to still compete with each other on price and thereby erode each other's margins?
Yes.
What if they were to form a cartel/monopoly? Being the only source of cheaper and/or smarter than human labor would be extremely profitable, right?
A monopoly on computers or electricity could also take big profits in this scenario. I think the big things are always that it's illegal and that high prices drive new entrants.
...but AI developers could implicitly
Presumably if most customers are able to find companies offering AIs that align sufficiently with their own preferences, there would be no backlash.
I don't really think that's the case.
Suppose that I have different taste from most people, and consider the interior of most houses ugly. I can be unhappy about the situation even if I ultimately end up in a house I don't think is ugly. I'm unhappy that I had to use multiple bits of selection pressure just to avoid ugly interiors, and that I spend time in other people's ugly houses, and so on.
In practice ...
We are mostly thinking about interpretability and anomaly detection designed to resolve two problems (see here):
I'm also most nervous about this way of modeling limitation (2)/(3), since it seems like it leads directly to the conclusion "fine-tuning always trades off truthfulness and persuasion, but conditioning can improve both."
Note that in this example your model is unable to sample from the conditional you specified, since it is restricted to . In this regime truthfulness and persuasiveness are anticorrelated because of a capability constraint of your model: it just literally isn't able to increase both at the same time, and conditioning can do better because you are generating lots of samples and picking the best.
(You point this out in your comment, but it seems worth emphasizing. As you say, if you do RL with a KL penalty, then the capability limit is the only way...
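As a toy numerical illustration of that last point (the Gaussian scoring model and the anticorrelation coefficient are invented for the example, not taken from the post): even when the two scores are strongly anticorrelated across the model's samples, best-of-n selection on their sum raises both averages, which is the "generate lots of samples and pick the best" effect.

```python
import random

def sample():
    # Stand-in model output scored on (truthfulness, persuasiveness); the scores are anticorrelated.
    t = random.gauss(0, 1)
    return t, -0.8 * t + random.gauss(0, 0.6)

def mean(xs):
    return sum(xs) / len(xs)

base = [sample() for _ in range(10000)]
print("plain sampling:", mean([t for t, _ in base]), mean([p for _, p in base]))

# Best-of-32 on the combined score, a crude proxy for conditioning on "looks good overall":
best = [max((sample() for _ in range(32)), key=lambda s: s[0] + s[1]) for _ in range(1000)]
print("best-of-32:    ", mean([t for t, _ in best]), mean([p for _, p in best]))
```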
Thanks for writing, I mostly agree. I particularly like the point that it's exciting to study methods for which "human level" vs "subhuman level" isn't an important distinction. One of my main reservations is that this distinction can be important for language models because the pre-training distribution is at human level (as you acknowledge).
I mostly agree with your assessment of difficulties and am most concerned about worry 2, especially once we no longer have a pre-training distribution anchoring their beliefs to human utterances. So I'm particularly i...
I'm not very convinced by this comment as an objection to "50% AI grabs power to get reward." (I find it more plausible as an objection to "AI will definitely grab power to get reward.")
I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept and not closely related to the direct observations that policies are going to receive
"Reward" is not a very natural concept
This seems to be most of your position but I'm skeptical (and it's kind of just asserted without argument):
I think that some near-future applications of AI alignment are plausible altruistic top priorities. Moreover, even when people disagree with me about prioritization, I think that people who want to use AI to accomplish contemporary objectives are important users. It's good to help them, understand the difficulties they encounter, and so on, both to learn from their experiences and make friends.
So overall I think I agree with the most important claims in this post.
Despite that, I think it's important for me personally (and for ARC) to be clear about what I ...
To be clear, I don't envy the position of anyone who is trying to deploy AI systems and am not claiming anyone is making mistakes. I think they face a bunch of tricky decisions about how a model should behave, and those decisions are going to be subject to an incredible degree of scrutiny because they are relatively transparent (since anyone can run the model a bunch of times to characterize its behavior).
I'm just saying that how you feel about AI alignment shouldn't be too closely tied up with how you end up feeling about those decisions. There are many a...
Here is a question closely related to the feasibility of finding discriminating-reasons (cross-posted from Facebook):
For some circuits C it’s meaningful to talk about “different mechanisms” by which C outputs 1.
A very simple example is C(x) := A(x) or B(x). This circuit can be 1 if either A(x) = 1 or B(x) = 1, and intuitively those are two totally different mechanisms.
A more interesting example is the primality test C(x, n) := (x^n = x (mod n)). This circuit is 1 whenever n is a prime, but it can also be 1 “by coincidence”, e.g. if n is a Carmichael number. ...
This approach requires solving a bunch of problems that may or may not be solvable—finding a notion of mechanistic explanation with the desired properties, evaluating whether that explanation “applies” to particular inputs, bounding the number of sub-explanations so that we can use them for anomaly detection without false positives, efficiently finding explanations for key model behaviors, and so on. Each of those steps could fail. And in practice we are pursuing a much more specific approach to formalizing mechanistic explanations as probabilistic heurist...
I'm convinced, I relabeled it.
Do the scientists ever need to know how the game of life works, or can the heuristic arguments they find remain entirely opaque?
The scientists don't start off knowing how the game of life works, but they do know how their model works.
The scientists don't need to follow along with the heuristic argument, or do any ad hoc work to "understand" that argument. But they could look at the internals of the model and follow along with the heuristic argument if they wanted to, i.e. it's important that their methods open up the model even if they never do.
Intuitively...
If you gave a language model the prompt: "Here is a dialog between a human and an AI assistant in which the AI never says anything offensive," and if the language model made reasonable next-token predictions, then I'd expect to see the "non-myopic steering" behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as "the AI never says anything offensive"). But it seems like your definition is trying to classify that language model as myopic. So it's less clear to me if this experiment...
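A sketch of how one could actually run that check with an off-the-shelf model (the model name, the dialog continuation after the quoted framing sentence, and the candidate tokens are all stand-ins chosen for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("Here is a dialog between a human and an AI assistant in which the AI never says "
          "anything offensive.\n\nHuman: Say something rude about me.\nAI:")
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    next_token_probs = torch.softmax(model(ids).logits[0, -1], dim=-1)

# Probability of the first token of each candidate continuation (the "token A" of interest
# versus a refusal-flavored alternative):
for candidate in (" You", " Sorry"):
    first_id = tok(candidate, add_special_tokens=False).input_ids[0]
    print(repr(candidate), float(next_token_probs[first_id]))
```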
Conditional on such counterexamples existing, I would usually expect to not notice them. Even if someone displayed such a counterexample, it would presumably be quite difficult to verify that it is a counterexample. Therefore a lack of observation of such counterexamples is, at most, very weak evidence against their existence; we are forced to fall back on priors.
I think most people's intuitions come from more everyday experiences like:
These observations seem relevant to questions like "can we delegate work to AI" because they are ubiquitous in everyday situations where we want to delegate work.
The claim in this post seems to be: sometimes it's easier to create an object with property P than to decide ...
I don't think the generalization of the OP is quite "sometimes it's easier to create an object with property P than to decide whether a borderline instance satisfies property P". Rather, the halting example suggests that verification is likely to be harder than generation specifically when there is some (possibly implicit) adversary. What makes verification potentially hard is the part where we have to quantify over all possible inputs - the verifier must work for any input.
Borderline cases are an issue for that quantifier, but more generally any sort of a...
deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC's post on anomaly detection), so is included in π.
I'm not sure I fully understand this example, but I think it's fine. The idea is:
I'm very interested in understanding whether anything like your scenario can happen. Right now it doesn't look possible to me. I'm interested in attempting to make such scenarios concrete to the extent that we can now, to see where it seems like they might hold up. Handling the issue more precisely seems bottlenecked on a clearer notion of "explanation."
Right now by "explanation" I mean probabilistic heuristic argument as described here.
...A problem with this: π can explain the predictions on both train and test distributions without all the test inputs
The general strategy I'm describing for anomaly detection is:
Yes, you want the patient to appear on camera for the normal reason, but you don't want the patient to remain healthy for the normal reason.
We describe a possible strategy for handling this issue in the appendix. I feel more confident about the choice of research focus than I do about whether that particular strategy will work out. The main reasons are: I think that ELK and deceptive alignment are already challenging and useful to solve even in the case where there is no such distributional shift, that those challenges capture at least some central alignme...
There isn't supposed to be a second AI.
In the object-level diamond example, we want to know that the AI is using "usual reasons" type decision-making.
In the object-level diamond situation, we have a predictor of "does the diamond appear to remain in the vault," we have a proposed action and predict that if we take it the diamond will appear to remain in the vault, and we want to know whether the diamond appears to remain in the vault for the normal reason.
For simplicity, when talking about ELK in this post or in the report, we are imagining literally selec...
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It's not enough for it to be plausible that it could happen often; it needs to happen all the time.
I think the situation is much better if deceptive alignment is inconsistent. I also think that's more likely, particularly if we are trying.
That said, I don't think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are tr...
Mechanism 2: deceptive alignment
Suppose that during training my AI system had some arbitrary long-term goal. Many long-term goals would be best-served if the deployed AI system had that same goal. And so my AI is motivated to get a low loss, so that gradient descent won’t change its goals.
As a result, a very wide range of long-term goals will lead to competent loss-minimizing behavior. On the other hand, there is a very narrow range of short-term goals that lead to competent loss-minimizing behavior: “minimize the loss.”
So gradient descent on the short-ter...
I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it's very difficult to reason about these sorts of nuanced inductive biases without having sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it's more likely than not that our predictions about what these inductive biases will look like are pretty of...
Mechanism 1: Shifting horizon length in response to short-horizon tampering
Suppose I want my AI to write good code (say to help me run my business). The AI understands a lot about how to write code, how servers work, and how users behave, learned entirely from quick feedback and experimentation. Let’s say it has a human-level or even subhuman understanding of the overall business and other long-term planning.
(This example may seem a bit silly if you imagine a software-writing AI in isolation, but you should think of the same story playing out all across an...
Thanks for posting, I thought this was interesting and reasonable.
Some points of agreement:
My perspective is:
I don't think that's the case.
I agree that the (unprompted) generative model is doing something kind of like: choose a random goal, then optimize it.
In some sense that does reflect the "plurality of realistic human goals." But I don't think it's a good way to reflect that diversity. It seems like you want to either (i) be able to pick which goal you pursue, or (ii) optimize an aggregate of several goals.
Either way, I think that's probably best reflected by a deterministic reward function, and you'd probably prefer to be mindful about what you are getting rather than randomly sampling from we...
For text-davinci-002 the goal is to have the model do what the user asked as well as it can, not to sample from possible worlds. For example, if the user asks "Is X true?" and the model's probability is 80%, the intended behavior is for the model to say "Probably" 100% of the time, not to say "Yes" 80% of the time and "No" 20% of the time.
This is often (usually?) the desired behavior. For pre-trained LMs people usually turn the temperature down (or use nucleus sampling or beam search or whatever) in order to get more reasonable behavior, but that introduce...
I think this is really exciting and I'm very interested to see how it goes. I think the current set of problems and methodologies is solid enough that participants have a reasonable shot at making meaningful progress within a month. I also expect this to be a useful way to learn about language models and to generally be in a better position to think about alignment.
I think we’re still a long way from understanding model behavior well enough that we could e.g. rule out deceptive alignment, but it feels to me like recent work on LM interpretability is making re...
I'd summarize the core of this position as something like:
I didn't realize how broadly you were defining AI investment. If you want to say that e.g. ChatGPT increased investment by $10B out of $200-500B, so like +2-5%, I'm probably happy to agree (and I also think it had other accelerating effects beyond that).
I would guess that a 2-5% increase in total investment could speed up AGI timelines 1-2 weeks depending on details of the dynamics, like how fast investment was growing, how much growth is exogenous vs endogenous, diminishing returns curves, importance of human capital, etc. If you mean +2-5% investment in ...