AI ALIGNMENT FORUM

Jacob Hilton
Comments

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Jacob_Hilton, 2mo

It is interesting to note how views on this topic have shifted with the rise of outcome-based RL applied to LLMs. A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL, since it incentivizes choosing actions for reasons that humans endorse. See for example Anthropic's Core Views On AI Safety:

Learning Processes Rather than Achieving Outcomes

One way to go about learning a new task is via trial and error – if you know what the desired final outcome looks like, you can just keep trying new strategies until you succeed. We refer to this as “outcome-oriented learning”. In outcome-oriented learning, the agent’s strategy is determined entirely by the desired outcome and the agent will (ideally) converge on some low-cost strategy that lets it achieve this.

Often, a better way to learn is to have an expert coach you on the processes they follow to achieve success. During practice rounds, your success may not even matter that much, if instead you can focus on improving your methods. As you improve, you might shift to a more collaborative process, where you consult with your coach to check if new strategies might work even better for you. We refer to this as “process-oriented learning”. In process-oriented learning, the goal is not to achieve the final outcome but to master individual processes that can then be used to achieve that outcome.

At least on a conceptual level, many of the concerns about the safety of advanced AI systems are addressed by training these systems in a process-oriented manner. In particular, in this paradigm:

  • Human experts will continue to understand the individual steps AI systems follow because in order for these processes to be encouraged, they will have to be justified to humans.
  • AI systems will not be rewarded for achieving success in inscrutable or pernicious ways because they will be rewarded only based on the efficacy and comprehensibility of their processes.
  • AI systems should not be rewarded for pursuing problematic sub-goals such as resource acquisition or deception, since humans or their proxies will provide negative feedback for individual acquisitive processes during the training process.

At Anthropic we strongly endorse simple solutions, and limiting AI training to process-oriented learning might be the simplest way to ameliorate a host of issues with advanced AI systems. We are also excited to identify and address the limitations of process-oriented learning, and to understand when safety problems arise if we train with mixtures of process and outcome-based learning. We currently believe process-oriented learning may be the most promising path to training safe and transparent systems up to and somewhat beyond human-level capabilities.

Or Solving math word problems with process- and outcome-based feedback (DeepMind, 2022):

Second, process-based approaches may facilitate human understanding because they select for reasoning steps that humans understand. By contrast, outcome-based optimization may find hard-to-understand strategies, and result in less understandable systems, if these strategies are the easiest way to achieve highly-rated outcomes. For example in GSM8K, when starting from SFT, adding Final-Answer RL decreases final-answer error, but increases (though not significantly) trace error.

[...]

In contrast, consider training from process-based feedback, using user evaluations of individual actions, rather than overall satisfaction ratings. While this does not directly prevent actions which influence future user preferences, these future changes would not affect rewards for the corresponding actions, and so would not be optimized for by process-based feedback. We refer to Kumar et al. (2020) and Uesato et al. (2020) for a formal presentation of this argument. Their decoupling algorithms present a particularly pure version of process-based feedback, which prevent the feedback from depending directly on outcomes.

Or Let's Verify Step by Step (OpenAI, 2023):

Process supervision has several advantages over outcome supervision related to AI alignment. Process supervision is more likely to produce interpretable reasoning, since it encourages models to follow a process endorsed by humans. Process supervision is also inherently safer: it directly rewards an aligned chain-of-thought rather than relying on outcomes as a proxy for aligned behavior (Stuhlmüller and Byun, 2022). In contrast, outcome supervision is harder to scrutinize, and the preferences conveyed are less precise. In the worst case, the use of outcomes as an imperfect proxy could lead to models that become misaligned after learning to exploit the reward signal (Uesato et al., 2022; Cotra, 2022; Everitt et al., 2017).

In some cases, safer methods for AI systems can lead to reduced performance (Ouyang et al., 2022; Askell et al., 2021), a cost which is known as an alignment tax. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results show that process supervision in fact incurs a negative alignment tax. This could lead to increased adoption of process supervision, which we believe would have positive alignment side-effects. It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains.

It seems worthwhile to reflect on why this perspective has gone out of fashion:

  • The most obvious reason is the success of outcome-based RL, which seems to be outperforming process-based RL. Advocating for process-based RL no longer makes much sense when it is uncompetitive.
  • Outcome-based RL also isn't (yet) producing the kind of opaque reasoning that proponents of process-based RL may have been worried about. See for example this paper for a good analysis of the extent of current chain-of-thought faithfulness.
  • Outcome-based RL is leading to plenty of reward hacking, but this is (currently) fairly transparent from chain of thought, as long as this isn't optimized against. See for example the analysis in this paper.
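
To make the distinction between the two training signals concrete, here is a minimal sketch. Everything in it is hypothetical: the `Step` type, the `endorsed` labels, and the choice to score a trace by the fraction of endorsed steps are assumptions for illustration, not how any of the quoted papers compute their rewards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    endorsed: bool  # hypothetical label: would a human rater endorse this step?


def outcome_reward(final_answer: str, correct_answer: str) -> float:
    """Outcome-based signal: looks only at the final answer."""
    return 1.0 if final_answer == correct_answer else 0.0


def process_reward(steps: List[Step]) -> float:
    """Process-based signal (assumed form): fraction of steps a rater endorses."""
    if not steps:
        return 0.0
    return sum(step.endorsed for step in steps) / len(steps)


# A trace that lands on the right answer through a step a rater would reject.
trace = [
    Step("There are 2 boxes with 3 apples each.", endorsed=True),
    Step("2 + 3 = 6.", endorsed=False),  # wrong operation, right number
    Step("So there are 6 apples.", endorsed=True),
]

print("outcome reward:", outcome_reward("6", "6"))
print("process reward:", round(process_reward(trace), 3))
```

The outcome signal cannot tell this trace apart from a fully sound one, while the process signal penalizes the bad step. That is the sense in which outcome-based optimization can reduce final-answer error without reducing trace error.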

Some tentative takeaways:

  • There is strong pressure to walk over safety-motivated lines in the sand if (a) doing so is important for capabilities and/or (b) doing so doesn't pose a serious, immediate danger. People should account for this when deciding what future lines in the sand to rally behind. (I don't think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.)
  • In particular, I wouldn't be optimistic about attempting to rally behind a line in the sand like "don't optimize against the chain of thought", since I'd expect people to blow past this about as quickly as they blew past "don't optimize for outcomes" if and when it becomes substantially useful. N.B. I thought the paper did a good job of avoiding this pitfall, focusing instead on incorporating the potential safety costs into decision-making.
  • It can be hard to predict how dominant training techniques will evolve, and we should be wary of anchoring too hard on properties of models that are contingent on them. I would not be surprised if the "externalized reasoning property" (especially "By default, humans can understand this chain of thought") no longer holds in a few years, even if capabilities advance relatively slowly (indeed, further scaling of outcome-based RL may threaten it). N.B. I still think the advice in the paper makes sense for now, and could end up mattering a lot – we should just expect to have to revise it.
  • More generally, people designing "if-then commitments" should be accounting for how the state of the field might change, perhaps by incorporating legitimate ways for commitments to be carefully modified. This option value would of course trade off against the force of the commitment.
Jacob_Hilton's Shortform
Jacob_Hilton, 4mo*

I recently gave this talk at the Safety-Guaranteed LLMs workshop:

The talk is about ARC's work on low probability estimation (LPE), covering:

  • Theoretical motivation for LPE and (towards the end) activation modeling approaches (both described here)
  • Empirical work on LPE in language models (described here)
  • Recent work-in-progress on theoretical results
A bird's eye view of ARC's research
Jacob_Hilton, 10mo

It sounds like we are not that far apart here. We've been doing some empirical work on toy systems to try to make the leap from mechanistic interpretability "stories" to semi-formal heuristic explanations. The max-of-k draft is an early example of this, and we have more ambitious work in progress along similar lines. I think of this work in a similar way to you: we are not trying to test empirical assumptions (in the way that some empirical work on frontier LLMs is, for example), but rather to learn from the process of putting our ideas into practice.

Backdoors as an analogy for deceptive alignment
Jacob_Hilton, 10mo

For those who are interested in the mathematical details, but would like something more accessible than the paper itself, see this talk I gave about the paper:

A bird's eye view of ARC's research
Jacob_Hilton, 10mo

Thank you – this is probably the best critique of ARC's research agenda that I have read since we started working on heuristic explanations. This level of thoughtfulness in external feedback is very rare and I'm grateful for the detail and clarity you put into it. I don't think my response fully rebuts your central concern, but hopefully it gives a sense of my current thinking about it.

It sounds like we are in agreement that something very loosely heuristic explanation-flavored (interpreted so broadly as to include mechanistic interpretability, for example) can reasonably be placed at the root of the diagram, by which I mean that it's productive to try to explain neural network behaviors in this very loose sense, and to attempt to apply such explanations to downstream applications such as MAD/LPE/ELK etc. We begin to diverge, I think, about the extent to which ARC should focus on a more narrow conception of heuristic explanations. From least to most specific:

  1. Any version that is primarily mathematical rather than "story-centric"
  2. Some (mathematical) version that is consistent with our information-theoretic intuitions about what constitutes a valid explanation (i.e., in the sense of something like surprise accounting)
  3. Some such version that is loosely based on independence assumptions
  4. Some version that satisfies more specific desiderata for heuristic estimators (such as the ones discussed in the paper linked in (3), or in this more recent paper)

Opinions at ARC will differ, but (1) I feel pretty comfortable defending, (2) I think is quite a promising option to be considering, (3) seems like a reasonable best guess but I don't think we should be that wedded to it, and (4) I think is probably too specific (and with the benefit of hindsight I think we have focused too much on this in the past). ARC's research has actually been trending in the "less specific" direction over time, as should hopefully be evident from our most recent write-ups (with the exception of our recent paper on specific desiderata, which mostly covers work done in 2023), and I am quite unsure exactly where we should settle on this axis.

By contrast, my impression is that you would not really defend even (1) (although I am curious exactly where you come down on this axis, if you want to clarify). So I'll give what I see as the basic case for searching for a mathematical rather than a "story-centric" approach:

  • Mechanistic interpretability has so far yielded very little in the way of beating baselines at downstream tasks (this has been discussed at length elsewhere, see for example here, here and here), so I think it should still be considered a largely unproven approach (to be clear, this is roughly my view of all alignment approaches that aren't already in active use at labs, including ARC's, and I remain excited to see people's continued valiant attempts; my point is that the bar is low and a portfolio approach is appropriate).
  • Relying purely on stories clearly doesn't work at sufficient scale under worst-case assumptions (because the AI will have concepts you don't have words for), and there isn't a lot of evidence that this isn't indeed already a bottleneck in practice (i.e., current AIs may well already have concepts you don't have words for).
  • I think that ARC's worst-case, theoretical approach (described at zoom level 1) is an especially promising alternative to iterative, empirically-driven work. I think empirical approaches are more promising overall, but have correlated failure modes (namely, they could end up relying on correlated empirical contingencies that later turn out to be false), and have far more total effort going into them (arguably disproportionately so). Conditional on taking such an approach, story-centric methods don't seem super viable (how should one analyze stories theoretically?).
  • I don't really buy the argument that because a system has a lot of complexity, it can only be analyzed in ad-hoc ways. It seems to me that an analogous argument would have failed to make good predictions about the bitter lesson (i.e., by arguing that a simple algorithm like SGD should not be capable of producing great complexity in a targeted way). Instead, because neural nets are trained in an incremental, automated way based on mathematical principles, it seems quite possible to me that we can find explanations for them in a similar way (which is not an argument that can be applied to biological brains).

This doesn't of course defend (2)–(4) (which I would only want to do more weakly in any case). We've tried to get our intuitions for those across in our write-ups (as linked in (2)–(4) above), but I'm not sure there's anything succinct I can add here if those were unconvincing. I agree that puts us in the rather unfortunate position of sharing a reference class with Stephen Wolfram to many external observers (although hopefully our claims are not quite so overstated).

I think it's important for ARC to recognize this tension, and to strike the right balance between making our work persuasive to external skeptics on the one hand, and having courage in our convictions on the other hand (I think both have been important virtues in scientific development historically). Concretely, my current best guess is that ARC should:

  • (a) Avoid being too wedded to intuitive desiderata for heuristic explanations that we can't directly tie back to specific applications
  • (b) Search for concrete cases that put our intuitions to the test, so that we can quickly reach a point where either we no longer believe in them, or they are more convincing to others
  • (c) Also pursue research that is more agnostic to the specific form of explanation, such as work on low probability estimation or other applications
  • (d) Stay on the lookout for ideas from alternative theoretical approaches (including singular learning theory, sparsity-based approaches, computational mechanics, causal abstractions, and neural net-oriented varieties of agent foundations), although my sense is that object-level intuitions here just differ enough that it's difficult to collaborate productively. (Separately, I'd argue that proponents of all these alternatives are in a similar predicament, and could generally be doing a better job on analogous versions of (a)–(c).)

I think we have been doing all of (a)–(d) to some extent already, although I imagine you would argue that we have not been going far enough. I'd be interested in more thoughts on how to strike the right balance here.

Formal verification, heuristic explanations and surprise accounting
Jacob_Hilton, 1y

Yes, I think the most natural way to estimate total surprise in practice would be to use sampling like you suggest. You could try to find the best explanation for "the model does $bad_thing with probability less than 1 in a million" (which you believe based on sampling) and then see how unlikely $bad_thing is according to the resulting explanation. In the Boolean circuit worked example, the final 23-bit explanation is likely still the best explanation for why the model outputs TRUE on at least 99% of inputs, and we can use this explanation to see that the model actually outputs TRUE on all inputs.
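
As a rough illustration of what "believing based on sampling" buys you, here is a minimal sketch assuming i.i.d. samples and zero observed occurrences of $bad_thing. It uses the standard exact one-sided bound obtained by solving (1 - p)^n = alpha; the function names and the 95% confidence level are my own choices for the example.

```python
import math


def upper_bound_zero_hits(n_samples: int, alpha: float = 0.05) -> float:
    """Upper confidence bound on p after n i.i.d. samples with zero hits.

    Solves (1 - p)^n = alpha for p, giving a (1 - alpha) one-sided bound.
    """
    return -math.expm1(math.log(alpha) / n_samples)


def samples_needed(p_target: float, alpha: float = 0.05) -> int:
    """Smallest n for which zero hits gives an upper bound of at most p_target."""
    return math.ceil(math.log(alpha) / math.log1p(-p_target))


n = samples_needed(1e-6)
print("samples needed for a 1-in-a-million bound:", n)
print("bound after that many clean samples:", upper_bound_zero_hits(n))
```

At 95% confidence this comes out to roughly three million clean samples for a 1-in-a-million bound, which is why sampling alone stops being practical for much rarer events.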

Another possible approach is analogous to fine-tuning. You could start by using surprise accounting to find the best explanation for "the loss of the model is L" (where L is estimated during training), which should incentivize rich explanations of the model's behavior in general. Then to estimate the probability that the model does some rare $bad_thing, you could "fine-tune" your explanation using an objective that encourages it to focus on the relevant tails of the distribution. We have more ideas about estimating the probability of events that are too rare to estimate via sampling, and have been considering objectives other than surprise accounting for this. We plan to share these ideas soon.

Common misconceptions about OpenAI
Jacob_Hilton, 2y*, Review for 2022 Review

Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.

My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.

Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.

EDIT: See also this retrospective, posted later in May 2024.

Mode collapse in RL may be fueled by the update equation
Jacob_Hilton, 2y

I think KL/entropy regularization is usually used to prevent mode collapse partly because it has nice theoretical properties. In particular, it is easy to reason about the optimal policy for the regularized objective - see for example the analysis in the paper Equivalence Between Policy Gradients and Soft Q-Learning.
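
As a small illustration of why the regularized objective is easy to reason about, here is a sketch for a single-step (bandit) problem with objective E_pi[r] - beta * KL(pi || pi_ref). In that setting the optimal policy is the standard closed form pi*(a) proportional to pi_ref(a) * exp(r(a) / beta), with optimal value beta * log(sum_a pi_ref(a) * exp(r(a) / beta)). The rewards, reference policy and beta below are made-up numbers, and this is not code from the linked paper.

```python
import math
import random


def objective(policy, rewards, ref, beta):
    """Expected reward minus beta times KL(policy || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref) if p > 0)
    return expected_reward - beta * kl


def optimal_policy(rewards, ref, beta):
    """Closed-form maximizer: reference policy reweighted by exp(r / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


rewards = [1.0, 0.5, 0.0]  # hypothetical per-action rewards
ref = [0.2, 0.3, 0.5]      # hypothetical reference policy
beta = 0.5                 # hypothetical regularization strength

pi_star = optimal_policy(rewards, ref, beta)
best = objective(pi_star, rewards, ref, beta)

# Sanity check: no randomly drawn policy should beat the closed form.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() for _ in rewards]
    candidate = [x / sum(raw) for x in raw]
    assert objective(candidate, rewards, ref, beta) <= best + 1e-12

print("optimal policy:", [round(p, 4) for p in pi_star])
```

Every action keeps nonzero probability under the optimum for any finite reward and beta > 0, which is the sense in which this kind of regularization prevents mode collapse.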

Nevertheless, action-dependent baselines do appear in the literature, although the story is a bit confusing. This is my understanding of it from some old notes:

  • The idea was explored in Q-Prop. But unlike you, their intention was not to change the optimal policy, but rather to reduce the variance of the policy gradient. Therefore they also incorporated an additional term to cancel out the bias introduced by the action-dependent baseline. (Incidentally, perhaps this analysis is also relevant to understanding ACTDE.)
  • Later, The Mirage of Action-Dependent Baselines showed that in fact the variance reduction due to the action-dependent baseline was negligible, and the entire benefit of Q-Prop was essentially due to a bug! The implementation normalized advantage estimates, but failed to apply the same adjustment to the bias-correction term, which turned out to be independently helpful because it's essentially the DDPG training objective.
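
To see where the bias comes from, here is a toy sketch with a softmax policy over three actions, computing expected gradients exactly rather than by sampling. It is not Q-Prop's estimator: the logits, rewards and baselines are made-up numbers, and the correction term is just the exact gradient of the expected baseline.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_gradient(logits, rewards, baseline):
    """Exact E_{a ~ pi}[grad_logits log pi(a) * (r(a) - baseline(a))]."""
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        weight = pi[a] * (rewards[a] - baseline[a])
        for k in range(n):
            # d/d logit_k of log pi(a) for a softmax policy
            score = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += weight * score
    return grad


logits = [0.0, 1.0, -1.0]   # hypothetical policy parameters
rewards = [1.0, 0.0, 2.0]   # hypothetical rewards
zero = [0.0, 0.0, 0.0]

true_grad = expected_gradient(logits, rewards, zero)
constant_baseline = expected_gradient(logits, rewards, [0.7, 0.7, 0.7])
action_baseline = [0.5, -0.2, 1.5]
biased = expected_gradient(logits, rewards, action_baseline)

# Adding back the gradient of E_pi[baseline(a)] cancels the bias.
correction = expected_gradient(logits, action_baseline, zero)
corrected = [g + c for g, c in zip(biased, correction)]

print("true gradient:      ", [round(g, 4) for g in true_grad])
print("constant baseline:  ", [round(g, 4) for g in constant_baseline])
print("action-dep baseline:", [round(g, 4) for g in biased])
print("with correction:    ", [round(g, 4) for g in corrected])
```

A baseline that does not depend on the action leaves the expected gradient unchanged. An action-dependent one shifts it by the gradient of the expected baseline, so it changes the optimum unless a correction term is added.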
ARC is hiring theoretical researchers
Jacob_Hilton, 2y*

The questions on the take-home test vary in difficulty but are generally easier than olympiad problems, and should be accessible to graduates with relevant background. However, it is important to note that we are ultimately interested in research ability rather than the ability to solve self-contained problems under timed conditions. So although the take-home test forms part of our assessment, we also look at other signals such as research track-record (while recognizing that assessing research ability is unfortunately very hard).

(Note: I am talking about the current version of the test; it's possible that the difficulty will change as we refine our interview process.)

The effect of horizon length on scaling laws
Jacob_Hilton, 3y

I think the direction depends on what your expectations were – I'll try to explain.

First, some terminology: the term "horizon length" is used in the paper to refer to the number of timesteps over which the algorithm pays attention to rewards, as governed by the discount rate. In the biological anchors framework, the term "effective horizon length" is used to refer to a multiplier on the number of samples required to train the model, which is influenced by the horizon length and other factors. For clarity, I'll use the term "scaling multiplier" instead of "effective horizon length" in this comment. The paper studies the effect of the horizon length on the scaling multiplier in a toy MNIST setting.

One key takeaway is that the scaling multiplier is not simply proportional to the horizon length, as one might have naively expected. Instead, the number of samples required is the sum of two components, one that is inherent to the task and independent of the horizon length, and one that is proportional to the horizon length. Compared to the naive expectation, this means that training compute requirements are lower. On the other hand, this ignores reward sparsity, so you might expect training compute requirements to be higher once both horizon length and reward sparsity are accounted for.
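
A minimal numerical sketch of the difference between the two pictures, with entirely made-up coefficients (`INTRINSIC` and `PER_STEP` are hypothetical, not values from the paper): the additive form has a horizon-independent component plus a component proportional to horizon length, while the naive form extrapolates proportionally from the cost at horizon 1.

```python
def samples_additive(horizon: int, intrinsic: int, per_step: int) -> int:
    """Samples needed = task-intrinsic component + component proportional to horizon."""
    return intrinsic + per_step * horizon


def samples_proportional(horizon: int, samples_at_horizon_one: int) -> int:
    """Naive expectation: samples needed scale in direct proportion to horizon."""
    return samples_at_horizon_one * horizon


INTRINSIC = 1_000_000  # hypothetical horizon-independent component
PER_STEP = 50_000      # hypothetical cost per unit of horizon length

base = samples_additive(1, INTRINSIC, PER_STEP)
for horizon in (1, 10, 100):
    additive = samples_additive(horizon, INTRINSIC, PER_STEP)
    naive = samples_proportional(horizon, base)
    print(f"horizon {horizon:>3}: additive {additive:>11,}  naive {naive:>11,}")
```

The two agree at horizon 1 by construction, and the naive extrapolation overshoots more and more as the horizon grows. That is the sense in which the additive finding lowers training compute estimates relative to the naive expectation.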

The paper also lends some support to the modeling assumptions of the neural network anchor, by validating the hypotheses that (a) training compute requirements still scale as a power law in model size for reinforcement learning, and with a similar exponent, and (b) the scaling multiplier can indeed vary a lot between environments. This might make you put more weight on the neural network anchor, which could again have either directional effect.

The other takeaways are more methodological and I don't think have much of a directional effect.

Posts

Jacob_Hilton's Shortform (score 4, 4mo, 1 comment)
A bird's eye view of ARC's research (score 57, 10mo, 11 comments)
Backdoors as an analogy for deceptive alignment (score 57, 1y, 1 comment)
Formal verification, heuristic explanations and surprise accounting (score 79, 1y, 3 comments)
ARC is hiring theoretical researchers (score 58, 2y, 5 comments)
The effect of horizon length on scaling laws (score 13, 3y, 2 comments)
Scaling Laws for Reward Model Overoptimization (score 53, 3y, 5 comments)
Common misconceptions about OpenAI (score 64, 3y, 75 comments)
How much alignment data will we need in the long run? (score 18, 3y, 10 comments)
Deep learning curriculum for large language model alignment (score 17, 3y, 3 comments)