Wiki Contributions


The paper is frustratingly vague about what their context lengths are for the various experiments, but based off of comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times as long as an episode length(=50 steps). (It does indeed look like they were embedding the 2d dark room observations into a 64-dimensional space, which is hilarious.)

I'm not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There's no reward conditioning going on. They're also not really comparing like to like, since the AD and ED agents were trained on different data (RL learning trajectories vs. expert demonstrations). 

Like I mentioned in the post, my story about this is that the AD agents can get good performance by, when the previous episode ends with reward 1, navigating to the position that the previous episode ended in. (Remember, the goal position doesn't change from episode to episode -- these "tasks" are insanely narrow!) On the other hand, the ED agent probably just picks some goal position and repeatedly navigates there, never adjusting to the fact that it's not getting reward.

My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.

Two interesting differences between the approaches discussed here and in my linked post:

  • In my post, I assumed that the generative model was trained on a data set which included rewards (for example, humans playing Breakout, where the reward is provided by the environment; or a setting in which rewards can be provided by a reward model trained with human feedback). In contrast, you seem to be primarily considering settings, like language models trained on the internet, in which rewards are not provided (or at least, only provided implicitly in the sense that "here's some Nobel prize winning research: [description of research]" implicitly tells the model that the given research is the type of research that good researchers produce, and thus kinda acts as a reward label). My assumption trivializes the problem of making a quantilizer (since we can just condition on reward in the top 5% of previously observed rewards). But your assumption might be more realistic, in that the generative models we try to get superhuman performance out of won't be trained on data that includes rewards, unless we intentionally produce such data sets.
  • My post focuses on a particular technique for improving capabilities from baseline called online generative modeling; in this scheme, after pretraining, the generative model starts an online training phase in which episodes that it outputs are fed back into the generative model as new inputs. Over time, this will cause the distribution of previously-observed rewards to shift upwards, and with it the target quantile. Note that if the ideas you lay out here for turning a generative model into a quantilizer work, then you can stack online generative modeling on top. Why would you do this? It seems like you're worried that your techniques can safely produce the research of a pretty good biologist but not of the world's best biologist on their best day. One way around this is to just ask your generative model to produce the research of a pretty good biologist, but use the online generative modeling trick to let its expectation of what pretty good biology research looks like drift up over time. Would this be safer? I don't know, but it's at least another option.

When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer, and I'm quite happy to see that there's a lot of convergence between my private list and Paul's list here. 

I thought it would be a useful exercise to diff my list with Paul's; I'll record the result in the rest of this comment without the expectation that it's useful to anyone else.

Points on both lists:

  • Eliezer's "first critical try" framing downplays the importance of trial-and-error with non-critical tries.
  • It's not clear that a "pivotal act" by an aligned AI is the only way to prevent unaligned AI systems from being created.
  • Eliezer badly equivocates between "alignment is hard"/"approach X to alignment doesn't obviously solve it" and "alignment is impossible to solve within our time limit"/"approach X to alignment is doomed."
  • Deceptive behavior may arise from AI systems before they are able to competently deceive us, giving us some chances to iterate.
  • Eliezer's arguments for fast takeoffs aren't precise enough to warrant his confidence.
  • Eliezer's reasoning on generalization across distributional shift seems sloppy. Paul doesn't dig into this much, but I would add that there are approaches to reasoning about the inductive biases of ML systems which, together with Anthropic-esque empirical work on how things scale with capabilities, could give us some measure of confidence that a promising-looking alignment scheme will generalize.
  • Based on recent work, ML systems might be much more interpretable than Eliezer seems to think.
  • Eliezer doesn't seriously engage with any of the most promising approaches to alignment (and by his own admission, probably could not pass an ITT for them).
  • Debate-esque strategies for checking outputs of powerful AI systems aren't obviously doomed by Eliezer's concerns about coordination.
  • Eliezer's argument that it's impossible to train a powerful agent by imitating human thought seem bad.
  • Regarding the question "Why did no one but Eliezer write a List of Lethalities?" a pretty plausible answer is "because List of Lethalities was not an especially helpful document and other researchers didn't think writing it was a priority."

I won't try to list all of the things that Paul mentioned which weren't on my list, but some of the most useful (for me) were:

  • Eliezer's doomy stories often feature a superintelligent AI system which is vastly smarter than it needs to be in order to kill us, which is a bit unrealistic since these stories ought to be about the first AI which is powerful enough to attempt permanently disempowering us. To patch this story, you need to either imagine a less powerful system turning dangerous or humans having already made aligned systems up to the level of capabilities of the new dangerous system, both of which feel less scary than the classic "atomized by nanobots" stories.
  • AI systems will be disincentivized from hiding their capabilities, since we'll be trying to produce AI systems with powerful capabilities.
  • Approaches to alignment are disjunctive, so pessimistic cases need to seriously engage with an existential quantifier over the research humans (perhaps assisted by whatever AI research assistants we can safely produce) can perform in the coming ~decades.
  • Since R&D is out-of-distributions for humans, we might expect to have a comparative advantage in dealing with deception from AI systems.

Finally, a few points which were on my list and not Paul's, and which I feel like writing out:

  • "Consequentialist which plans explicitly using exotic decision theory" is not a likely shape for the first superintelligent AI systems to take, but many of Eliezer's doomy arguments seem to assume AI systems of that form. Now, it's true that the AI systems we build might figure out that agents of that form are especially powerful and invest time into trying to build them. But that's a problem we can hopefully leave to our aligned superintelligent research assistants; building such aligned research assistants seems much less doomed.
  • (This is a disagreement with both Paul and Eliezer.) Contra the view that capabilities will necessarily improve a lot before alignment failures start being a problem, it seems plausible to me that many commercial applications for AI might rely on solving alignment problems. You can't deploy your smart power grid if it keeps doing unsafe things.
  • Eliezer's view that you're not likely to make progress on alignment unless you figured out it was a problem by yourself seems insane to me. I can't think of any other research field like this ("you can't be expected to make progress on mitigating climate change unless you independently discovered that climate change would be a problem"), and I'm not sure where Eliezer's opinion that alignment is an exception is coming from.

Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general.

For example, suppose we want to minimize the number of mosquitos in the U.S., and we access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a sense in which the error "comes to dominate" the thing we're optimizing.

One concern which does make sense to me (and I'm not sure if I'm steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become decoupled from the thing they're supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts.

If this is your primary concern regarding Goodhart's Law, then I agree the model above doesn't obviously capture it. I guess it's more precisely a model of proxy misspecification.

This paper gives a mathematical model of when Goodharting will occur. To summarize: if

(1) a human has some collection  of things which she values,

(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and

(3) the robot can freely vary how much of  there are in the world, subject only to resource constraints that make the  trade off against each other,

then when the robot optimizes for its proxy utility, it will minimize all 's which its proxy utility function doesn't take into account. If you impose a further condition which ensures that you can't get too much utility by only maximizing some strict subset of the 's (e.g. assuming diminishing marginal returns), then the optimum found by the robot will be suboptimal for the human's true utility function.

That said, I wasn't super-impressed by this paper -- the above is pretty obvious and the mathematical model doesn't elucidate anything, IMO.

Moreover, I think this model doesn't interact much with the skeptical take about whether Goodhart's Law implies doom in practice. Namely, here are some things I believe about the world which this model doesn't take into account:

(1) Lots of the things we value are correlated with each other over "realistically attainable" distributions of world states. Or in other words, for many pairs  of things we care about, it is hard (concretely, requires a very capable AI) to increase the amount of  without also increasing the amount of .

(2) The utility functions of future AIs will be learned from humans in such a way that as the capabilities of AI systems increase, so will their ability to model human preferences.

If (1) is true, then for each given capabilities level, there is some room for error for our proxy utility functions (within which an agent at that capabilities level won't be able to decouple our proxy utility function from our true utility function); this permissible error margin shrinks with increasing capabilities. If you buy (2), then you might additionally think that the actual error margin between learned proxy utility functions and our true utility function will shrink more rapidly than the permissible error margin as AI capabilities grow. (Whether or not you actually do believe that value learning will beat capabilities in this race probably depends on a whole lot of other empirical beliefs, or so it seems to me.)

This thread (which you might have already seen) has some good discussion about whether Goodharting will be a big problem in practice.

It seems to me that the meaning of the set  of cases drifts significantly from when it is first introduced and the "Implications" section. It further seems to me that clarifying what exactly  is supposed to be resolves the claimed tension between the existence of iterably improvable ontology identifiers and difficulty of learning human concept boundaries.

Initially,  is taken to be a set of cases such that the question  has an objective, unambiguous answer. Cases where the meaning of  are ambiguous are meant to be discarded. For example, if  is the question "Is the diamond in the vault?" then, on my understanding,  ought to exclude cases where something happens which renders the concepts "the diamond" and "the vault" ambiguous, e.g. cases where the diamond is ground into dust.

In contrast, in the section "Implications," the existence of iterably improvably ontology identifiers is taken to imply that the resulting ontology identifier would be able to answer the question  posed in a much larger set of cases  in which the very meaning of  relies on unspecified facts about the state of the world and how they interact with human values. 

(For example, it seems to me that the authors think it implausible that an ontology identifier be able to answer a question like "Is the diamond in the vault?" in a case where the notion of "the vault" is ambiguous; the ontology identifier would need to first understand that what the human really wants to know is "Will I be able to spend my diamond?", reinterpret the former question in light of the latter, and then answer. I agree that an ontology identifier shouldn't be able to answer ambiguous and context-dependent questions like these, but it would seem to me that such cases should have been excluded from the set .)

To dig into where specifically I think the formal argument breaks down, let me write out (my interpretation) of the central claim on iterability in more detail. The claim is:

Claim: Suppose there exists an initial easy set  such that for any , we can find a predictor that does useful computation with respect to . Then we can find a reporter that answers all cases in  correctly.

This seems right to me (modulo more assumptions on "we can find," not-too-largeness of the sets, etc.). But crucially, since the hypothesis quantifies over all sets  such that , this hypothesis becomes stronger the larger  is. In particular, if  were taken to include cases where the meaning of  were fraught or context-dependent, then we should already have strong reason to doubt that this hypothesis is true (and therefore not be surprised when assuming the hypothesis produces counterintuitive results). 

(Note that the ELK document is sensitive to concerns about questions being philosophically fraught, and only considers narrow ELK for cases where questions have unambiguous answers. It also seems important that part of the set-up of ELK is that the reporter must "know" the right answers and "understand" the meanings of the questions posed in natural language (for some values of "know" and "understand") in order for us to talk about eliciting its knowledge at all.)