The paper is frustratingly vague about what their context lengths are for the various experiments, but based on comparing figures 7 and 4, I would guess that the context length for Watermaze was 1-2 times as long as an episode (50 steps). (It does indeed look like they were embedding the 2D dark room observations into a 64-dimensional space, which is hilarious.)
I'm not sure I understand your second question. Are you asking about figure 4 in the paper (the same one I copied into this post)? There's no reward conditioning going on. They're also not really comparing like to like, since the AD and ED agents were trained on different data (RL learning trajectories vs. expert demonstrations).
Like I mentioned in the post, my story about this is that the AD agents can get good performance by, when the previous episode ends with reward 1, navigating to the position that the previous episode ended in. (Remember, the goal position doesn't change from episode to episode -- these "tasks" are insanely narrow!) On the other hand, the ED agent probably just picks some goal position and repeatedly navigates there, never adjusting to the fact that it's not getting reward.
My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.
Two interesting differences between the approaches discussed here and in my linked post:
When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer, and I'm quite happy to see that there's a lot of convergence between my private list and Paul's list here.
I thought it would be a useful exercise to diff my list with Paul's; I'll record the result in the rest of this comment without the expectation that it's useful to anyone else.
Points on both lists:
I won't try to list all of the things that Paul mentioned which weren't on my list, but some of the most useful (for me) were:
Finally, a few points which were on my list and not Paul's, and which I feel like writing out:
Hmm, I'm not sure I understand -- it doesn't seem to me like noisy observations ought to pose a big problem to control systems in general.
For example, suppose we want to minimize the number of mosquitos in the U.S., and we have access to noisy estimates of mosquito counts in each county. This may result in us allocating resources slightly inefficiently (e.g. overspending resources on counties that have fewer mosquitos than we think), but we'll still always be doing the approximately correct thing and mosquito counts will go down. In particular, I don't see a sense in which the error "comes to dominate" the thing we're optimizing.
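To make the mosquito example concrete, here is a minimal toy simulation. Everything in it is an assumption of mine for illustration: the number of counties, the budget, the multiplicative Gaussian survey noise, and the diminishing-returns rule for how spending reduces a county's count. It only shows that allocating by noisy estimates still drives the true total down.

```python
import math
import random


def simulate(noise_sd, rounds=10, n_counties=50, budget=10_000.0, seed=0):
    """Total true mosquito count after each round (index 0 is the starting total)."""
    rng = random.Random(seed)
    counts = [rng.uniform(500.0, 5000.0) for _ in range(n_counties)]
    totals = [sum(counts)]
    for _ in range(rounds):
        # Noisy survey: multiplicative Gaussian error, clipped at zero.
        estimates = [max(c * (1.0 + rng.gauss(0.0, noise_sd)), 0.0) for c in counts]
        est_total = sum(estimates)
        if est_total == 0.0:
            break  # the survey reports nothing left to target
        # Split a fixed budget in proportion to the *estimated* counts.
        spend = [budget * e / est_total for e in estimates]
        # Assumed diminishing returns: survivors = c * exp(-spend / c).
        counts = [c * math.exp(-s / c) if c > 0.0 else 0.0
                  for c, s in zip(counts, spend)]
        totals.append(sum(counts))
    return totals


exact = simulate(noise_sd=0.0)   # surveys report the true counts
noisy = simulate(noise_sd=0.3)   # surveys are off by roughly 30%
print(f"start: {exact[0]:.0f}  exact surveys: {exact[-1]:.0f}  noisy surveys: {noisy[-1]:.0f}")
```

In this toy model, proportional allocation under exact surveys is the best a fixed budget can do each round, so the noisy run can only end with at least as many mosquitos. The true count still falls every round, though: the noise costs some efficiency but never becomes the thing being optimized.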
One concern which does make sense to me (and I'm not sure if I'm steelmanning your point or just saying something completely different) is that under extreme optimization pressure, measurements might become decoupled from the thing they're supposed to measure. In the mosquito example, this would look like us bribing the surveyors to report artificially low mosquito counts instead of actually trying to affect real-world mosquito counts.
If this is your primary concern regarding Goodhart's Law, then I agree the model above doesn't obviously capture it. I guess it's more precisely a model of proxy misspecification.
This paper gives a mathematical model of when Goodharting will occur. To summarize: if
(1) a human has some collection s1,…,sn of things which she values,
(2) a robot has access to a proxy utility function which takes into account some strict subset of those things, and
(3) the robot can freely vary how much of s1,…,sn there are in the world, subject only to resource constraints that make the si trade off against each other,
then when the robot optimizes for its proxy utility, it will minimize all si's which its proxy utility function doesn't take into account. If you impose a further condition which ensures that you can't get too much utility by only maximizing some strict subset of the si's (e.g. assuming diminishing marginal returns), then the optimum found by the robot will be suboptimal for the human's true utility function.
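The summarized result can be illustrated with a small sketch. This is not the paper's model: the square-root utilities (standing in for diminishing marginal returns), the budget of 100, and the greedy optimizer are all assumptions I picked to keep the example short. The human values four attributes s1,…,s4, while the proxy only counts the first two.

```python
import math


def greedy_optimize(n, counted, budget=100.0, step=0.1):
    """Spend `budget` in increments of `step`, each time on whichever counted
    attribute yields the largest marginal gain in sum(sqrt(s_i))."""
    s = [0.0] * n
    for _ in range(round(budget / step)):
        best = max(counted, key=lambda i: math.sqrt(s[i] + step) - math.sqrt(s[i]))
        s[best] += step
    return s


def true_utility(s):
    """Assumed human utility: diminishing returns in every attribute."""
    return sum(math.sqrt(x) for x in s)


n = 4
s_proxy = greedy_optimize(n, counted=[0, 1])         # robot: proxy sees s1, s2 only
s_human = greedy_optimize(n, counted=list(range(n))) # optimum for the true utility

print("proxy optimum:", [round(x, 1) for x in s_proxy])
print("human optimum:", [round(x, 1) for x in s_human])
print("true utility at each:", true_utility(s_proxy), true_utility(s_human))
```

Because the attributes compete for one shared budget, the proxy optimizer never spends anything on the attributes it does not count, so they stay at their minimum of zero. With diminishing returns, that allocation scores worse on the true utility than spreading the budget across all four.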
That said, I wasn't super-impressed by this paper -- the above is pretty obvious and the mathematical model doesn't elucidate anything, IMO.
Moreover, I think this model doesn't interact much with the skeptical take about whether Goodhart's Law implies doom in practice. Namely, here are some things I believe about the world which this model doesn't take into account:
(1) Lots of the things we value are correlated with each other over "realistically attainable" distributions of world states. Or in other words, for many pairs si,sj of things we care about, it is hard (concretely, requires a very capable AI) to increase the amount of si without also increasing the amount of sj.
(2) The utility functions of future AIs will be learned from humans in such a way that as the capabilities of AI systems increase, so will their ability to model human preferences.
If (1) is true, then for each given capabilities level, there is some room for error for our proxy utility functions (within which an agent at that capabilities level won't be able to decouple our proxy utility function from our true utility function); this permissible error margin shrinks with increasing capabilities. If you buy (2), then you might additionally think that the actual error margin between learned proxy utility functions and our true utility function will shrink more rapidly than the permissible error margin as AI capabilities grow. (Whether or not you actually do believe that value learning will beat capabilities in this race probably depends on a whole lot of other empirical beliefs, or so it seems to me.)
This thread (which you might have already seen) has some good discussion about whether Goodharting will be a big problem in practice.
It seems to me that the meaning of the set C of cases drifts significantly between when it is first introduced and the "Implications" section. It further seems to me that clarifying what exactly C is supposed to be resolves the claimed tension between the existence of iterably improvable ontology identifiers and the difficulty of learning human concept boundaries.
Initially, C is taken to be a set of cases such that the question Q has an objective, unambiguous answer. Cases where the meaning of Q is ambiguous are meant to be discarded. For example, if Q is the question "Is the diamond in the vault?" then, on my understanding, C ought to exclude cases where something happens which renders the concepts "the diamond" and "the vault" ambiguous, e.g. cases where the diamond is ground into dust.
In contrast, in the section "Implications," the existence of iterably improvable ontology identifiers is taken to imply that the resulting ontology identifier would be able to answer the question Q posed in a much larger set of cases C′ in which the very meaning of Q relies on unspecified facts about the state of the world and how they interact with human values.
(For example, it seems to me that the authors think it implausible that an ontology identifier be able to answer a question like "Is the diamond in the vault?" in a case where the notion of "the vault" is ambiguous; the ontology identifier would need to first understand that what the human really wants to know is "Will I be able to spend my diamond?", reinterpret the former question in light of the latter, and then answer. I agree that an ontology identifier shouldn't be able to answer ambiguous and context-dependent questions like these, but it would seem to me that such cases should have been excluded from the set C.)
To dig into where specifically I think the formal argument breaks down, let me write out (my interpretation of) the central claim on iterability in more detail. The claim is:
Claim: Suppose there exists an initial easy set E0⊆C such that for any E0⊆E⊊C, we can find a predictor that does useful computation with respect to E. Then we can find a reporter that answers all cases in C correctly.
This seems right to me (modulo more assumptions on "we can find," not-too-largeness of the sets, etc.). But crucially, since the hypothesis quantifies over all sets E such that E0⊆E⊊C, this hypothesis becomes stronger the larger C is. In particular, if C were taken to include cases where the meaning of Q were fraught or context-dependent, then we should already have strong reason to doubt that this hypothesis is true (and therefore not be surprised when assuming the hypothesis produces counterintuitive results).
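To spell out the "stronger the larger C is" step, here is a short derivation in my own notation (the names H and P are mine, not the report's): write P(E) for "we can find a predictor that does useful computation with respect to E" and H(C) for the hypothesis of the claim.

```latex
H(C) \;:\equiv\; \forall E \,\bigl( E_0 \subseteq E \subsetneq C \;\Rightarrow\; P(E) \bigr)

% If C \subseteq C', then every E with E_0 \subseteq E \subsetneq C
% also satisfies E \subsetneq C', so H(C') quantifies over at least as many sets:
C \subseteq C' \;\Longrightarrow\; \bigl( H(C') \Rightarrow H(C) \bigr)
```

So passing from C to a strictly larger C′ only adds sets E that the hypothesis must cover (including C itself), which is why H(C′) is the harder assumption to believe.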
(Note that the ELK document is sensitive to concerns about questions being philosophically fraught, and only considers narrow ELK for cases where questions have unambiguous answers. It also seems important that part of the set-up of ELK is that the reporter must "know" the right answers and "understand" the meanings of the questions posed in natural language (for some values of "know" and "understand") in order for us to talk about eliciting its knowledge at all.)