All of AdamGleave's Comments + Replies

Rachel did the bulk of the work on this post (well done!); I just provided some advice on the project and feedback on earlier manuscripts.

I wanted to share why I'm personally excited by this work in case it helps contextualize it for others.

We'd all like AI systems to be "corrigible", cooperating with us in correcting them. Cooperative IRL (CIRL) has been proposed as a solution to this. Indeed, Dylan Hadfield-Menell et al. show that CIRL is provably corrigible in a simple setting, the off-switch game.

Provably corrigible sounds great, but where there's a proof there... (read more)
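
For readers who haven't seen it, here is a rough numerical sketch of the off-switch game result (my own toy illustration, not the paper's formalism): the robot is uncertain about the utility U(a) of its proposed action, and a rational human only hits the switch when U(a) < 0, so deferring to the human is never worse in expectation than acting unilaterally or shutting itself down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Robot's belief over the (unknown to it) utility U(a) of its proposed action.
# Any distribution works; here a Gaussian straddling zero.
samples = rng.normal(loc=0.1, scale=1.0, size=100_000)

ev_act = samples.mean()                     # act immediately: E[U(a)]
ev_off = 0.0                                # switch itself off: utility 0
# Defer: a rational human permits the action iff U(a) >= 0, else switches it off.
ev_defer = np.maximum(samples, 0.0).mean()  # E[max(U(a), 0)]

print(f"act: {ev_act:.3f}   off: {ev_off:.3f}   defer: {ev_defer:.3f}")
# E[max(U, 0)] >= max(E[U], 0), so deferring weakly dominates: the robot is
# incentivized to leave the off-switch in the human's hands in this toy setting.
```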

I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them more quickly than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months, say) but quite worried that society may be slow enough to adapt that even years of gradual progress, with a clear sign that transformative AI is on the horizon, may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the... (read more)

Johannes Treutlein (3mo):
I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape. E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. It seems unclear whether these things would tend to balance out, or whether e.g. everyone will inevitably be exposed to some persuasion that causes irreparable damage. Of course, things could also work out better than expected, if our ability to keep AIs in check scales better than dangerous capabilities.

Thanks for this response, I'm glad to see more public debate on this!

The part of Katja's part C that I found most compelling was the argument that it may be in a given AI system's best interest to work within the system rather than aim to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that can take over the 2022 world. Bu... (read more)

Erik Jenner (3mo):
Interesting points, I agree that our response to part C doesn't address this well. AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though I'm not sure it's the most important crux). But I think there are other possible reasons to worry as well.

One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we can derive from that. It's harder to make an airtight x-risk argument using fast takeoff, since I don't think we have airtight arguments for p(fast takeoff) being close to 1, but it's still important to consider if we're figuring out our overall best guess, rather than trying to find a reasonably compact argument for AI x-risk. (To put this differently: the strongest argument for AI x-risk will of course consider all the ways in which things could go wrong, rather than just one class of ways that happens to be easiest to argue for.)

A more robust worry (and what I'd probably rely on for a compact argument) is something like What Failure Looks Like [https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like] Part 1: maybe AIs work within the system, in the sense that they don't take over the world in obvious, visible ways. They usually don't break laws in ways we'd notice, they don't kill humans, etc. On paper, humans "own" the entire economy, but in practice they have an increasingly hard time achieving the outcomes they want (though they might not notice that, at least for a while). This seems like a mechanism for AIs to collectively "take over the world" (in the sense that humans don't actually have control of the long-run trajectory of the universe anymore), even if no individual AI can break out of the system, and even if AIs aren't great at collaborating against humanity.

Addressing a few specific points

Thanks for the quick reply! I definitely don't feel confident in the 20 W number; I could believe 13 W is true for more energy-efficient (small) humans, in which case I agree your claim ends up being true some of the time (but, as you say, there's little wiggle room). Changing it to 1000x seems like a good solution, though, as it gives you plenty of margin for error.

This is a nitpick, but I don't think this claim is quite right (emphasis added):

 If a silicon-chip AGI server were literally 10,000× the volume, 10,000× the mass, and 10,000× the power consumption of a human brain, with comparable performance, I don’t think anyone would be particularly bothered—in particular, its electricity costs would still be below my local minimum wage!!

First, how much power does the brain use? 20 watts is StackExchange's answer, but I've struggled to find good references here. The appealingly named Appraising the brain's energy bu... (read more)

Steve Byrnes (4mo):
Thanks! Prior to your comment, the calculation in my head was 12 W × 10,000 × 10¢/kWh ≈ $12/hr < $14.25/hr. The biggest difference from you is that I had heard 12 watts for brain energy consumption somewhere, and neglected to check it. I don't recall where I had heard that, but for example, 12 W is in this article [https://www.scientificamerican.com/article/thinking-hard-calories/]. They used the 20% figure, but for resting metabolic rate they cite this [https://journals.physiology.org/doi/abs/10.1152/jappl.1993.75.6.2514], which says 1740 kcal/day (→16.9 W) in men and 1348 kcal/day (→13.1 W) in women, and the article turns 13.1 W into 12 W by sketchy rounding. That still presupposes that the 20% is valid in both genders.

I traced the "20%" back to here [https://pubmed.ncbi.nlm.nih.gov/11598490/], which cites papers from 1957 & 1960 (and 1997, but that's another secondary source). I downloaded the 1957 source (Kety, "The general metabolism of the brain in vivo". In: Metabolism of the nervous system (Richter D, ed), pp 221–237), and it did cite studies of both men and women, and suggested that it scales with brain mass. I don't understand everything that goes into the calculation, but they do say 20 W directly, so I certainly feel best about that number; still, AFAICT it remains likely that the power would be lower for smaller-than-average people, including most women. I'm still confused about the discrepancy with earlier in this paragraph, but I don't want to spend more time on it. ¯\_(ツ)_/¯

My intended meaning was that the "power consumption" of "a silicon-chip AGI server" was all-in power consumption including HVAC, but I can see how a reader could reasonably interpret my words as excluding HVAC. I specifically said "my local minimum wage" because I happen to live in a state (Massachusetts) with a high minimum wage of $14.25/hr. (The cost to the employer is of course a bit higher, thanks to legally-mandated employer taxes, sick days, sick-family days, etc.) Granted, we have unusually exp
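
For anyone checking the arithmetic in this exchange, here is the cost calculation spelled out, using only figures quoted above (12 W vs 20 W brain power, a 10,000x scale factor, and the 10¢/kWh electricity price from Steve's calculation):

```python
def hourly_electricity_cost(brain_watts, scale_factor, dollars_per_kwh):
    """Hourly cost of running `scale_factor` brain-equivalents of power."""
    kilowatts = brain_watts * scale_factor / 1000.0
    return kilowatts * dollars_per_kwh

for watts in (12, 20):
    cost = hourly_electricity_cost(watts, 10_000, 0.10)
    print(f"{watts} W brain, 10,000x: ${cost:.2f}/hr vs $14.25/hr minimum wage")
# 12 W -> $12.00/hr (below the MA minimum wage); 20 W -> $20.00/hr (above it),
# which is why the 10,000x version of the claim has little wiggle room.
# At 1,000x the cost is $1.20-$2.00/hr, hence the suggested extra margin.
```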

"The Floating Droid" example is interesting as there's a genuine ambiguity in the task specification here. In some sense that means there's no "good" behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that's outside the scope of this contest.) But it's interesting the interpretation flips with model scale, and in the opposite direction to what I'd have predicted (doing EV calculations are harder so I'd have expected scale to increase not decrease EV answers.) Follow-up questions... (read more)

Rohin Shah (3mo):
I think the inverse scaling here is going from "random answer" to "win/loss detection" rather than "EV calculation" to "win/loss detection".

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.


A rough heuristic I have is that if the idea you're introducing is highly novel, it's OK to not be rigorous. Your contribution is bringing this new, potentially very promising, idea to people's attention. You're seeking feedback on how promising it really is and where people are confused, which will be helpful for later formalizing it and studying it more rigo... (read more)

David Scott Krueger (5mo):
I think part of this has to do with growing pains in the LW/AF community... When it was smaller it was more like an ongoing discussion with a few people and signal-to-noise wasn't as important, etc.

Work that is still outside the academic Overton window can be brought into academia if it can be approached with the technical rigor of academia, and work that meets academic standards is much more valuable than work that doesn't; this is both because it can be picked up by the ML community, and because it's much harder to tell if you are making meaningful progress if your work doesn't meet these standards of rigor.

Strong agreement with this! I'm frequently told by people that you "cannot publish" on a certain area, but in my experience this is rarely true... (read more)

Presumably "too dismissive of speculative and conceptual research" is a direct consequence of increased emphasis on rigor. Rigor is to be preferred all else being equal, but all else is not equal.

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

I note that within rigorous fields, the downsides of rigor are not obvious: we can point to all the progress made; progress that wasn't made due to the neglect of conceptua... (read more)

David Scott Krueger (5mo):
Agree RE systemic blindspots, although the "algorithmic contribution" thing is sort of a known issue that a lot of senior people disagree with, IME.

A related dataset is Waterbirds, described in Sagawa et al. (2020), where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background.

The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.
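
If you wanted to do that filtering, a minimal sketch might look like the following (the file name, column names, and 0/1 encodings are hypothetical; the real Waterbirds metadata may differ):

```python
import pandas as pd

# Hypothetical metadata file for a Waterbirds-style dataset; this just
# illustrates the filtering step. Assume bird_type and background are both
# encoded as 0 (land) / 1 (water).
meta = pd.read_csv("waterbirds_metadata.csv")

train = meta[meta["split"] == "train"]

# Keep only examples where the background matches the bird type, so that the
# spurious correlation is perfect on the training set (as in HappyFaces).
perfectly_spurious_train = train[train["bird_type"] == train["background"]]

# Leave val/test untouched: they still contain the minority groups
# (waterbird-on-land, landbird-on-water) needed to detect whether the model
# latched onto the background.
```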

My sense is that Stuart's assumption of an initially specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing a reward function learned from other sources of human feedback, like preference comparison.

IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn't really that unique to IRD -- Bayesian IRL or preference comparison would have the same property.
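
To illustrate what an explicit distribution over reward functions buys you, here is a minimal toy sketch of Bayesian inference over a discrete set of candidate reward functions from a single preference comparison, using a Boltzmann-rational (Bradley-Terry style) likelihood. This is my own illustration, not IRD's or any particular paper's code.

```python
import numpy as np

# Candidate reward functions, represented as vectors of per-outcome rewards.
candidates = np.array([
    [1.0, 0.0, 0.0],   # R1: only outcome 0 is good
    [0.0, 1.0, 0.0],   # R2: only outcome 1 is good
    [0.5, 0.5, 0.0],   # R3: outcomes 0 and 1 equally good
])
prior = np.full(len(candidates), 1 / 3)

def preference_likelihood(rewards, chosen, rejected, beta=5.0):
    """Boltzmann-rational probability of preferring `chosen` over `rejected`."""
    return 1 / (1 + np.exp(-beta * (rewards[chosen] - rewards[rejected])))

# Observed comparison: the human preferred outcome 0 over outcome 1.
likelihood = np.array([preference_likelihood(r, 0, 1) for r in candidates])
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)  # mass shifts toward R1; R3 stays plausible, R2 drops.
```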

Rohin Shah (1y):
Yeah, I agree with that. (I don't think we have experience with deep Bayesian versions of IRL / preference comparison at CHAI, and I was thinking about advice on who to talk to.)

I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)

Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You'... (read more)

Thanks for the clarification! I agree that if the planner does not have access to the reward function then it will not be able to solve the task. Though, as you say, it could explore more given the uncertainty.

Most model-based RL algorithms I've seen assume they can evaluate the reward function in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don't see how you solve this problem in general in a sample-efficient manner otherwise.

One class of model-based RL approaches is based on [model-predictive control](ht... (read more)
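
As a concrete example of the kind of planner I have in mind, here is a minimal random-shooting sketch (my own toy illustration, not any particular paper's algorithm). The key line is that it queries the reward function on imagined states, which is exactly the assumption at issue in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_with_mpc(dynamics, reward_fn, state, horizon=5, n_candidates=100):
    """Random-shooting MPC: sample action sequences, roll them out in the
    learned dynamics model, score each rollout by querying the reward
    function on the imagined states, and return the best first action."""
    best_action, best_return = None, -np.inf
    for _ in range(n_candidates):
        actions = rng.integers(0, 2, size=horizon)  # toy binary action space
        s, total = state, 0.0
        for a in actions:
            s = dynamics(s, a)        # imagined next state from the learned model
            total += reward_fn(s, a)  # reward queried at a hypothetical state
        if total > best_return:
            best_return, best_action = total, int(actions[0])
    return best_action

# Toy usage: integer state, action 1 increments it, reward for reaching state 3.
first_action = plan_with_mpc(lambda s, a: s + a, lambda s, a: float(s == 3), state=0)
```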

Steve Byrnes (2y):
Hmm. AlphaZero can evaluate the true reward function in arbitrary states. MuZero can't—it tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I googled "model-based RL Atari" and the first hit was this [https://arxiv.org/abs/1903.00374] which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I'm not intimately familiar with the deep RL literature, so I wouldn't know what's typical and I'll take your word for it, but it does seem that both possibilities are out there.

Anyway, I don't think the neocortex can evaluate the true reward function in arbitrary states, because it's not a neat mathematical function; it involves messy things like the outputs of millions of pain receptors, hormones sloshing around, the input-output relationships of entire brain subsystems containing tens of millions of neurons, etc. So I presume that the neocortex tries to learn the reward function by supervised learning from observations of past rewards—and that's the whole thing with TD learning and dopamine.

I added a new sub-bullet at the top to clarify that it's hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the "other possible explanations" section at the bottom saying what I said in the paragraph just above. Thank you.

Well, the rats are trying to do the rewarding thing after zero samples, so I don't think "sample-efficiency" is quite the right framing. In ML today, the reward function is typically a function of states and actions, not "thoughts". In a brain, the reward can depend directly on what you're imagining doing or planning to do, or even just what you're thinking about. That's my proposal here. Well, I guess you could say that this is still a "normal MDP", but where "having thoughts" and "having ideas" etc. a

I'm a bit confused by the intro saying that RL can't do this, especially since you later on say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task (see the code sketch below the list), with e.g. a 2D binary state space (salt deprived or not; salt water or not) and two actions (press lever or no-op). The idea would be:

  - Agent explores by pressing lever, learns transition dynamics that pressing lever => spray of salt water.

  - Planner concludes that any sequence of actions involving pressing lever wi... (read more)
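
Spelling the toy version out in code (a sketch of my own, which assumes the planner can query the ground-truth reward function on hypothetical states; that assumption is exactly what Steve pushes back on in the reply below):

```python
# Toy salt-deprivation task: state = (deprived, tasting_salt),
# actions: 0 = no-op, 1 = press lever. All names here are illustrative.

def learned_dynamics(state, action):
    """Transition model learnable from (non-deprived) experience:
    pressing the lever leads to tasting salt water; deprivation is exogenous."""
    deprived, _ = state
    return (deprived, action == 1)

def reward(state):
    """Ground-truth reward: salt water is good when deprived, aversive otherwise."""
    deprived, tasting_salt = state
    if not tasting_salt:
        return 0.0
    return 1.0 if deprived else -1.0

# One-step planner that queries the reward on *imagined* next states.
for deprived in (False, True):
    state = (deprived, False)
    best = max((0, 1), key=lambda a: reward(learned_dynamics(state, a)))
    print(f"deprived={deprived}: best action = {'press' if best else 'no-op'}")
# Output: presses the lever only when deprived, despite never having been
# rewarded for pressing during training. The catch (see Steve's reply): this
# assumes the planner can evaluate `reward` on hypothetical states.
```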

Good question! Sorry I didn't really explain. The missing piece is "the planner will conclude this has positive reward". The planner has no basis for coming up with this conclusion, that I can see.

In typical RL as I understand it, regardless of whether it's model-based or model-free, you learn about what is rewarding by seeing the outputs of the reward function. Like, if an RL agent is playing an Atari game, it does not see the source code that calculates the reward function. It can try to figure out how the reward function works, for sure, but when it doe... (read more)

Thanks for the post; this is my favourite formalisation of optimisation so far!

One concern I haven't seen raised so far is that the definition seems very sensitive to the choice of configuration space. As an extreme example, for any given system, I can always augment the configuration space with an arbitrary number of dummy dimensions, and choose the dynamics such that these dummy dimensions always get set to all zero after each time step. Now, I can make the basin of attraction arbitrarily large, while the target configuration set remains a fixed si... (read more)
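
Spelling out the construction (my notation, using counting measure for concreteness): let the original system have configuration space X, time evolution f, target set T, and basin of attraction B. Augmenting with k dummy dimensions that get zeroed each step gives:

```latex
\[
X' = X \times \{0,1\}^k, \qquad
f'(x, z) = \bigl(f(x),\, 0^k\bigr), \qquad
T' = T \times \{0^k\},
\]
\[
B' = B \times \{0,1\}^k
\quad\Longrightarrow\quad
\frac{|B'|}{|T'|} \;=\; 2^k \,\frac{|B|}{|T|}.
\]
```

So the basin can be made arbitrarily large relative to the fixed-size target set by increasing k, even though nothing about the underlying system has changed.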

I feel like there are three facets to "norms" vs. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights; indeed, most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the 'negative' rights is probably more important, but it seems to me t... (read more)

Rohin Shah (4y):
Yeah, agreed with all of that, thanks for the comment. You could definitely try to figure out each of these things individually, e.g. learning constraints that can be used with Constrained Policy Optimization [https://arxiv.org/abs/1705.10528] is along the "what not to do" axis, and a lot of the multiagent RL work is looking at how we can get some norms to show up with decentralized training. But I feel a lot more optimistic about research that is trying to do all three things at once, because I think the three aspects do interact with each other. At least, the first two feel very tightly linked, though they probably can be separated from the multiagent setting.
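
For reference, the constrained-RL setup that Constrained Policy Optimization targets is the standard constrained MDP objective (stated here from memory, so treat the notation as a sketch):

```latex
\[
\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\!\Bigl[\sum_t \gamma^t R(s_t, a_t)\Bigr]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\!\Bigl[\sum_t \gamma^t C_i(s_t, a_t)\Bigr] \le d_i
\quad \text{for each } i,
\]
```

where the cost functions C_i (learned or hand-specified) encode the "what not to do" constraints and the d_i are their budgets, which is what makes it a natural fit for that first axis.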