Adam Scholl


I agree we don’t currently know how to prevent AI systems from becoming adversarial, and that until we do it seems hard to make strong safety cases for them. But I think this inability is a skill issue, not an inherent property of the domain, and traditionally the core aim of alignment research was to gain this skill.

Plausibly we don’t have enough time to figure out how to gain as much confidence that transformative AI systems are safe as we typically have about e.g. single airplanes, but in my view that’s horrifying, and I think it’s useful to notice how different this situation is from the sort humanity is typically willing to accept.

Incorrect: OpenAI leadership is dismissive of existential risk from AI.

Why, then, would they continue to build the technology which causes that risk? Why do they consider it morally acceptable to build something which might well end life on Earth?

I don’t expect a discontinuous jump in AI systems’ generality or depth of thought from stumbling upon a deep core of intelligence

I felt surprised reading this, since "ability to automate AI development" feels to me like a central example of a "deep core of intelligence"—i.e., of a cognitive ability which makes attaining many other cognitive abilities far easier. Does it not feel like a central example to you?

Your posts about the neocortex have been a plurality of the posts I've been most excited to read this year. I'm super interested in the questions you're asking, and it drives me nuts that they're not asked more in the neuroscience literature.

But there's an aspect of these posts I've found frustrating, which is something like the ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."

Interestingly, I also have this gripe when reading Friston and Hawkins. And I feel like I also have this gripe about my own reasoning, when I think about this stuff—it feels phenomenologically like the only way I know how to generate hypotheses in this domain is by inducing a particular sort of temporary overconfidence, or something.

I don't feel incentivized to do this nearly as much in other domains, and I'm not sure what's going on. My lead hypothesis is that in neuroscience, data is so abundant, and theories/frameworks so relatively scarce, that it's unusually helpful to ignore lots of things (e.g. via the "take as given x, y, z, and p" motion) in order to make conceptual progress. And maybe there's just so much available data here that it would be terribly Sisyphean to try to justify all the things one takes as given when forming or presenting intuitions about underlying frameworks. (Indeed, my lead hypothesis for why so many neuroscientists seem to employ strategies like, "contribute to the 'figuring out what roads do' project by spending their career measuring the angles of stop-sign poles relative to the road," is that they feel it's professionally irresponsible, or something, to theorize about underlying frameworks without first trying to concretely falsify a mountain of assumptions.)

I think some amount of this motion is helpful for avoiding self-delusion, and the references in your posts make me think you do it at least a bit already. So I guess I just want to politely suggest that explicating these more might make it easier for readers to understand your intuitions. (And I say this super gratefully: I'm really glad you write these posts regardless! If trying to do this would turn you into a stop sign person, don't do it!)

I have many proto-questions about your model, and don't want to spend the time to flesh them all out. But here are some sketches that currently feel top-of-mind:

  • Say there exist genes that confer advantage in math-ey reasoning. By what mechanism is this advantage mediated, if the neocortex is uniform? One story, popular among the "stereotypes of early 2000s cognitive scientists" section of my models, is that brains have an "especially suitable for maths" module, and that genes induce various architectural changes which can improve or degrade its quality. What would a neocortical uniformist's story be here—that genes induce architectural changes which alter the quality of the One Learning Algorithm in general? If you explain it as genes having the ability to tweak hyperparameters or the gross wiring diagram in order to degrade or improve certain circuits' ability to run algorithms this domain-specific, is it still explanatorily useful to describe the neocortex as uniform?
    • My quick, ~90 min investigation into whether neuroscience as a field buys the neocortical uniformity hypothesis suggested it's fairly controversial. Do you know why? Are the objections mostly similar to those of Marcus et al.?
  • Do you have the intuition that aspects of the neocortical algorithm itself (or the subcortical algorithms themselves) might be safety-relevant? Or is your safety-relevance intuition mostly about the subcortical steering mechanism? (Fwiw, I have the former intuition, in that I'm suspicious some of the features of the neocortical algorithm that cause humans to differ from "hardcore optimizers" exist for safety-relevant reasons).
  • In general I feel frustrated with the focus in neuroscience on the implementational Marr Level, relative to the computational and algorithmic levels. I liked the mostly-computational overview here, and the algorithmic sketch in your Predictive Coding = RL + SL + Bayes + MPC post, but I feel bursting with implementational questions. For example:
    • As I understand it, you mention "PGM-type message passing" as a candidate class of algorithm that might perform the "select the best from a population of models" function. Do you just mean you suspect there is something in the general vicinity of a belief propagation algorithm going on here, or is your intuition more specific? If the latter, is the Dileep George paper the main thing motivating that intuition?
    • I don't currently know whether the neuroscience lit contains good descriptions of how credit assignment is implemented. Do you? Do you feel like you have a decent guess, or know whether someone else does?
      • I have the same question about whatever mechanism approximates Bayesian priors—I keep encountering vague descriptions of it being encoded in dopamine distributions, but I haven't found a good explanation of how that might actually work.
  • Are you sure PP deemphasizes the "multiple simultaneous generative models" frame? I understood the references to e.g. the "cognitive economy" in Surfing Uncertainty to be drawing an analogy between populations of individuals exchanging resources in a market, and populations of models exchanging prediction error in the brain.
  • Have you thought much about whether there are parts of this research you shouldn't publish? I notice feeling slightly nervous every time I see you've made a new post, I think because I basically buy the "safety and capabilities are in something of a race" hypothesis, and fear that succeeding at your goal and publishing about it might shorten timelines.

I feel confused about why, on this model, the researchers were surprised that this occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just weren't very familiar with AI. But looking at the author list, and their publications (e.g. 1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. Most of the co-authors are neuroscientists by training, but a few have CS degrees, and all but one have co-authored previous ML papers. It's hard for me to imagine their surprise was due to them lacking basic knowledge about RL?

Also, this OpenAI paper (whose authors seem quite familiar with ML), which the summary of Wang et al. on DeepMind's website describes as "closely related work" and which appears to me to involve a very similar setup, describes their result similarly:

We structure the agent as a recurrent neural network, which receives past rewards, actions, and termination flags as inputs in addition to the normally received observations. Furthermore, its internal state is preserved across episodes, so that it has the capacity to perform learning in its own hidden activations. The learned agent thus also acts as the learning algorithm, and can adapt to the task at hand when deployed.
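To make the setup in that passage concrete, here is a minimal sketch of the input wiring it describes, under loud assumptions: the class and function names (`RecurrentAgent`, `run_episodes`) are hypothetical, the network is a tiny Elman-style RNN with random, untrained weights, and the environment is a toy two-armed bandit. It is not the paper's implementation and it does not learn anything useful; it only shows that the previous action, previous reward, and termination flag are fed in alongside the observation, and that the hidden state is carried across episode boundaries while the weights stay fixed.

```python
import math
import random


class RecurrentAgent:
    """Hypothetical Elman-style RNN policy with fixed (untrained) weights."""

    def __init__(self, obs_dim, n_actions, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.n_actions = n_actions
        self.hidden_dim = hidden_dim
        # Input = observation + one-hot previous action + previous reward + done flag.
        self.in_dim = obs_dim + n_actions + 2

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = mat(hidden_dim, self.in_dim)
        self.w_rec = mat(hidden_dim, hidden_dim)
        self.w_out = mat(n_actions, hidden_dim)
        # Assumption mirroring the quote: this state is never reset between episodes.
        self.hidden = [0.0] * hidden_dim

    def build_input(self, obs, prev_action, prev_reward, done):
        # prev_action=None (no action taken yet) yields an all-zero one-hot.
        one_hot = [1.0 if a == prev_action else 0.0 for a in range(self.n_actions)]
        return list(obs) + one_hot + [float(prev_reward), 1.0 if done else 0.0]

    def step(self, obs, prev_action, prev_reward, done):
        x = self.build_input(obs, prev_action, prev_reward, done)
        new_hidden = []
        for i in range(self.hidden_dim):
            total = sum(w * v for w, v in zip(self.w_in[i], x))
            total += sum(w * h for w, h in zip(self.w_rec[i], self.hidden))
            new_hidden.append(math.tanh(total))
        self.hidden = new_hidden  # only activations change; weights are untouched
        logits = [sum(w * h for w, h in zip(row, self.hidden)) for row in self.w_out]
        return max(range(self.n_actions), key=lambda a: logits[a])


def run_episodes(agent, arm_probs, n_episodes=3, steps_per_episode=5, seed=1):
    """Run a toy bandit; return the hidden state seen at the start of each episode."""
    rng = random.Random(seed)
    prev_action, prev_reward, done = None, 0.0, False
    hidden_at_episode_start = []
    for _ in range(n_episodes):
        hidden_at_episode_start.append(list(agent.hidden))
        for t in range(steps_per_episode):
            action = agent.step([1.0], prev_action, prev_reward, done)
            reward = 1.0 if rng.random() < arm_probs[action] else 0.0
            done = t == steps_per_episode - 1  # flag is seen on the next step
            prev_action, prev_reward = action, reward
    return hidden_at_episode_start
```

Calling `run_episodes(RecurrentAgent(obs_dim=1, n_actions=2), [0.2, 0.8])` returns one hidden-state snapshot per episode. Only the first is all zeros, since later episodes inherit whatever state the previous one left behind. In the actual papers the weights would additionally be trained by an outer RL algorithm, which this sketch omits.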

As I understand it, the OpenAI authors also think they can gather evidence about the structure of the algorithm simply by looking at its behavior. Given a similar series of experiments (mostly bandit tasks, but also a maze solver), they conclude:

the dynamics of the recurrent network come to implement a learning algorithm entirely separate from the one used to train the network weights... the procedure the recurrent network implements is itself a full-fledged reinforcement learning algorithm, which negotiates the exploration-exploitation tradeoff and improves the agent’s policy based on reward outcomes... this learned RL procedure can differ starkly from the algorithm used to train the network’s weights.

They then run an experiment designed specifically to distinguish whether meta-RL was giving rise to a model-free system, or “a model-based system which learns an internal model of the environment and evaluates the value of actions at the time of decision-making through look-ahead planning,” and suggest the evidence implies the latter. This sounds like a description of search to me—do you think I'm confused?

I get the impression from your comments that you think it's naive to describe this result as "learning algorithms spontaneously emerging." You describe the lack of LW/AF pushback against that description as "a community-wide failure," and mention updating as a result toward thinking AF members “automatically believe anything written in a post without checking it.”

But my impression is that OpenAI describes their similar result in a similar way. Do you think my impression is wrong? Or that e.g. their description is also misleading?


I've been feeling very confused lately about how people talk about "search," and have started joking that I'm a search panpsychist. Lots of interesting phenomena look like piles of thermostats when viewed from the wrong angle, and I worry the conventional lens is deceptively narrow.

That said, when I condition on (what I understand to be) the conventional conception, it's difficult for me to imagine how e.g. the maze-solver described in the OpenAI paper can quickly and reliably locate maze exits, without doing something reasonably describable as searching for them.

And it seems to me that Wang et al. should be taken as evidence that "learning algorithms producing other search-performing learning algorithms" is convergently useful/likely to be a common feature of future systems, even if you don't think that's what happened in their paper, as long as you assign decent credence to their underlying model that this is what's going on in PFC, and that search occurs in PFC.

If the primary difference between the DeepMind and OpenAI meta-RL architectures and the PFC/DA architecture is scale, I think there's good reason to suspect something much like mesa-optimization will emerge in future meta-RL systems, even if it hasn't yet. That is, I interpret this result as evidence for the hypothesis that highly competent general-ish learners might tend to exhibit this feature, since (among other reasons) it increased my credence that it is already exhibited by the only existing member of that reference class.

Evan mentions agreeing that this result isn't new evidence in favor of mesa-optimization. But he also mentions that Risks from Learned Optimization references these two papers, and describes them as "the closest to producing mesa-optimizers of any existing machine learning research." I feel confused about how to reconcile these two claims. I didn't realize these papers were mentioned in Risks from Learned Optimization, but if I had, I think I would have been even more inclined to post this/try to ensure people knew about the results, since my (perhaps naive, perhaps not understanding ways this is disanalogous) prior is that the closest existing example to this problem might provide evidence about its nature or likelihood.

I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.

That said, I feel confused by a number of your arguments, so I'm working on a reply. Before I post it, I'd be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.

I currently understand you to be making four main claims:

  1. The system is just doing the totally normal thing “conditioning on observations,” rather than something it makes sense to describe as "giving rise to a separate learning algorithm."
  2. It is probably not the case that in this system, “learning is implemented in neural activation changes rather than neural weight changes.”
  3. The system does not encode a search algorithm, so it provides “~zero evidence” about e.g. the hypothesis that mesa-optimization is convergently useful, or likely to be a common feature of future systems.
  4. The above facts should be obvious to people familiar with ML.

Does this summary feel like it reasonably characterizes your objections?

I agree, in the case of evolution/humans. I meant to highlight what seemed to me like a relative lack of catastrophic within-mind inner alignment failures, e.g. due to conflicts between PFC and DA. Death of the organism feels to me like one reasonable way to operationalize "catastrophic" in these cases, but I can imagine other reasonable ways.

As I understand it, your point about the distinction between "mesa" and "steered" is chiefly that in the latter case, the inner layer is continually receiving reward signal from the outer layer, which in effect heavily restricts the space of possible algorithms the outer layer might give rise to. Does that seem like a decent paraphrase?

One of the aspects of Wang et al.'s paper that most interested me was that the inner layer in their meta-RL model kept learning even once reward signal from the outer layer had ceased. It feels plausible to me that the relationship between PFC and DA is reasonably describable as something like "subcortex-supervised learning," where PFC's input signals are "labeled" by the DA-supervisor. But it doesn't feel intuitively obvious to me that the portion of PFC input which might be labeled in this way is high (e.g., I feel unconfident about what portion of the concepts currently active in my working memory while writing this paragraph might be labeled by DA), nor that it much restricts the space of possible algorithms that can arise in PFC.

It could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don't regularly experience catastrophic inner alignment failures internally.

In practice I do suspect humans regularly experience internal inner alignment failures, but given that suspicion I feel surprised by how functional humans do manage to be. In other words, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.

The thing I meant by "catastrophic" is just "leading to death of the organism." I suspect mesa-optimization is common in humans, but I don't feel confident about this, nor that this is a joint-carvey ontology. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just "everyday personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures.

But I think these things don't kill people very often? People do sometimes choose to die because of beliefs. And anorexia sometimes kills people, which currently feels to me like the most straightforward candidate example I've considered.

I just feel like things could be a lot worse. For example, it could have been the case that mind-architectures that give rise to mesa-optimization at all simply aren't viable at high levels of optimization power—that it always kills them. Or that it basically always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don't think you see these things, so I'm curious how evolution prevented them.
