All of Adam Scholl's Comments + Replies

My computational framework for the brain

Your posts about the neocortex have been a plurality of the posts I've been most excited reading this year. I am super interested in the questions you're asking, and it has long driven me nuts that I don't find these questions asked often in the neuroscience literature.

But there's an aspect of these posts I've found frustrating, which is something like the ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."

Interestingly, I als... (read more)

2Steve Byrnes7mo(Oops I just noticed that I had missed one of your questions in my earlier responses) I don't think there's anything to Bayesian priors beyond the general "society of compositional generative models" framework. For example, we have a prior that if someone runs towards a bird, it will fly away. There's a corresponding generative model: in that model, first there's a person running towards a bird, and then the bird is flying away. All of us have that generative model prominently in our brains, having seen it happen a bunch of times in the past. So when we see a person running towards a bird, that generative model gets activated, and it then sends a prediction that the bird is about to fly away. (Right? Or sorry if I'm misunderstanding your question.) (Not sure what you saw about dopamine distributions. I think everyone agrees that dopamine distributions are relevant to reward prediction, which I guess is a special case of a prior. I didn't think it was relevant for non-reward-related-priors, like the above prior above bird behavior, but I don't really know, I'm pretty hazy on my neurotransmitters, and each neurotransmitter seems to do lots of unrelated things.)

Your posts about the neocortex have been a plurality of the posts I've been most excited reading this year.

Thanks so much, that really means a lot!!

...ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."

I agree with "theories/frameworks relatively scarce". I don't feel like I have multiple gears-level models of how the brain might work, and I'm trying to figure out which one is right. I feel like I have zero, and I'm trying to grope my way towards one. It's almost more li... (read more)

Have you thought much about whether there are parts of this research you shouldn't publish?

Yeah, sure. I have some ideas about the gory details of the neocortical algorithm that I haven't seen in the literature. They might or might not be correct and novel, but at any rate, I'm not planning to post them, and I don't particularly care to pursue them, under the circumstances, for the reasons you mention.

Also, there was one post that I sent for feedback to a couple people in the community before posting, out of an abundance of caution. Neither person saw it a... (read more)

Matt Botvinick on the spontaneous emergence of learning algorithms

I feel confused about why, given your model of the situation, the researchers were surprised that this phenomenon occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just "weren't very familiar with AI." Looking at the author list, and at their publications (1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. While most of the eight co-authors are neuroscientists by training, three have CS degrees (one of whom is Demis Ha... (read more)

5Rohin Shah9moI think that's a fine characterization (and I said so in the grandparent comment? Looking back, I said I agreed with the claim that learning is happening via neural net activations, which I guess doesn't necessarily imply that I think it's a fine characterization). I think my original comment didn't do a great job of phrasing my objection. My actual critique is that the community as a whole seems to be updating strongly on data-that-has-high-probability-if-you-know-basic-RL. That was one of three possible explanations; I don't have a strong view on which explanation is the primary cause (if any of them are). It's more like "I observe clearly-to-me irrational behavior, this seems bad, even if I don't know what's causing it". If I had to guess, I'd guess that the explanation is a combination of readers not bothering to check details and those who are checking details not knowing enough to point out that this is expected. Indeed, I am also confused by this, as I noted in the original comment: I have a couple of hypotheses, none of which seem particularly likely given that the authors are familiar with AI, so I just won't speculate. I agree this is evidence against my claim that this would be obvious to RL researchers. Again, I don't object to the description of this as learning a learning algorithm. I object to updating strongly on this. Note that the paper does not claim their results are surprising -- it is written in a style of "we figured out how to make this approach work". (The DeepMind paper does claim that the results are novel / surprising, but it is targeted at a neuroscience audience, to whom the results may indeed be surprising.) On the search panpsychist view, my position is that if you use deep RL to train an AGI policy, it is definitionally a mesa optimizer. (Like, anything that is "generally intelligent" has the ability to learn quickly, which on the search panpsychist view means that it is a mesa optimizer.) So in this world, "likelihood of mesa
Matt Botvinick on the spontaneous emergence of learning algorithms

I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.

That said, I feel confused by a number of your arguments, so I'm working on a reply. Before I post it, I'd be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.

I currently understand you to be making four main claims:

  1. The
... (read more)
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.

Thanks. I know I came off pretty confrontational, sorry about that. I didn't mean to target you specifically; I really do see this as bad at the community level but fine at the individual level.

I don't think you've exactly captured what I meant, some comments below.

The system is just doing the totally normal thing “co
... (read more)
Matt Botvinick on the spontaneous emergence of learning algorithms

I agree, in the case of evolution/humans. In the text above, I meant to highlight what seemed to me like a relative lack of catastrophic *within-mind* inner alignment failures, e.g. due to conflicts between PFC and DA. Death of the organism feels to me like a reasonable way to operationalize "catastrophic" in these cases, but I can imagine other reasonable ways.

5Abram Demski9moI think it makes more sense to operationalize "catastrophic" here as "leading to systematically low DA reward", perhaps also including "manipulating the DA system in a clearly misaligned way". One way catastrophic alignment in this sense is difficult for humans is that the PFC cannot divorce itself from the DA; I'd expect that a failure mode leading to systematically low DA rewards would usually be corrected gradually, as the DA punishes those patterns. However, this is not really clear. The misaligned PFC might e.g. put itself in a local maximum, where it creates DA punishment for giving into temptation. (For example, an ascetic getting social reinforcement from a group of ascetics might be in such a situation.)
Matt Botvinick on the spontaneous emergence of learning algorithms

As I understand it, your point about the distinction between "mesa" and "steered" is chiefly that in the latter case, the inner layer is continually receiving reward signal from the outer layer, which in effect heavily restricts the space of possible algorithms the outer layer might give rise to. Does that seem like a decent paraphrase?

One of the aspects of Wang et al.'s paper that most interested me was that the inner layer in their meta-RL model kept learning even once reward signal from the outer layer had ceased. It seems reaso... (read more)

2Steve Byrnes9moYeah, that's part of it, but also I tend to be a bit skeptical that a performance-competitive optimizer will spontaneously develop, as opposed to being programmed—just as AlphaGo does MCTS because DeepMind programmed it to do MCTS, not because it was running a generic RNN that discovered MCTS. See also this [https://www.lesswrong.com/posts/SkcM4hwgH3AP6iqjs/can-you-get-agi-from-a-transformer] . Right now I'm kinda close to "More-or-less every thought I think has higher DA-related reward prediction than other potential thoughts I could have thought." But it's a vanishing fraction of cases where there is "ground truth" for that reward prediction that comes from outside of the neocortex. There is "ground truth" for things like pain and fear-of-heights, but not for thinking to yourself "hey, that's a clever turn of phrase" when you're writing. (The neocortex is the only place that understands language, in this example.) Ultimately I think everything has to come from subcortex-provided "ground truth" on what is or isn't rewarding, but the neocortex can get the idea that Concept X is an appropriate proxy / instrumental goal associated with some subcortex-provided reward, and then it goes and labels Concept X as inherently desirable, and searches for actions / thoughts that will activate Concept X. There's still usually some sporadic "ground truth", e.g. you have an innate desire for social approval and I think the subcortex has ways to figure out when you do or don't get social approval, so if your "clever turns of phrase" never impress anyone, you might eventually stop trying to come up with them. But if you're a hermit writing a book, the neocortex might be spinning for years treating "come up with clever turns of phrase" as an important goal, without any external subcortex-provided information to ground that goal. See here [https://www.lesswrong.com/posts/DWFx2Cmsvd4uCKkZ4/inner-alignment-in-the-brain] for more on this, if you're not sick of my endless self-citat
Matt Botvinick on the spontaneous emergence of learning algorithms

I mean, it could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don't regularly experience catastrophic inner alignment failures internally.

In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.

In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.

I don't know why you expect an inner alignment failure to look dysfunctional. Instrumental convergence suggests that it would look functional. What the world looks like if there... (read more)

Matt Botvinick on the spontaneous emergence of learning algorithms

The thing I meant by "catastrophic" is "leading to the death of the organism." I'm suspicious that mesa-optimization is common in humans, although I don't feel confident of that. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just everyday "personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomenon which might reasonably be described as inner alignment failures (although I can also imagine them bei... (read more)

1Raymond Arnold9moThis doesn't seem like what it should mean here. I'd think catastrophic in the context of "how humans (programmed by evolution) might fail by evolution's standards" should mean "start pursuing strategies that don't result in many children or longterm population success." (where premature death of the organism might be one way to cause that, but not the only way)
1Kaj Sotala9moAs I understand it, Wang et al. found that their experimental setup trained an internal RL algorithm that was more specialized for this particular task, but was still optimizing for the same task that the RNN was being trained on? And it was selected exactly because it did that very goal better. If the circumstances changed so that the more specialized behavior was no longer appropriate, then (assuming the RNN's weights hadn't been frozen) the feedback to the outer network would gradually end up reconfiguring the internal algorithm as well. So I'm not sure how it even could end up with something that's "unrecognizably different" from the base objective - even after a distributional shift, the learned objective would probably still be recognizable as a special case of the base objective, until it updated to match the new situation. The thing that I would expect to see from this description, is that humans who were e.g. practicing a particular skill might end up becoming overspecialized to the circumstances around that skill, and need to occasionally relearn things to fit a new environment. And that certainly does seem to happen. Likewise for more general/abstract skills, like "knowing how to navigate your culture/technological environment", where older people's strategies are often more adapted to how society used to be rather than how it is now - but still aren't incapable of updating. Catastrophic misalignment seems more likely to happen in the case of something like evolution, where the two learning algorithms operate on vastly different timescales, and it takes a very long time for evolution to correct after a drastic distributional shift. But the examples in Wang et al. lead me to think that in the brain, even the slower process operates on a timescale that's on the order of days rather than years, allowing for reasonably rapid adjustments in response to distributional shifts. (Though it's plausible that the more structure there is in a need of readjustment, t
3orthonormal9moThe claim that came to my mind is that the conscious mind is the mesa-optimizer here, the original outer optimizer being a riderless elephant.

Governments and corporations experience inner alignment failures all the time, but because of convergent instrumental goals, they are rarely catastrophic. For example, Russia underwent a revolution and a civil war on the inside, followed by purges and coups etc., but from the perspective of other nations, it was more or less still the same sort of thing: A nation, trying to expand its international influence, resist incursions, and conquer more territory. Even its alliances were based as much on expediency as on shared ideology.

Perhaps something similar happens with humans.