Your posts about the neocortex have been a plurality of the posts I've been most excited reading this year. I am super interested in the questions you're asking, and it has long driven me nuts that I don't find these questions asked often in the neuroscience literature.
But there's an aspect of these posts I've found frustrating, which is something like the ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."
Interestingly, I als...
Your posts about the neocortex have been a plurality of the posts I've been most excited reading this year.
Thanks so much, that really means a lot!!
...ratio of "listing candidate answers" to "explaining why you think those candidate answers are promising, relative to nearby alternatives."
I agree with "theories/frameworks relatively scarce". I don't feel like I have multiple gears-level models of how the brain might work, and I'm trying to figure out which one is right. I feel like I have zero, and I'm trying to grope my way towards one. It's almost more li...
Have you thought much about whether there are parts of this research you shouldn't publish?
Yeah, sure. I have some ideas about the gory details of the neocortical algorithm that I haven't seen in the literature. They might or might not be correct and novel, but at any rate, I'm not planning to post them, and I don't particularly care to pursue them, under the circumstances, for the reasons you mention.
Also, there was one post that I sent for feedback to a couple people in the community before posting, out of an abundance of caution. Neither person saw it a...
I feel confused about why, given your model of the situation, the researchers were surprised that this phenomenon occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just "weren't very familiar with AI." Looking at the author list, and at their publications (1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. While most of the eight co-authors are neuroscientists by training, three have CS degrees (one of whom is Demis Ha...
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
That said, I feel confused by a number of your arguments, so I'm working on a reply. Before I post it, I'd be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.
I currently understand you to be making four main claims:
Thanks. I know I came off pretty confrontational, sorry about that. I didn't mean to target you specifically; I really do see this as bad at the community level but fine at the individual level.
I don't think you've exactly captured what I meant; some comments below.
The system is just doing the totally normal thing “co
I agree, in the case of evolution/humans. In the text above, I meant to highlight what seemed to me like a relative lack of catastrophic *within-mind* inner alignment failures, e.g. due to conflicts between PFC and DA. Death of the organism feels to me like a reasonable way to operationalize "catastrophic" in these cases, but I can imagine other reasonable ways.
As I understand it, your point about the distinction between "mesa" and "steered" is chiefly that in the latter case, the inner layer is continually receiving reward signal from the outer layer, which in effect heavily restricts the space of possible algorithms the outer layer might give rise to. Does that seem like a decent paraphrase?
One of the aspects of Wang et al.'s paper that most interested me was that the inner layer in their meta-RL model kept learning even once reward signal from the outer layer had ceased. It seems reaso...
I mean, it could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don't regularly experience catastrophic inner alignment failures internally.
In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.
I don't know why you expect an inner alignment failure to look dysfunctional. Instrumental convergence suggests that it would look functional. What the world looks like if there...
The thing I meant by "catastrophic" is "leading to the death of the organism." I'm suspicious that mesa-optimization is common in humans, although I don't feel confident of that. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just everyday "personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures (although I can also imagine them bei...
Governments and corporations experience inner alignment failures all the time, but because of convergent instrumental goals, they are rarely catastrophic. For example, Russia underwent a revolution and a civil war on the inside, followed by purges and coups etc., but from the perspective of other nations, it was more or less still the same sort of thing: A nation, trying to expand its international influence, resist incursions, and conquer more territory. Even its alliances were based as much on expediency as on shared ideology.
Perhaps something similar happens with humans.