in part since I didn't see much disagreement.
FWIW, I appreciated that your curation notice explicitly includes the desire for more commentary on the results, and that curating it seems to have been a contributor to there being more commentary.
And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
Given karma inflation (as users gain more karma, their votes are worth more, but this doesn't propagate backwards to earlier votes they cast, and more people become AF voters than lose AF voter status), I think the karma differences between this post and these other 4 50+ karma posts [1 2 3 4] are basically noise. So I think the actual question is "is this post really in that tier?", to which "probably not" seems like a fair answer.
[I am thinking more about other points you've made, but it seemed worth writing a short reply on that point.]
This is extremely basic RL theory.
I note that this doesn't feel like a problem to me, mostly for reasons related to "Explainers Shoot High. Aim Low!" Even among ML experts, many haven't touched much RL, because they're focused on another field. Why expect them to know basic RL theory, or to have connected that to all the other things that they know?
More broadly, I don't understand what people are talking about when they speak of the "likelihood" of mesa optimization.
I don't think I have a fully crisp view of this, but here's my frame on it so far:
One view is that we design algorithms to do things, and those algorithms have properties that we can reason about. Another is that we design loss functions, and then search through random options for things that perform well on those loss functions. In the second view, often which options we search through doesn't matter very much, because there's something like the "optimal solution" that all things we actually find will be trying to approximate in one way or another.
Mesa-optimization is something like, "when we search through the options, will we find something that itself searches through a different set of options?". Some of those searches are probably benign--the bandit algorithm updating its internal value function in response to evidence, for example--and some of those searches are probably malign (or, at least, dangerous). In particular, we might think we have restrictions on the behavior of the base-level optimizer that turn out to not apply to any subprocesses it manages to generate, and so those properties don't actually hold overall.
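To make the benign case concrete, here is a minimal sketch of a bandit algorithm updating its internal value function in response to evidence. Everything in it (the class name `EpsilonGreedyBandit`, the epsilon-greedy rule, the Gaussian reward noise, the `run_bandit` helper) is an illustrative assumption on my part, not a description of any particular system discussed in the post.

```python
import random


class EpsilonGreedyBandit:
    """Toy bandit that keeps a running value estimate per arm."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.values = [0.0] * n_arms   # estimated mean reward per arm
        self.counts = [0] * n_arms     # number of pulls per arm
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self):
        # Explore with probability epsilon, otherwise pick the best-looking arm.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental sample mean: V <- V + (reward - V) / n
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def run_bandit(true_means, steps=2000, seed=0):
    """Hypothetical environment: each arm pays its mean plus unit Gaussian noise."""
    bandit = EpsilonGreedyBandit(len(true_means), seed=seed)
    env_rng = random.Random(seed + 1)
    for _ in range(steps):
        arm = bandit.choose()
        reward = true_means[arm] + env_rng.gauss(0.0, 1.0)
        bandit.update(arm, reward)
    return bandit
```

The only thing that changes at runtime is the table of estimates in `values`; whether that counts as "searching through options" is the sort of question where our concepts seem fuzzy.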
But it seems to me like overall we're somewhat confused about this. For example, the way I normally use the word "search", it doesn't apply to the bandit algorithm updating its internal value function. But does Abram's distinction between mesa-search and mesa-control actually mean much? There are lots of problems that you can solve exactly with calculus, and solve approximately with well-tuned simple linear estimators, and thus saying "oh, it can't do calculus, it can only do linear estimates" won't rule out it having a really good solution; presumably a similar thing could be true with "search" vs. "control," where in fact you might be able to build a pretty good search-approximator out of elements that only do control.
So, what would it mean to talk about the "likelihood" of mesa optimization? Well, I remember a few years back when there was a lot of buzz about hierarchical RL. That is, you would have something like a policy for which 'tactic' (or 'sub-policy' or whatever you want to call it) to deploy, and then each 'tactic' is itself a policy for what action to take. In 2015, it would have been sensible to talk about the 'likelihood' of RL models in 2020 being organized that way. (Even now, we can talk about the likelihood that models in 2025 will be organized that way!) But, empirically, this seems to have mostly not helped (at least as we've tried it so far).
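For readers who haven't seen the hierarchical setup, a rough sketch of the two-level structure is below. The toy state (a position on a line), the tactic names `approach` and `hold`, and the helper `make_hierarchical_policy` are all made up for illustration; real hierarchical RL methods learn both levels rather than hard-coding them.

```python
def make_hierarchical_policy(choose_tactic, tactics):
    """Compose a high-level tactic selector with low-level per-tactic policies."""
    def policy(state):
        name = choose_tactic(state)      # high level: which tactic to deploy
        action = tactics[name](state)    # low level: which action that tactic takes
        return name, action
    return policy


# Hypothetical toy problem: the state is a position on a line, the goal is 0.
def choose_tactic(state):
    return "approach" if abs(state) > 1 else "hold"


tactics = {
    "approach": lambda s: -1 if s > 0 else 1,  # step toward the origin
    "hold": lambda s: 0,                       # stay put once close enough
}

policy = make_hierarchical_policy(choose_tactic, tactics)
```

Calling `policy(5)` picks the `approach` tactic and then that tactic's action; the question in 2015 would have been whether trained models end up organized with this kind of explicit split.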
As we imagine deploying more complicated models, it feels like there are two broad classes of things that can happen during runtime: task 'location' and task 'learning'.
The two blur into each other; you can imagine training a model to deal with a range of situations, and yet it also performs well on situations not seen in training (that are interpolations between situations it has seen, or where the old abstractions apply correctly, and thus aren't "entirely new" situations). Just like some people argue that anything we know how to do isn't "artificial intelligence", you might get into a situation where anything we know how to do is task 'location' instead of task 'learning.'
But to the extent that our safety guarantees rely on the lack of capability in an AI system, any ability for the AI system to do learning instead of location means that it may gain capabilities we didn't expect it to have. That said, merely restricting it to 'location' may not help us very much, because if we misunderstand the abstractions that govern the system's generalizability, we may underestimate what capabilities it will or won't have.
There's clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn't been said.
I think people often underestimate the degree to which, if they want to see their opinions in a public forum, they will have to be the one to post them. This is both because some points are less widely understood than you might think, and because even if someone understands the point, that doesn't mean it connects to their interests in a way that would make them say anything about it.
The inner RL algorithm adjusts its learning rate to improve performance.
I have come across a lot of learning rate adjustment schemes in my time, and none of them have been 'obviously good', although I think some have been conceptually simple and relatively easy to find. If this is what's actually going on and can be backed out, it would be interesting to see what it's doing here (and whether that works well on its own).
This is more concerning than a thermostat-like bag of heuristics, because an RL algorithm is a pretty agentic thing, which can adapt to new situations and produce novel, clever behavior.
Most RL training algorithms that we have look to me like putting a thermostat on top of a model; I think you're underestimating deep thermostats.
Currently, my first-pass check for "is this probably a natural abstraction?" is "can humans usually figure out what I'm talking about from a few examples, without a formal definition?". For human values, the answer seems like an obvious "yes". For evolutionary fitness... nonobvious. Humans usually get it wrong without the formal definition.
Hmm, presumably you're not including something like "internal consistency" in the definition of 'natural abstraction'. That is, humans who aren't thinking carefully about something will think there's an imaginable object even if any attempts to actually construct that object will definitely lead to failure. (For example, Arrow's Impossibility Theorem comes to mind; a voting rule that satisfies all of those desiderata feels like a 'natural abstraction' in the relevant sense, even though there aren't actually any members of that abstraction.)
All natural selection does is gradient descent (hill climbing technically), with no capacity for lookahead.
I think if you're interested in the analysis and classification of optimization techniques, there's enough differences between what natural selection is doing and what deep learning is doing that it isn't a very natural analogy. (Like, one is a population-based method and the other isn't, the update rules are different, etc.)
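As a toy illustration of how different the update rules are, here is a sketch that minimizes the same one-dimensional loss two ways: gradient descent on a single point, and a population-based mutate-and-select loop that never looks at a gradient. The loss function, population size, mutation scale, and truncation-selection rule are all assumptions chosen for brevity; neither half is meant as a faithful model of deep learning or of biological evolution.

```python
import random


def loss(x):
    return (x - 3.0) ** 2


def grad(x):
    return 2.0 * (x - 3.0)


def gradient_descent(x0, lr=0.1, steps=100):
    """Single candidate, updated using the local gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def selection_search(x0, pop_size=20, sigma=0.5, generations=100, seed=0):
    """Population of candidates, updated by random mutation plus selection."""
    rng = random.Random(seed)
    population = [x0 + rng.gauss(0.0, sigma) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half (lower loss), discard the rest.
        population.sort(key=loss)
        survivors = population[: pop_size // 2]
        # Refill the population with mutated copies of survivors.
        children = [
            rng.choice(survivors) + rng.gauss(0.0, sigma)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=loss)
```

Both end up near the minimum at 3 on this problem, but one follows a gradient from a single point while the other only ever compares candidates against each other, which is part of why I'd treat them as different classes of method.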
thanks to the capped returns
Out of the various mechanisms, I think the capped returns are relatively low ranking; probably the top on my list is the nonprofit board having control over decision-making (and implicitly the nonprofit board's membership not being determined by investors, as would happen in a normal company).
I agree that adding economic incentives is dangerous by default, but think their safeguards are basically adequate to overcome that incentive pressure. At the time I spent an hour trying to come up with improvements to the structure, and ended up not thinking of anything. Also remember that this sort of change, even if it isn't a direct improvement, can be an indirect improvement by cutting off unpleasant possibilities; for example, before the move to the LP, there was some risk OpenAI would become a regular for-profit, and the LP move dramatically lowered that risk.
I also think for most of the things I'm concerned about, psychological pressure to think the thing isn't dangerous is more important; like, I don't think we're in the cigarette case where it's mostly other people who get cancer while the company profits; I think we're in the case where either the bomb ignites the atmosphere or it doesn't, and even in wartime the evidence was that people would abandon plans that posed a serious chance of destroying humanity.
Note also that economic incentives quite possibly push away from AGI towards providing narrow services (see Drexler's various arguments that AGI isn't economically useful, and so people won't make it by default). If you are more worried about companies that want to build AGIs and then ask them what to do than you are about companies that want to build AIs to accomplish specific tasks, increased short-term profit motive makes OpenAI more likely to move in the second direction. [I think this consideration is pretty weak but worth thinking about.]
Also apparently Megaman is less popular than I thought so I added links to the names.
This might result in a different stance toward OpenAI
But part of the problem here is that the question "what's the impact of our stance on OpenAI on existential risks?" is potentially very different from "is OpenAI's current direction increasing or decreasing existential risks?", and as people outside of OpenAI have much more control over their stance than they do over OpenAI's current direction, the first question is much more actionable. And so we run into the standard question substitution problems, where we might be pretending to talk about a probabilistic assessment of an org's impact while actually targeting the question of "how do I think people should relate to OpenAI?".
[That said, I see the desire to have clear discussion of the current direction, and that's why I wrote as much as I did, but I think it has prerequisites that aren't quite achieved yet.]