Even if BigCo senior management were virtuous and benevolent, and their workers were loyal and did not game the rules, the poor rules would still cause problems.
If BigCo senior management were virtuous and benevolent, would they have poor rules?
That is to say, when I put my Confucian hat on, the whole system of selecting managers based on a proxy measure that's gameable feels too Legalist. [The actual answer to my question is "getting rid of poor rules would be a low priority, because the poor rules wouldn't impede righteous conduct, but they still would try to get rid of them."]
Like, if I had to point at the difference between the two, the difference is where the put the locus of value. The Confucian ruler is primarily focused on making the state good, and surrounding himself with people who are primarily focused on making the state good. The Legalist ruler is primarily focused on surviving and thriving, and so tries to set up systems that cause people who are primarily focused on surviving and thriving to do the right thing. The Confucian imagines that you can have a large shared value; the Legalist imagines that you will necessarily have many disconnected and contradictory values.
The difference between hiring for regular companies and EA orgs seems relevant. Often, applicants for regular companies want the job, and standard practice is to attempt to trick the company into hiring them, regardless of qualification. Often, applicants for EA orgs only want the job if and only if they're the right person for the job; if I'm trying to prevent asteroids from hitting the Earth (or w/e) and someone else could do a better job of it than I could, I very much want to get out of their way and have them do it instead of me. As you mention in the post, this just means you get rid of the part of interviews where gaming is intentional, and significant difficulty remains. [Like, people will be honest about their weaknesses and try to be honest about their strengths, but accurately measuring those and fit with the existing team remains quite difficult.]
Now, where they're trying to put the locus of value doesn't mean their policy prescriptions are helpful. As I understand the Confucian focus on virtue in the leader, the main value is that it's really hard to have subordinates who are motivated by the common good if you yourself are selfish (both because they won't have your example and because the people who are motivated by the common good will find it difficult to be motivated by working for you).
But I find myself feeling some despair at the prospect of a purely Legalist approach to AI Alignment, because it feels like it is fighting against the AI at every step, instead of being able to recruit it to do some of the work for you, and without that last bit I'm not sure how you get extrapolation instead of interpolation. Like, you can trust the Confucian to do the right thing in novel territory, insofar as you gave them the right underlying principles, and the Confucian is operating at a philosophical level where you can give them concepts like corrigibility (where they not only want to accept correction from you, but also want to preserve their ability to accept correction for you, and preserve their preservation of that ability, and so on) and the map-territory distinction (where they want their sensors to be honest, because in order to have lots of strawberries they need their strawberry-counter to be accurate instead of inaccurate). In Legalism, the hope is that the overseer can stay a step ahead of their subordinate; in Confucianism, the hope is that everyone can be their own overseer.[Of course, defense in depth is useful; it's good to both have trust in the philosophical competence of the system and have lots of unit tests and restrictions in case you or it are confused.]
"Completely adversarial" also better captures the strange feature of zero-sum games where doing damage to your opponent, by the nature of it being zero-sum, necessarily means improving your satisfaction, which is a very narrow class of situations.
We could then back out what a rational firm should be willing to invest.
This makes sense, altho I note that I expect the funding here to quite plausibly be 'irrational.' For example, some substantial fraction of Microsoft's value captured is going to global development in a way that seems unlikely to make sense from Microsoft's bottom line (because Microsoft enriched one of its owners, who then decided to deploy those riches for global development). If building TAI comes out of the 'altruism' or 'exploration' budget instead of the 'we expect this to pay back on schedule' budget, you could see more investment than that last category would justify.
Part 1 page 15 talks about "spending on computation", and assumes spending saturates at 1% of the GDP of the largest country. This seems potentially odd to me; quite possibly the spending will be done by multinational corporations that view themselves as more "global" than "American" or "British" or whatever, and whose fortunes are more tied to the global economy than to the national economy. At most this gives you a factor of 2-3 doublings, but that's still 4-6 years on a 2-year doubling time.
Overall I'm not sure how much to believe this hypothesis; my mainline prediction is that corporations grow in power and rootlessness compared to nation-states, but it also seems likely that bits of the global economy will fracture / there will be a push to decentralization over centralization, where (say) Alphabet is more like "global excluding China, where Baidu is supreme" than it is "global." In that world, I think you still see approximately a 4x increase.
I also don't have a great sense how we should expect the 'ability to fund large projects' to compare between the governments of the past and the megacorps of the future; it seems quite plausible to me that Alphabet, without pressure to do welfare spending / fund the military / etc. could put a much larger fraction of its resources towards building TAI, but also presumably this means Alphabet has many fewer resources than the economy as a whole (because there still will be welfare spending and military funding and so on), and on net this probably works out to 1% of total gdp available for megaprojects.
Thanks for sharing this draft! I'm going to try to make lots of different comments as I go along, rather than one huge comment.
[edit: page 10 calls this the "most important thread of further research"; the downside of writing as I go! For posterity's sake, I'll leave the comment.]
Pages 8 and 9 of part 1 talk about "effective horizon length", and make the claim:
Prima facie, I would expect that if we modify an ML problem so that effective horizon length is doubled (i.e, it takes twice as much data on average to reach a certain level of confidence about whether a perturbation to the model improved performance), the total training data required to train a model would also double. That is, I would expect training data requirements to scale linearly with effective horizon length as I have defined it.
I'm curious where 'linearly' came from; my sense is that "effective horizon length" is the equivalent of "optimal batch size", which I would have expected to be a weirder function of training data size than 'linear'. I don't have a great handle on the ML theory here, tho, and it might be substantially different between classification (where I can make batch-of-the-envelope estimates for this sort of thing) and RL (where it feels like it's a component of a much trickier system with harder-to-predict connections).
Quite possibly you talked with some ML experts and their sense was "linearly", and it makes sense to roll with that; it also seems quite possible that the thing to do here is have uncertainty over functional forms. That is, maybe the effective horizon scales linearly, or maybe it scales exponentially, or maybe it scales logarithmically, or inverse square root, or whatever. This would help double-check that the assumption of linearity isn't doing significant work, and if it is, point to a potentially promising avenue of theoretical ML research.
[As a broader point, I think this 'functional form uncertainty' is a big deal for my timelines estimates. A lot of people (rightfully!) dismissed the standard RL algorithms of 5 years ago for making AGI because of exponential training data requirements, but my sense is that further algorithmic improvement is mostly not "it's 10% faster" but "the base of the exponent is smaller" or "it's no longer exponential.", which might change whether or not it makes sense to dismiss it.]
A simple, well-funded example is autonomous vehicles, which have spent considerably more than the training budget of AlphaStar, and are not there yet.
I am aware of other examples that do seem to be happening, but I'm not sure what the cutoff for 'massive' should be. For example, a 'call center bot' is moderately valuable (while not nearly as transformative as autonomous vehicles), and I believe there are many different companies attempting to do something like that, altho I don't know how their total ML expenditure compared to AlphaStar's. (The company I'm most familiar with in this space, Apprente, got acquired by McDonalds last year, who I presume is mostly interested in the ability to automate drive-thru orders.)
Another example that seems relevant to me is robotic hands (plus image classification) at sufficient level that warehouse pickers could be replaced by robots.
in part since I didn't see much disagreement.
FWIW, I appreciated that your curation notice explicitly includes the desire for more commentary on the results, and that curating it seems to have been a contributor to there being more commentary.
And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
Given karma inflation (as users gain more karma, their votes are worth more, but this doesn't propagate backwards to earlier votes they cast, and more people become AF voters than lose AF voter status), I think the karma differences between this post and these other 4 50+ karma posts [1 2 3 4] are basically noise. So I think the actual question is "is this post really in that tier?", to which "probably not" seems like a fair answer.
[I am thinking more about other points you've made, but it seemed worth writing a short reply on that point.]
This is extremely basic RL theory.
I note that this doesn't feel like a problem to me, mostly because of reasons related to Explainers Shoot High. Aim Low!. Even among ML experts, many of them haven't touched much RL, because they're focused on another field. Why expect them to know basic RL theory, or to have connected that to all the other things that they know?
More broadly, I don't understand what people are talking about when they speak of the "likelihood" of mesa optimization.
I don't think I have a fully crisp view of this, but here's my frame on it so far:
One view is that we design algorithms to do things, and those algorithms have properties that we can reason about. Another is that we design loss functions, and then search through random options for things that perform well on those loss functions. In the second view, often which options we search through doesn't matter very much, because there's something like the "optimal solution" that all things we actually find will be trying to approximate in one way or another.
Mesa-optimization is something like, "when we search through the options, will we find something that itself searches through a different set of options?". Some of those searches are probably benign--the bandit algorithm updating its internal value function in response to evidence, for example--and some of those searches are probably malign (or, at least, dangerous). In particular, we might think we have restrictions on the behavior of the base-level optimizer that turn out to not apply to any subprocesses it manages to generate, and so those properties don't actually hold overall.
But it seems to me like overall we're somewhat confused about this. For example, the way I normally use the word "search", it doesn't apply to the bandit algorithm updating its internal value function. But does Abram's distinction between mesa-search and mesa-control actually mean much? There's lots of problems that you can solve exactly with calculus, and solve approximately with well-tuned simple linear estimators, and thus saying "oh, it can't do calculus, it can only do linear estimates" won't rule out it having a really good solution; presumably a similar thing could be true with "search" vs. "control," where in fact you might be able to build a pretty good search-approximator out of elements that only do control.
So, what would it mean to talk about the "likelihood" of mesa optimization? Well, I remember a few years back when there was a lot of buzz about hierarchical RL. That is, you would have something like a policy for which 'tactic' (or 'sub-policy' or whatever you want to call it) to deploy, and then each 'tactic' is itself a policy for what action to take. In 2015, it would have been sensible to talk about the 'likelihood' of RL models in 2020 being organized that way. (Even now, we can talk about the likelihood that models in 2025 will be organized that way!) But, empirically, this seems to have mostly not helped (at least as we've tried it so far).
As we imagine deploying more complicated models, it feels like there are two broad classes of things that can happen during runtime:
The two blur into each other; you can imagine training a model to deal with a range of situations, and yet it also performs well on situations not seen in training (that are interpolations between situations it has seen, or where the old abstractions apply correctly, and thus aren't "entirely new" situations). Just like some people argue that anything we know how to do isn't "artificial intelligence", you might get into a situation where anything we know how to do is task 'location' instead of task 'learning.'
But to the extent that our safety guarantees rely on the lack of capability in an AI system, any ability for the AI system to do learning instead of location means that it may gain capabilities we didn't expect it to have. That said, merely restricting it to 'location' may not help us very much, because if we misunderstand the abstractions that govern the system's generalizability, we may underestimate what capabilities it will or won't have.
There's clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn't been said.
I think people often underestimate the degree to which, if they want to see their opinions in a public forum, they will have to be the one to post them. This is both because some points are less widely understood than you might think, and because even if the someone understands the point, that doesn't mean it connects to their interests in a way that would make them say anything about it.
The inner RL algorithm adjusts its learning rate to improve performance.
I have come across a lot of learning rate adjustment schemes in my time, and none of them have been 'obviously good', altho I think some have been conceptually simple and relatively easy to find. If this is what's actually going on and can be backed out, it would be interesting to see what it's doing here (and whether that works well on its own).
This is more concerning than a thermostat-like bag of heuristics, because an RL algorithm is a pretty agentic thing, which can adapt to new situations and produce novel, clever behavior.
Most RL training algorithms that we have look to me like putting a thermostat on top of a model; I think you're underestimating deep thermostats.