Ajeya Cotra

Wiki Contributions

Comments

Yeah I agree more of the value of this kind of exercise (at least within the community) is in revealing more granular disagreements about various things. But I do think there's value in establishing to more external people something high level like "It really could be soon and it's not crazy or sci fi to think so."

Can you say more about what particular applications you had in mind?

Stuff like personal assistants who write emails / do simple shopping, coding assistants that people are more excited about than they seem to be about Codex, etc.

(Like I said in the main post, I'm not totally sure what PONR refers to, but don't think I agree that the first lucrative application marks a PONR -- seems like there are a bunch of things you can do after that point, including but not limited to alignment research.)

I don't see it that way, no. Today's coding models can help automate some parts of the ML researcher workflow a little bit, and I think tomorrow's coding models will automate more and more complex parts, and so on. I think this expansion could be pretty rapid, but I don't think it'll look like "not much going on until something snaps into place."

(Coherence aside, when I now look at that number it does seem a bit too high, and I feel tempted to move it to 2027-2028, but I dunno, that kind of intuition is likely to change quickly from day to day.)

Hm, yeah, I bet if I reflected more things would shift around, but I'm not sure the fact that there's a shortish period where the per-year probability is very elevated followed by a longer period with lower per-year probability is actually a bad sign.

Roughly speaking, right now we're in an AI boom where spending on compute for training big models is going up rapidly, and it's fairly easy to actually increase spending quickly because the current levels are low. There's some chance of transformative AI in the middle of this spending boom -- and because resource inputs are going up a ton each year, the probability of TAI by date X would also be increasing pretty rapidly.

But the current spending boom is pretty unsustainable if it doesn't lead to TAI. At some point in the 2040s or 50s, if we haven't gotten transformative AI by then, we'll have been spending 10s of billions training models, and it won't be that easy to keep ramping up quickly from there. And then because the input growth will have slowed, the increase in probability from one year to the next will also slow. (That said, not sure how this works out exactly.)

Where does the selection come from? Will the designers toss a really impressive AI for not getting reward on that one timestep? I think not.

I was talking about gradient descent here, not designers.

It doesn't seem like it would have to prevent us from building computers if it has access to far more compute than we could access on Earth. It would just be powerful enough to easily defeat the kind of AIs we could train with the relatively meager computing resources we could extract from Earth. In general the AI is a superpower and humans are dramatically technologically behind, so it seems like it has many degrees of freedom and doesn't have to be particularly watching for this.

Neutralizing computational capabilities doesn't seem to involve total destruction of physical matter or human extinction though, especially for a very powerful being. Seems like it'd be basically just as easy to ensure we + future AIs we might train are no threat as it is to vaporize the Earth.

My answer is a little more prosaic than Raemon. I don't feel at all confident that an AI that already had God-like abilities would choose to literally kill all humans to use their bodies' atoms for its own ends; it seems totally plausible to me that -- whether because of exotic things like "multiverse-wide super-rationality" or "acausal trade" or just "being nice" -- the AI will leave Earth alone, since (as you say) it would be very cheap for it to do so.

The thing I'm referring to as "takeover" is the measures that an AI would take to make sure that humans can't take back control -- while it's not fully secure and doesn't have God-like abilities. Once a group of AIs have decided to try to get out of human control, they're functionally at war with humanity. Humans could do things like physically destroy the datacenters they're running on, and they would probably want to make sure they can't do that.

Securing AI control and defending from human counter-moves seems likely to involve some violence -- but it could be a scale of violence that's "merely" in line with historical instances where a technologically more advanced group of humans colonized or took control of a less-advanced group of humans; most historical takeovers don't involve literally killing every single member of the other group.

The key point is that it seems likely that AIs will secure the power to get to decide what happens with the future; I'm pretty unsure exactly how they use it, and especially if it involves physically destroying Earth / killing all humans for resources. These resources seem pretty meager compared to the rest of the universe.

I'm pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn't expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.

The main channel of value that I see for doing work like "learning to summarize" and the critiques project and various interpretability projects is something like "identifying a tech tree that it seems helpful to get as far as possible along by the Singularity, and beginning to climb that tech tree."

In the case of critiques -- ultimately, it seems like having AIs red team each other and pointing out ways that another AI's output could be dangerous seems like it will make a quantitative difference. If we had a really well-oiled debate setup, then we would catch issues we wouldn't have caught with vanilla human feedback, meaning our models could get smarter before they pose an existential threat -- and these smarter models can more effectively work on problems like alignment for us.[1]

It seems good to have that functionality developed as far as it can be developed in as many frontier labs as possible. The first steps of that look kind of boring, and don't substantially change our view of the problem. But first steps are the foundation for later steps, and the baseline against which you compare later steps. (Also every step can seem boring in the sense of bringing no game-changing insights, while nonetheless helping a lot.)

When the main point of some piece of work is to get good at something that seems valuable to be really good at later, and to build tacit knowledge and various kinds of infrastructure for doing that thing, a paper about it is not going to feel that enlightening to someone who wants high-level insights that change their picture of the overall problem. (Kind of like someone writing a blog post about how they developed effective management and performance evaluation processes at their company isn't going to provide much insight into the abstract theory of principal-agent problems. The value of that activity was in the company running better, not people learning things from the blog post about it.)

I'm still not sure how valuable I think this work is, because I don't know how well it's doing at efficiently climbing tech trees or at picking the right tech trees, but I think that's how I'd think about evaluating it.

[1] Or do a "pivotal act," though I think I probably don't agree with some of the connotations of that term.

Load More