So8res — AI Alignment Forum

Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense

Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are. Okay, so you know...

Nov 24, 2023 · 213

AI as a science, and three obstacles to alignment strategies

AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition. Those scientists didn’t succeed, and those days are behind us. For most people working in AI today and dividing up their work hours...

Oct 25, 2023 · 198

Cosmopolitan values don't come free

Short version: if the future is filled with weird artificial and/or alien minds having their own sort of fun in weird ways that I might struggle to understand with my puny meat-brain, then I'd consider that a win. When I say that I expect AI to destroy everything we value,...

May 31, 2023 · 138

Sentience matters

Short version: Sentient lives matter; AIs can be people and people shouldn't be owned (and also the goal of alignment is not to browbeat AIs into doing stuff we like that they'd rather not do; it's to build them de-novo to care about valuable stuff). Context: Writing up obvious points...

May 29, 2023 · 145

But why would the AI kill us?

Status: Partially in response to We Don't Trade With Ants, partly in response to watching others try to make versions of this point that I didn't like. None of this is particularly new; it feels to me like repeating obvious claims that have regularly been made in comments elsewhere, and...

Apr 17, 2023 · 142

Misgeneralization as a misnomer

Here are two different ways an AI can turn out unfriendly: 1. You somehow build an AI that cares about "making people happy". In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it's more capable), it forcibly...

Apr 6, 2023 · 128

If interpretability research goes well, it may get dangerous

I've historically been pretty publicly supportive of interpretability research. I'm still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage...

Apr 3, 2023 · 203

Nate Soares

On how various plans miss the hard bits of the alignment challenge

Deep Deceptiveness

A central AI alignment problem: capabilities generalization, and the sharp left turn

Visible Thoughts Project and Bounty Announcement
