Nate Soares explains why he doesn't expect an unaligned AI to be friendly or cooperative with humanity, even if it uses logical decision theory. He argues that even getting a small fraction of resources from such an AI is extremely unlikely.
This post starts out pretty gloomy but ends up with some points that I feel pretty positive about. Day to day, I'm more focussed on the positive points, but awareness of the negative has been crucial to forming my priorities, so I'm going to start with those. I'm mostly addressing the EA community here, but hopefully this post will be of some interest to LessWrong and the Alignment Forum as well.
I think AGI is going to be developed soon, and quickly. Possibly (20%) that's next year, and most likely (80%) before the end of 2029. These are not things you need to believe for yourself in order to understand my view, so no worries if you're not personally convinced of this.
(For what...
Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.
Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.
All of these hypotheses share an important justification: an AI with each of these motivations would exhibit behavior that is highly fit under reinforcement learning.
This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns...
Thanks for the feedback! I partially agree with your thoughts overall.
All three categories of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
This is technically true, though I think that schemers are far more dangerous than fitness-seekers. IMO, a fitness-seeker would more likely than not behave similarly in deployment as in training, and its misaligned preferences are likely more materially and temporally bounded. Meanwhile, misaligned schemers seem basically worst-case likely to take over. Even if you...
New research from the GDM mechanistic interpretability team. Read the full paper on arxiv or check out the twitter thread.
Abstract
...Building reliable deception detectors for AI systems—methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence—would be valuable in mitigating risks from advanced AI systems. But evaluating the reliability and efficacy of a proposed deception detector requires examples that we can confidently label as either deceptive or honest. We argue that we currently lack the necessary examples and further identify several concrete obstacles in collecting them. We provide evidence from conceptual arguments, analysis of existing empirical works, and analysis of novel illustrative case studies. We also discuss the potential of several proposed empirical workarounds to these problems and argue that while
[This has the same content as my shortform here; sorry for double-posting, I didn't see this LW post when I posted the shortform.]
Copying a twitter thread with some thoughts about GDM's (excellent) position piece: Difficulties with Evaluating a Deception Detector for AIs.
Research related to detecting AI deception has a bunch of footguns. I strongly recommend that researchers interested in this topic read GDM's position piece documenting these footguns and discussing potential workarounds.
More reactions in
-
First, it's worth saying that I've found making pr...
To quickly transform the world, it's not enough for AI to become super smart (the "intelligence explosion").
AI will also have to turbocharge the physical world (the "industrial explosion"). Think robot factories building more and better robot factories, which build more and better robot factories, and so on.
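The feedback loop described above can be sketched as a toy model: factory capacity grows in proportion to itself (factories build factories), while productivity compounds (each generation of factories is better than the last). All parameter names and values here are purely illustrative assumptions, not anything from the post:

```python
def simulate(years, initial_factories=1.0, build_rate=0.5, productivity_growth=1.2):
    """Toy 'industrial explosion' loop: each year, existing capacity builds
    new capacity at build_rate * productivity, and productivity itself
    compounds by productivity_growth (better factories build better factories).
    Returns a list of total capacity after each year."""
    capacity = initial_factories
    productivity = 1.0
    history = []
    for _ in range(years):
        capacity += build_rate * productivity * capacity  # factories build factories
        productivity *= productivity_growth               # ...and better factories
        history.append(capacity)
    return history

print(simulate(5))
```

Because the growth *rate* itself rises each step, the result is faster-than-exponential growth — the qualitative point, not the numbers, is what matters here.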
The dynamics of the industrial explosion have gotten remarkably little attention.
This post lays out how the industrial explosion could play out, and how quickly it might happen.
We think the industrial explosion will unfold in three stages:
(I think Epoch's paper on this takes a different approach and suggests an outside view of hyperbolic growth lasting for ~1.5 OOMs without bottlenecks, because that was the amount of growth between the agricultural revolution and when population started being the bottleneck. That feels weaker to me than looking at more specific hypotheses of bottlenecks, and I do think Epoch's overall view is that it'll likely be more than 1.5 OOMs. But I wanted to flag it as another option for an outside-view estimate.)
I didn't believe the theory of change at the time and still don't. The post doesn't really make a full case for it, and I doubt it convinced anyone to work on this for the right reasons.
- No BOTEC
- Models of social dynamics are handwavy
- In alignment and other safety work, we have force multipliers like neglectedness and influencing government. What's the multiplier here? Or is there no intervention that has multiplier effects in the long-term multipolar risk space?
- Why wouldn't the effect just get swamped by the dozens of medical AI startups that are rais
... (read more)