I recently watched Eliezer Yudkowsky's appearance on the Bankless podcast, where he argued that AI was nigh-certain to end humanity. Since the podcast, some commentators have offered pushback against the doom conclusion. However, one sentiment I saw was that optimists tended not to engage with the specific arguments pessimists like Yudkowsky offered.
Economist Robin Hanson points out that this pattern is very common for small groups which hold counterintuitive beliefs: insiders develop their own internal language, which skeptical outsiders usually don't bother to learn. Outsiders then make objections that focus on broad arguments against the belief's plausibility, rather than objections that focus on specific insider arguments.
As an AI "alignment insider" whose current estimate of doom is around 5%, I wrote this post to explain some of my many...
EY was not in fact bullish on neural networks leading to impressive AI capabilities. Eliezer said this directly:
I'm no fan of neurons; this may be clearer from other posts.
I think this is strong evidence for my interpretation of the quotes in my parent comment: He's not just mocking the local invalidity of reasoning "because humans have lots of neurons, AI with lots of neurons -> smart", he's also mocking neural network-driven hopes themselves.
More quotes from Logical or Connectionist AI?:
Not to mention that neural networks have also been "fai
From discussion with Logan Riggs (Eleuther) who worked on the tuned lens: the tuned lens suggests that the residual stream at different layers go through some linear transformations and so aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view 2) calculating virtual weights between neurons in different layers.
However, we could try correcting these using the transformations learned by the tuned lens to translate between the residual stream at different layers, ...
Maybe I am confused by what you mean by . I thought it was the state space, but that isn't consistent with in your post which was defined over ?
I'm not entirely sure what you mean by the state space. is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is , not . I slightly abused notation by defining in the parent comment. Let's say it's and is...
When we're trying to do AI alignment, we're often studying systems which don't yet exist. This is a pretty weird epistemic activity, and seems really hard to get right. This post offers one frame for thinking about what we're actually doing when we're thinking about AI alignment: using parts of the space of maps to reason about parts of the space of intelligent systems.
In this post, we:
We hope that the content is mostly the second kind of obvious: obvious once you see things in this way, which you maybe already do. In our experience, this comes with a risk: reading too fast, you may miss most of the nuance and useful insight...
Status: This was a response to a draft of Holden's cold take "AI safety seems hard to measure". It sparked a further discussion, that Holden recently posted a summary of.
The follow-up discussion ended up focusing on some issues in AI alignment that I think are underserved, which Holden said were kinda orthogonal to the point he was trying to make, and which didn't show up much in the final draft. I nevertheless think my notes were a fine attempt at articulating some open problems I see, from a different angle than usual. (Though it does have some overlap with the points made in Deep Deceptiveness, which I was also drafting at the time.)
I'm posting the document I wrote to Holden with only minimal editing, because it's been a few...
And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.
In case it's of any interest, I'll mention that when I "pump this intuition", I find myself thinking it essentially impossible to expect we could ever build a general agent that didn't notice that the world was round, and I'm unsure why (if I recall correctly) I sometimes I read Nate or Eliezer write that they think it's quit...
This post presents five closely-related ways to achieve proof-based cooperation without using Löb's theorem, and muses on legible cooperation in the real world.
(Edit: maybe they're closer to just-use-Löb's-theorem than I originally thought! See this comment. If these constructions somehow work better, I'm more confused than before about why.)
I'm writing this as a follow-up to Andrew Critch's recent post, to share more of my perspective on the subject.
We're going to dive straight into the weeds. (I'm planning to also write a more accessible explainer post soon.)
I claim the following are sufficient for robust cooperation:
tries to prove that , and tries to prove . The reason this works is that can prove that , i.e. only cooperates in ways...
Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.
(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)
As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5: