Alex Flint

Independent AI alignment researcher


The accumulation of knowledge


This is a relatively banal meta-commentary on reasons people sometimes give for doing worst-case analysis, and the differences between those reasons. The post reads like a list of things with no clear through-line. There is a gesture at an important idea from a Yudkowsky post (the logistic success curve idea) but the post does not helpfully expound that idea. There is a kind of trailing-off towards the end of the post as things like "planning fallacy" seem to have been added to the list with little time taken to place them in the context of the other things on the list. In the "differences between these arguments" section, the post doesn't clearly elucidate deep differences between the arguments; it just lists verbal responses that you might make if you are challenged on plausibility grounds in each case.

Overall, I felt that this post under-delivered on an important topic.

Many people believe that they already understand Dennett's intentional stance idea, and due to that will not read this post in detail. That is, in many cases, a mistake. This post makes an excellent and important point, which is wonderfully summarized in the second-to-last paragraph:

In general, I think that much of the confusion about whether some system that appears agent-y “really is an agent” derives from an intuitive sense that the beliefs and desires we experience internally are somehow fundamentally different from those that we “merely” infer and ascribe to systems we observe externally. I also think that much of this confusion dissolves with the realization that internally experienced thoughts, beliefs, desires, goals, etc. are actually “external” with respect to the parts of the mind that are observing them—including the part(s) of the mind that is modeling the mind-system as a whole as “being an agent” (or a “multiagent mind,” etc.). You couldn't observe thoughts (or the mind in general) at all if they weren't external to "you" (the observer), in the relevant sense.

The real point of the intentional stance idea is that there is no fact of the matter about whether something really is an agent, and that point is most potent when applied to ourselves. It is neither the case that we really truly are an agent, nor that we really truly are not an agent.

This post does an excellent job of highlighting this facet. However, I think this post could have been more punchy. There is too much meta-text of little value, like this paragraph:

In an attempt to be as faithful as possible in my depiction of Dennett’s original position, as well as provide a good resource to point back to on the subject for further discussion[1], I will err on the side of directly quoting Dennett perhaps too frequently, at least in this summary section.

In a post like this, do we need to be forewarned that the author will err perhaps too frequently on the side of directly quoting Dennett, at least in the summary section? No, we don't need to know that. In fact the post does not contain all that many direct quotes.

At the top of the "takeaways" section, the author gives the following caveat:

Editorial note: To be clear, these “takeaways” are both “things Dan Dennett is claiming about the nature of agency with the intentional stance” and “ideas I’m endorsing in the context of deconfusing agency for AI safety.”

The word "takeaways" in the heading already tells us that this section will contain points extracted by the reader that may or may not be explicitly endorsed by the original author. There is no need for extra caveats; they just lead to a bad reading experience.

In the comments section, Rohin makes the following very good point:

I mostly agree with everything here, but I think it is understating the extent to which the intentional stance is insufficient for the purposes of AI alignment. I think if you accept "agency = intentional stance", then you need to think "well, I guess AI risk wasn't actually about agency".

Although we can "see through" agency as not-an-ontologically-fundamental-thing, nevertheless we face the practical problem of what to do about the (seemingly) imminent destruction of the world by powerful AI. What actually should we do about that? The intentional stance not only fails to tell us what to do, it also fails to tell us how any approach to averting AI risk can co-exist with the powerful deconstruction of agency offered by the intentional stance idea itself. If agency is in the eye of the beholder, then... what? What do we actually do about AI risk?

Have you personally ever ridden in a robot car that has no safety driver?

This post consists of comments on summaries of a debate about the nature and difficulty of the alignment problem. The original debate was between Eliezer Yudkowsky and Richard Ngo but this post does not contain the content from that debate. This post consists mostly of commentary by Jaan Tallinn on that debate, with comments by Eliezer.

The post provides a kind of fascinating level of insight into true insider conversations about AI alignment. How do Eliezer and Jaan converse about alignment? Sure, this is a public setting, so perhaps they communicate differently in private. But still. Read the post and you kind of see the social dynamics between them. It's fascinating, actually.

Eliezer is just incredibly doom-y. He describes in fantastic detail the specific ways that a treacherous turn might play out, over dozens of paragraphs, 3 levels deep in a one-on-one conversation, in a document that merely summarizes a prior debate on the topic. He uses Capitalized Terms to indicate that things like "Doomed Phase" and "Terminal Phase" and "Law of Surprisingly Undignified Failure" are not merely for one time use but in fact refer to specific nodes in a larger conceptual framework.

One thing that happens often is that Jaan asks a question, Eliezer gives an extensive reply, and then Jaan responds that, no, he was actually asking a different question.

There is one point where Jaan describes his frustration over the years with mainstream AI researchers objecting to AI safety arguments as being invalid due to anthropomorphization, when in fact the arguments were not invalidly anthropomorphizing. There is a kind of gentle vulnerability in this section that is worth reading seriously.

There is a lot of swapping of models of others in and outside the debate. Everyone is trying to model everyone all the time.

Eliezer does unfortunately like to explicitly underscore his own brilliance. He says things like:

I consider all of this obvious as a convergent instrumental strategy for AIs. I could probably have generated it in 2005 or 2010 [...]

But it's clear enough that probably nobody was ever going to pass the validation set for generating lines of reasoning obvious enough to be generated by Eliezer in 2010 or possibly 2005

I do think that the content itself really comes down to the same basic question tackled in the original Hanson/Yudkowsky FOOM debate. I understand that this debate was ostensibly a broader question than FOOM. In practice I don't think this discourse has actually moved on much since 2008.

The main thing the FOOM debate is missing, in my opinion, is this: we have almost no examples of AI systems that can do meaningful sophisticated things in the physical world. Self-driving cars still aren't a reality. Walk around a city or visit an airport or drive down a highway, and you see shockingly few robots, and certainly no robots pursuing even the remotest kind of general-purpose tasks. Demo videos of robots doing amazing, scary, general-purpose things abound, but where are these robots in the real world? They are always just around the corner. Why?

Thanks - fixed! And thank you for the note, too.

Yeah it might just be a lack of training data in 10-second-or-less interactive instructions.

The thing I really wanted to test with this experiment was actually whether ChatGPT could engage with the real world using me as a guinea pig. The 10-second-or-less thing was just the format I used to try to "get at" the phenomenon of engaging with the real world. I'm interested in improving the format to more cleanly get at the phenomenon.

I do currently have the sense that it's more than just a lack of training data. I have the sense that ChatGPT has learned much less about how the world really works at a causal level than it appears from much of its dialog. Specifically, I have the sense that it has learned how to satisfy idle human curiosity using language, in a way that largely routes around a model of the real world, and especially routes around a model of the dynamics of the real world. That's my hypothesis -- I don't think this particular experiment has demonstrated it yet.

I asked a group of friends for "someone to help me with an AI experiment" and then I gave this particular friend the context that I wanted her help guiding me through a task via text message and that she should be in front of her phone in some room that was not the kitchen.

If you look at how ChatGPT responds, it seems to be really struggling to "get" what's happening in the kitchen -- it never really comes to the point of giving specific instructions, and especially never comes to the point of having any sense of the "situation" in the kitchen -- e.g. whether the milk is currently in the saucepan or not.

In contrast, my human friend did "get" this in quite a visceral way (it seems to me). I don't have the sense that this was due to out-of-band context but I'd be interested to retry the experiment with more carefully controlled context.

I'm very interested in Wei Dai's work, but I haven't followed closely in recent years. Any pointers to what I might read of his recent writings?

I do think Eliezer tackled this problem in the sequences, but I don't really think he came to an answer to these particular questions. I think what he said about meta-ethics is that it is neither that there is some measure of goodness to be found in the material world independent from our own minds, nor that goodness is completely open to be constructed based on our whims or preferences. He then says "well there just is something we value, and it's not arbitrary, and that's what goodness is", which is fine, except it still doesn't tell us how to find that thing or extrapolate it or verify it or encode it into an AI. So I think his account of meta-ethics is helpful but not complete.

Did you ever end up reading Reducing Goodhart?

Not yet, but I hope to, and I'm grateful to you for writing it.

processes for evolving humans' values that humans themselves think are good, in the ordinary way we think ordinary good things are good

Well, sure, but the question is whether this can really be done by modelling human values and then evolving those models. If you claim yes then there are several thorny issues to contend with, including what constitutes a viable starting point for such a process, what is a reasonable dynamic for such a process, and on what basis we decide the answers to these things.
