Steve Byrnes

I'm an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms.



Goodhart: Endgame


I guess I was just thinking, sometimes every option is out-of-distribution, because the future is different from the past, especially when we want AGIs to invent new technologies etc.

I agree that adversarially-chosen OOD hypotheticals are very problematic.

I think Stuart Armstrong thinks the end goal has to be a utility function because utility-maximizers are in reflective equilibrium in a way that other systems aren't; he talks about that here.

Goodhart: Endgame

I'm gonna try to summarize and then you can tell me what I'm missing:

  • In weird out-of-distribution situations, my preferences / values are ill-defined
  • We can operationalize that by having an ensemble of models of my preferences / values, and seeing that they give different, mutually-incompatible predictions in these weird out-of-distribution situations
  • One thing we can do to help is set up our AI to avoid taking us into weird out-of-distribution situations where my preferences are ill-defined.
  • Another thing we can do to help is have meta-preferences about how to deal with situations where my preferences are ill-defined, and have the AI learn those meta-preferences.
  • Another thing is, we implicitly trust our own future preferences in weird out-of-distribution situations, because what else can we do? So we can build an AI that we trust for a similar reason: either (A) it's transparent, and we train it to do human-like things for human-like reasons, or (B) it's trained to imitate human cognition.

Is that fair? I'm not agreeing or disagreeing, just parsing.

I'd also be interested in a compare/contrast with, say, this Stuart Armstrong post.
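(As a toy illustration of the ensemble idea in the second bullet above: my own sketch, not from the original post, with all names and numbers made up. An ensemble of preference models fit to the same in-distribution data will agree there, while their idiosyncratic extrapolations diverge out-of-distribution, and the spread of their predictions can flag "my preferences are ill-defined here".)

```python
import random

def make_model(seed):
    """One member of a toy 'ensemble of preference models'."""
    rng = random.Random(seed)
    # All models share the same in-distribution training data...
    table = {x: x * 2 for x in range(10)}
    # ...but each extrapolates idiosyncratically out-of-distribution.
    noise = rng.uniform(-5, 5)

    def score(situation):
        if situation in table:
            return table[situation]      # in-distribution: shared answer
        return situation * 2 + noise     # OOD: model-specific guess

    return score

ensemble = [make_model(seed) for seed in range(5)]

def disagreement(situation):
    """Spread of the ensemble's predictions; large spread = ill-defined preference."""
    scores = [model(situation) for model in ensemble]
    return max(scores) - min(scores)

print(disagreement(3))   # prints 0: in-distribution, models agree
print(disagreement(42))  # positive: out-of-distribution, models diverge
```

The point of the sketch is just the operationalization: "weird OOD situation" becomes "situation where ensemble disagreement is large", which an AI could in principle measure and steer away from.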

Models Modeling Models

Hmm. I think you missed my point…

There are two different activities:

ACTIVITY A: Think about how an AI will form a model of what a human wants and is trying to do.

ACTIVITY B: Think about the gears underlying human intelligence and motivation.

You're doing Activity A every day. I'm doing Activity B every day.

My comment was trying to say: "The people like you, doing Activity A, may talk about there being multiple models which tend to agree in-distribution but not OOD. Meanwhile, the people like me, doing Activity B, may talk about subagents. There's a conceptual parallel between these two different discussions."

And I think you thought I was saying: "We both agree that the real ultimate goal right now is Activity A. I'm leaving a comment that I think will help you engage in Activity A, because Activity A is the thing to do. And my comment is: (something about humans having subagents)."

Does that help?

How To Get Into Independent Research On Alignment/Agency

the sort of person who this post is already aimed at (i.e. people who are excited to forge their own path in a technical field where everyone is fundamentally confused) is probably not the sort of person who is aiming for minor contributions anyway.

For me, there were two separate decisions. (1) Around March 2019, having just finished my previous intense long-term internet hobby, I figured my next intense long-term internet hobby was gonna be AI alignment; (2) later on, around June 2020, I started trying to get funding for full-time independent work. (I couldn't work at an org because I didn't want to move to a different city.)

I want to emphasize that at the earlier decision-point, I was absolutely "aiming for minor contributions". I didn't have great qualifications, or familiarity with the field, or a lot of time. But I figured that I could eventually get to a point where I could write helpful comments on other people's blog posts. And that would be my contribution!

Well, I also figured I should be capable of pedagogy and outreach. And that was basically the first thing I did—I wrote a little talk summarizing the field for newbies, and gave it to one audience, and tried and failed to give it to a second audience.

(I find it a lot easier to "study topic X, in order to do Y with that knowledge", compared to "study topic X" full stop. Just starting out on my new hobby, I had no Y yet, so "giving a pedagogical talk" was an obvious-to-me choice of Y.)

Then I had some original ideas! And blogged about them. But they turned out to be bad.

Then I had different original ideas! And blogged about them in my free time for like a year before I applied for LTFF.

…and they rejected me. On the plus side, their rejection came with advice about exactly what I was missing if I wanted to reapply. On the minus side, the advice was pretty hard to follow, given my time constraints. So I started gradually chipping away at the path towards getting those things done. But luckily I wound up getting a different grant a few months later (yay).

With that background, a few comments on the post:

I wrote a fair bit on LessWrong, and researched some agency problems, even before quitting my job. I do expect it helps to “ease into it” this way, and if you’re coming in fresh you should probably give yourself extra time to start writing up ideas, following the field, and getting feedback.

I also went down the "ease into it" path. It's especially (though not exclusively) suitable for people like me who are OK with long-term intense internet hobbies. (AI alignment was my 4th long-term intense internet hobby in my lifetime. Probably last. They are frankly pretty exhausting, especially with a full-time job and kids.)

Probably the most common mistake people make when first attempting to enter the alignment/agency research field is to not have any model at all of the main bottlenecks to alignment, or how their work will address those bottlenecks.

Just to clarify:

This quote makes sense to me if you read "when first attempting to enter the field" as meaning "when first attempting to enter the field as a grant-funded full-time independent researcher".

On the other hand, when you're first attempting to learn about and maybe dabble in the field, well obviously you won't have a good model of the field yet.

One more thing:

the sort of person who this post is already aimed at (i.e. people who are excited to forge their own path in a technical field where everyone is fundamentally confused) is probably not the sort of person who is aiming for minor contributions anyway.

If you're a kinda imposter-syndrome-y person who just constitutionally wouldn't dream of looking themselves in the mirror and saying "I am aiming for a major contribution!", well me too, and don't let John scare you off. :-P

I can attest that it’s an awesome job.

I agree!

Ngo and Yudkowsky on alignment difficulty

I agree that that is another failure mode. (And there are yet other failure modes too—e.g. instead of printing the nanobot plan, it prints "Help me I'm trapped in a box…" :-P. I apologize for sloppy wording that suggested the two things I mentioned were the only two problems.)

I disagree about "more central". I think that's basically a disagreement on the question of "what's a bigger deal, inner misalignment or outer misalignment?" with you voting for "outer" and me voting for "inner, or maybe tie, I dunno". But I'm not sure it's a good use of time to try to hash out that disagreement. We need an alignment plan that solves all the problems simultaneously. Probably different alignment approaches will get stuck on different things.

Ngo and Yudkowsky on alignment difficulty

Speaking for myself here…

OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to an AI similar to mine. Then there are two types of accident risk that I need to worry about.

First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.

Alternatively, maybe we're able to thoroughly understand the plan once we see it; we're just too stupid to come up with it ourselves. That seems awfully fraught—I'm not sure how we could be so confident that we can tell apart nanobot plans from booby-trap plans. But let's assume that's possible for the sake of argument, and then move on to the other type of accident risk:

Second, I need to worry that the AI will start running, and I think it's coming up with a nanobot plan, but actually it's hacking its way out of its box and taking over the world.

How and why might that happen?

I would say that if a nanobot plan is very hard to create—requiring new insights etc.—then the only way to do it is to construct an agent-like thing that is trying to create the nanobot plan.

The agent-like thing would have some kind of action space (e.g. it can choose to summon a particular journal article to re-read, or it can choose to think through a certain possibility, etc.), and it would have some kind of capability of searching for and executing plans (specifically, plans-for-how-to-create-the-nanobot-plan), and it would have a capability of creating and executing instrumental subgoals (e.g. go on a side-quest to better understand boron chemistry) and plausibly it needs some kind of metacognition to improve its ability to find subgoals and take actions.

Everything I mentioned is an "internal" plan or an "internal" action or an "internal" goal, not involving "reaching out into the world" with actuators and internet access and nanobots etc.

If only the AI would stick to such "internal" consequentialist actions (e.g. "I will read this article to better understand boron chemistry") and not engage in any "external" consequentialist actions (e.g. "I will seize more computer power to better understand boron chemistry"), well then we would have nothing to worry about! Alas, so far as I know, nobody knows how to make a powerful AI agent that would definitely always stick to "internal" consequentialism.
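(To make the internal/external distinction concrete, here is a toy sketch of my own, with all action names hypothetical. It only shows what the constraint would *mean*; the open problem in the paragraph above is that nobody knows how to guarantee a powerful learned agent actually respects such a partition, rather than hand-enforcing it on a whitelist.)

```python
# Hypothetical partition of an agent's action space.
INTERNAL = {"reread_article", "think_through_case", "spawn_subgoal"}
EXTERNAL = {"seize_compute", "access_internet", "deploy_nanobots"}

def safe_step(action):
    """Execute an action only if it is 'internal' consequentialism."""
    if action in EXTERNAL:
        # The whole worry: a capable planner may find external actions
        # instrumentally useful, and we can't rely on catching them.
        raise RuntimeError(f"blocked external action: {action}")
    if action not in INTERNAL:
        raise ValueError(f"unknown action: {action}")
    return f"executed internal action: {action}"

print(safe_step("reread_article"))
```

Again, this filter is trivial precisely because the action labels are given; the unsolved part is getting that guarantee for actions an agent invents itself.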

Discussion with Eliezer Yudkowsky on AGI interventions

An example that springs to my mind: Abram wrote a blog post in 2018 mentioning the "easy problem of wireheading". He described both the problem and its solution in like one sentence, and then immediately moved on to the harder problems.

Later on, DeepMind did an experiment that (in my assessment) mostly just endorsed what Abram said as being correct.

For the record, I don't think that particular DeepMind experiment was of zero value, for various reasons. But at the same time, I think that Abram wins hands-down on the metric of "progress towards AI alignment per researcher-hour", and this is true at both the production and consumption end (I can read Abram's one sentence much much faster than I can skim the DeepMind paper).

If we had a plausible-to-me plan that gets us to safe & beneficial AGI, I would be really enthusiastic about going back and checking all the assumptions with experiments. That's how you shore up the foundations, flesh out the details, start developing working code and practical expertise, etc. etc. But I don't think we have such a plan right now.

Also, there are times when it's totally unclear a priori what an algorithm will do just by thinking about it, and then obviously the experiments are super useful.

But at the end of the day, I feel like there are experiments that are happening not because it's the optimal thing to do for AI alignment, but rather because there are very strong pro-experiment forces that exist inside CS / ML / AI research in academia and academia-adjacent labs.

Discussion with Eliezer Yudkowsky on AGI interventions

if EY and other MIRI people who are very dubious of most alignment research could give more feedback on that and enter the dialogue, maybe by commenting more on the AF. My problem is not so much with them disagreeing with most of the work, it's about the disagreement stopping at "that's not going to work" and not having dialogue and back and forth.

Just in case anyone hasn't already seen these, EY wrote Challenges to Christiano’s capability amplification proposal and this comment (that I already linked to in a different comment on this page) (also has a reply thread), both in 2018. Also The Rocket Alignment Problem.

Discussion with Eliezer Yudkowsky on AGI interventions

Eliezer explains why he thinks corrigibility is unnatural in this comment.

Discussion with Eliezer Yudkowsky on AGI interventions

Couple things:

First, there is a lot of work in the "alignment community" that involves (for example) decision theory or open-source-game-theory or acausal trade, and I haven't found any of it helpful for what I personally think about (which I'd like to think is "directly attacking the heart of the problem", but others may judge for themselves when my upcoming post series comes out!).

I guess I see this subset of work as consistent with the hypothesis "some people have been nerd-sniped!". But it's also consistent with "some people have reasonable beliefs and I don't share them, or maybe I haven't bothered to understand them". So I'm a bit loath to go around criticizing them, without putting more work into it. But still, this is a semi-endorsement of one of the things you're saying.

Second, my understanding of MIRI (as an outsider, based purely on my vague recollection of their newsletters etc., and someone can correct me) is that (1) they have a group working on "better understand agent foundations", and this group contains Abram and Scott, and they publish pretty much everything they're doing, (2) they have a group working on undisclosed research projects, which are NOT "better understand agent foundations", (3) they have a couple "none of the above" people including Evan and Vanessa. So I'm confused that you seem to endorse what Abram and Scott are doing, but criticize agent foundations work at MIRI.

Like, maybe people "in the AI alignment community" are being nerd-sniped, and maybe MIRI had a historical role in how that happened, but I'm not sure there's any actual MIRI employee right now who is doing nerd-sniped-type work, to the best of my limited understanding, unless we want to say Scott is, but you already said Scott is OK in your book.

(By the way, hot takes: I join you in finding some of Abram's posts to be super helpful, and would throw Stuart Armstrong onto the "super helpful" list too, assuming he counts as "MIRI". As for Scott: ironically, I find logical induction very useful when thinking about how to build AGI, and somewhat less useful when thinking about how to align it. :-P I didn't get anything useful for my own thinking out of his Cartesian frames or finite factored sets, but as above, that could just be me; I'm very loath to criticize without doing more work, especially as they're works in progress, I gather.)
