Steve Byrnes

I'm an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms. See Email:



General alignment plus human values, or alignment via human values?


  1. I want the AI to have criteria that qualify actions as acceptable, e.g. "it pattern-matches less than 1% to 'I'm causing destruction', and it pattern-matches less than 1% to 'the supervisor wouldn't like this', and it pattern-matches less than 1% to 'I'm changing my own motivation and control systems', and … etc. etc."
  2. If no action is acceptable, I want NOOP to be hardcoded as an always-acceptable default—a.k.a. "being paralyzed by indecision" in the face of a situation where all the options seem problematic. And then we humans are responsible for not putting the AI in situations where fast decisions are necessary and inaction is dangerous, like running the electric grid or driving a car. 

    (At some point we do want an AI that can run the electric grid and drive a car etc. But maybe we can bootstrap our way there, and/or use less-powerful narrow AIs in the meantime.)
  3. A failure mode of (2) is that we could get an AI that is paralyzed by indecision always, and never does anything. To avoid this failure mode, we want the AI to be able to (and motivated to) gather evidence that might show that a course of action deemed problematic is in fact acceptable after all. This would probably involve asking questions to the human supervisor.
  4. A failure mode of (3) is that the AI frames the questions in order to get an answer that it wants. To avoid this failure mode, we would set things up such that the AI's normal motivation system is not in charge of choosing what words to say when querying the human. For example, maybe the AI is not really "asking a question" at all, at least not in the normal sense; instead it's sending a data-dump to the human, and the human then inspects this data-dump with interpretability tools, and makes an edit to the AI's motivation parameters. (In this case, maybe the AI's normal motivation system is choosing to "press the button" that sends the data-dump, but it does not have direct control over the contents of the data-dump.) Separately, we would also set up the AI such that it's motivated to not manipulate the human, and also motivated to not sabotage its own motivation and control systems.
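To make the four steps above concrete, here's a toy sketch. Everything in it is an assumption for illustration: the detector functions, the `usefulness` scores, and the names `AcceptabilityFilter`, `choose_action`, and `build_data_dump` are all made up, and a real system would obviously not be keyword-matching on strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

NOOP = "NOOP"

# Hypothetical: each detector maps an action to a pattern-match score in [0, 1].
Detector = Callable[[str], float]


@dataclass
class AcceptabilityFilter:
    detectors: Dict[str, Detector]  # concept name -> scoring function
    threshold: float = 0.01         # "pattern-matches less than 1%"

    def violations(self, action: str) -> List[str]:
        """Names of the concepts that this action matches too strongly."""
        return [name for name, score in self.detectors.items()
                if score(action) >= self.threshold]

    def is_acceptable(self, action: str) -> bool:
        # Step 2: NOOP is hardcoded as always acceptable.
        return action == NOOP or not self.violations(action)


def choose_action(candidates: List[str],
                  usefulness: Callable[[str], float],
                  flt: AcceptabilityFilter) -> Tuple[str, bool]:
    """Return (action, wants_query). Step 1 filters; step 2 falls back to NOOP."""
    acceptable = [a for a in candidates if flt.is_acceptable(a)]
    if acceptable:
        return max(acceptable, key=usefulness), False
    # Step 3: everything was blocked, so flag that a supervisor query is wanted.
    return NOOP, bool(candidates)


def build_data_dump(candidates: List[str],
                    flt: AcceptabilityFilter) -> Dict[str, List[str]]:
    """Step 4: the dump is assembled by fixed code, not by the action chooser."""
    return {a: flt.violations(a) for a in candidates if not flt.is_acceptable(a)}


# Toy detectors and usefulness scores, purely illustrative.
detectors = {
    "causing destruction": lambda a: 0.9 if "demolish" in a else 0.0,
    "supervisor wouldn't like this": lambda a: 0.5 if "secret" in a else 0.0,
}
flt = AcceptabilityFilter(detectors)
usefulness = {"make tea": 1.0, "demolish wall": 5.0}.get

easy_case = choose_action(["make tea", "demolish wall"], usefulness, flt)
hard_case = choose_action(["demolish wall"], usefulness, flt)
dump = build_data_dump(["demolish wall"], flt)
```

In the easy case the filter just picks the most useful acceptable action. In the hard case the agent returns NOOP plus a flag saying it would like to query the supervisor, and the contents of the query come from `build_data_dump`, which the action-choosing code can trigger but not write.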

(BTW a lot of my thinking here came straight out of reading your model splintering posts. But maybe I've kinda wandered off in a different direction.)

So then in the scenario you mentioned, let's assume that we've set up the AI such that actions that pattern-match to "push the world into uncharted territory" are treated as unacceptable (which I guess seems like a plausibly good idea). But the AI is also motivated to get something done—say, solve global warming. And it finds a possible course of action which pattern-matches very well to "solve global warming", but alas, it also pattern-matches to "push the world into uncharted territory". The AI could reason that, if it queries the human (by "pressing the button" to send the data-dump), there's at least a chance that the human would edit its systems such that this course of action would no longer be unacceptable. So it would presumably do so.

In other words, this is a situation where the AI's motivational system is sending it mixed signals—it does want to "solve global warming", but it doesn't want to "push the world into uncharted territory", but this course of action is both. And let's assume that the AI can't easily come up with an alternative course of action that would solve global warming without any problematic aspects. So the AI asks the human what they think about this plan. Seems reasonable, I guess.

I haven't thought this through very much and look forward to you picking holes in it :)

P₂B: Plan to P₂B Better

How about "if I contain two subagents with different goals, they should execute Pareto-improving trades with each other"? This is an aspect of "becoming more rational", but it's not very well described by your maxim, because the maxim includes "your goal" as if that's well defined, right?

Unrelated topic: Maybe I didn't read carefully enough, but intuitively I treat "making a plan" and "executing a plan" as different, and I normally treat the word "planning" as referring just to the former, not the latter. Is that what you mean? Because executing a plan is obviously necessary too ....

General alignment plus human values, or alignment via human values?

Ah, so you are arguing against (3)? (And what's your stance on (1)?)

Let's say you are assigned to be Alice's personal assistant.

  • Suppose Alice says "Try to help me as much as you can, while being VERY sure to avoid actions that I would regard as catastrophically bad. When in doubt, just don't do anything at all, that's always OK with me." I feel like Alice is not asking too much of you here. You'll observe her a lot, and ask her a lot of questions especially early on, and sometimes you'll fail to be useful, because helping her would require choosing among options that all seem fraught. But still, I feel like this is basically doable. And pretty robust, because you'll presumably only take actions when you have many independent lines of evidence that those actions are acceptable—e.g. you've seen Alice do similar things, and you've seen other people do similar things while Alice watched and she seemed happy, and also you explicitly asked Alice and she said it was fine.
  • Suppose Alice says "You need to distill my preferences into a utility function, and then go all-out, taking actions that set that utility function to its global maximum. So in particular, in every possible situation, no matter how bizarre, you will have preferences that match my preferences [or match the preferences that I would have reached upon deliberating following my meta-preferences, or whatever]." I feel like Alice is asking for something very very hard here. And that it's much more prone to catastrophic failure if anything goes wrong in the construction of the utility function—e.g. Alice gets confused and describes something wrong, or you misunderstand her.


But I feel it would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.

Hmm, I'm probably misunderstanding, but I feel like maybe you're making an argument like this:

(My probably-inaccurate elaboration of your argument.) We're making an extremely long list of the things that Alice cares about: "I like having all my teeth, and I like being able to watch football, and I like a pretty view out my window, etc. etc. etc." And each item that we add to the list costs one unit of value-alignment effort. And then "acting conservatively in regards to violating human preferences and norms in general, and in regards to Alice's preferences in particular" requires a very long list, and "synthesizing Alice's utility function" requires an only-slightly-longer list. Therefore we might as well do the latter.

But I don't think it's like that. For example, I think if an AGI watches a bunch of YouTube videos, it will be able to form a decent concept of "doing things that people would widely regard as uncontroversial and compatible with prevailing norms", and we can make it motivated to restrict its actions to that subspace via a constant amount of value-loading effort, i.e. with an amount of value-loading effort that does not scale with how complex those prevailing norms are. (More complex prevailing norms would require having the AGI watch more YouTube videos before it understands the prevailing norms, but it would not require more value-loading effort, i.e. the step where we edit the AGI's motivation such that it wants to follow prevailing norms would not be any harder.)

But I think it would take a lot more value-loading effort than that to really get a particular person's preferences, including all its idiosyncrasies and edge-cases.

General alignment plus human values, or alignment via human values?

Here are three things that I believe:

  1. "aiming the AGI's motivation at something-in-particular" is a different technical research problem from "figuring out what that something-in-particular should be", and we need to pursue both these research problems in parallel, since they overlap relatively little.
  2. There is no circumstance where any reasonable person would want to build an AGI whose motivation has no connection whatsoever to human values / preferences / norms.
  3. We don't necessarily want to do "ambitious value alignment", i.e., to build an AGI that fully understands absolutely everything we want and care about in life and adopts those goals as its own, such that if I disappear in a puff of smoke the AGI can continue pursuing my goals and meta-goals in my stead.

For example, I feel like it should be possible to make an AGI that understands human values and preferences well enough to reliably and conservatively avoid doing things that humans would see as obviously or even borderline unacceptable / problematic. So if you put it in the trolley problem, it says "I don't know, neither of those options seems obviously acceptable, so I am going to default to NOOP and let my supervisor take actions." Meanwhile, the AGI is also motivated to make me a cup of tea. Such an AGI seems pretty good to me. But it's contrary to (3).

I think this post is mainly arguing in favor of (2), and maybe weakly / implicitly arguing against (1). Is that right? And I'm not sure whether it's for or against (3).

Brain-inspired AGI and the "lifetime anchor"

Note that assumption 4 also needs to include a claim that we figure out the "secret sauce" sooner than other paths to AGI, despite lots of effort being put into them already.

Yup, "time until AGI via one particular path" is always an upper bound to "time until AGI". I added a note, thanks.

These seem easily like the load-bearing part of the argument; I agree the stuff you listed follows from these assumptions but why should these assumptions be true? 

The only thing I'm arguing in this particular post is "IF assumptions THEN conclusion". This post is not making any argument whatsoever that you should put a high credence on the assumptions being true. :-)

Safety-capabilities tradeoff dials are inevitable in AGI

The part I'll agree with is: If we look at a dial, we can ask the question:

If there's an AGI with a safety-capabilities tradeoff dial, to what extent is the dial's setting externally legible / auditable to third parties?

More legible / auditable is better, because it could help enforcement.

I agree with this, and I have just added it to the article. But I disagree with your suggestion that this is counter to what I wrote. In my mind, it's an orthogonal dimension along which dials can vary. I think it's good if the dial is auditable, and I think it's also good if the dial corresponds to a very low alignment tax rate.

I interpret your comment as saying that the alignment tax rate doesn't matter because there will be enforcement, but I disagree with that. I would invoke an analogy to actual taxes. It is already required and enforced that individuals and companies pay (normal) taxes. But everyone knows that a 0.1% tax on Thing X will have a higher compliance rate than an 80% tax on Thing X, other things equal.

After all, everyone is making decisions about whether to pay the tax, versus not pay the tax. Not paying the tax has costs. It's a cost to hire lawyers that can do complicated accounting tricks. It's a cost to run the risk of getting fined or imprisoned. It's a cost to pack up your stuff and move to an anarchic war zone, or to a barge in the middle of the ocean, etc. It's a cost to get pilloried in the media for tax evasion. People will ask themselves: are these costs worth the benefits? If the tax is 0.1%, maybe it's not worth it, maybe it's just way better to avoid all that trouble by paying the tax. If the tax is 80%, maybe it is worth it to engage in tax evasion.

So anyway, I agree that "there will be good enforcement" is plausibly part of the answer. But good enforcement plus low tax will sum up to higher compliance than good enforcement by itself. Unless you think "perfect watertight enforcement" is easy, so that "willingness to comply" becomes completely irrelevant. That strikes me as overly optimistic. Perfect watertight enforcement of anything is practically nonexistent in this world. Perfect watertight enforcement of experimental AGI research would strike me as especially hard. After all, AGI research is feasible to do in a hidden basement / anarchic war zone / barge in the middle of the ocean / secret military base / etc. And there are already several billion GPUs untraceably dispersed all across the surface of Earth.
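Here's a toy version of the tax analogy, in case it helps. The model is my own simplification, not a claim about real compliance data: each actor has some total cost of evading (lawyers, legal risk, relocation, bad press), expressed as a fraction of the taxed amount, and complies whenever the tax is no larger than that cost. The numbers in `evasion_costs` are invented.

```python
from typing import List


def compliance_rate(tax_rate: float, evasion_costs: List[float]) -> float:
    """Fraction of actors for whom paying the tax is no costlier than evading it."""
    compliant = sum(1 for cost in evasion_costs if tax_rate <= cost)
    return compliant / len(evasion_costs)


# Hypothetical per-actor evasion costs, as fractions of the taxed amount.
evasion_costs = [0.005, 0.02, 0.10, 0.30, 0.50, 0.90]

# Hypothetical "better enforcement": evasion becomes twice as costly for everyone.
enforced_costs = [2 * cost for cost in evasion_costs]

low_tax = compliance_rate(0.001, evasion_costs)            # 0.1% tax
high_tax = compliance_rate(0.80, evasion_costs)            # 80% tax
high_tax_enforced = compliance_rate(0.80, enforced_costs)  # 80% tax, more enforcement
```

With these made-up numbers, everyone complies with the 0.1% tax, one actor in six complies with the 80% tax, and doubling the cost of evasion only raises that to two in six. Enforcement helps, but it doesn't make the tax rate irrelevant unless evasion costs go to infinity for everyone.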

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Great post, thanks for writing it!!

The links to are down at the moment, is that temporary, or did the website move or something?

Is it fair to say that all the things you're doing with multi-objective RL could also be called "single-objective RL with a more complicated objective"? Like, if you calculate the vector of values V, and then use a scalarization function S, then I could just say to you "Nope, you're doing normal single-objective RL, using the objective function S(V)." Right?

(Not that there's anything wrong with that, just want to make sure I understand.)

…this pops out at me because the two reasons I personally like multi-objective RL are not like that. Instead they're things that I think you genuinely can't do with one objective function, even a complicated one built out of multiple pieces combined nonlinearly. Namely, (1) transparency/interpretability [because a human can inspect the vector V], and (2) real-time control [because a human can change the scalarization function on the fly]. Incidentally, I think (2) is part of how brains work; an example of the real-time control is that if you're hungry, entertaining a plan that involves eating gets extra points from the brainstem/hypothalamus (positive coefficient), whereas if you're nauseous, it loses points (negative coefficient). That's my model anyway, you can disagree :) As for transparency/interpretability, I've suggested that maybe the vector V should have thousands of entries, like one for every word in the dictionary … or even millions of entries, or infinity, I dunno, can't have too much of a good thing. :-)
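A minimal sketch of what I mean by (1) and (2), assuming a linear scalarization for simplicity (the real S could be nonlinear). The entries of the value vector and the coefficient values are hypothetical.

```python
from typing import Dict


def scalarize(values: Dict[str, float], coefficients: Dict[str, float]) -> float:
    """Linear scalarization: S(V) = sum over i of c_i * V_i."""
    return sum(coefficients.get(name, 0.0) * v for name, v in values.items())


# Hypothetical value vector V for the plan "eat a sandwich".
# (1) Transparency: a human can read these entries directly.
plan_values = {"eating": 1.0, "effort": -0.2}

# (2) Real-time control: the coefficients change with state, V does not.
coefficients_hungry = {"eating": 2.0, "effort": 1.0}     # positive coefficient
coefficients_nauseous = {"eating": -2.0, "effort": 1.0}  # negative coefficient

score_hungry = scalarize(plan_values, coefficients_hungry)
score_nauseous = scalarize(plan_values, coefficients_nauseous)
```

The same plan with the same V comes out positive when hungry and negative when nauseous. If you had only ever learned a single number S(V), you would have to relearn it whenever the coefficients changed, and there would be no vector for a human to inspect.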

Force neural nets to use models, then detect these

I was writing a kinda long reply but maybe I should first clarify: what do you mean by "model"? Can you give examples of ways that I could learn something (or otherwise change my synapses within a lifetime) that you wouldn't characterize as "changes to my mental model"? For example, which of the following would be "changes to my mental model"?

  1. I learn that Brussels is the capital of Belgium
  2. I learn that it's cold outside right now
  3. I taste a new brand of soup and find that I really like it
  4. I learn to ride a bicycle, including
    1. maintaining balance via fast hard-to-describe responses where I shift my body in certain ways in response to different sensations and perceptions
    2. being able to predict how the bicycle and I would move if I swung my arm around
  5. I didn't sleep well so now I'm grumpy

FWIW my inclination is to say that 1-4 are all "changes to my mental model". And 5 involves both changes to my mental model (knowing that I'm grumpy), and changes to the inputs to my mental model (I feel different "feelings" than I otherwise would—I think of those as inputs going into the model, just like visual inputs go into the model). Is there anything wrong / missing / suboptimal about that definition?

Force neural nets to use models, then detect these

This one kinda confuses me. I'm of the opinion that the human brain is "constructed with a model explicitly, so that identifying the model is as simple as saying "the model is in this sub-module, the one labelled 'model'"." Of course the contents of the model are learned, but I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer. If that's right, then "it's hard to find the model (if any) in a trained model-free RL agent" is a disanalogy to "AIs learning human values". It would be more analogous to just train a MuZero clone, which has a labeled "model" component, instead of training a model-free RL agent.

And then looking at weights and activations would also be disanalogous to "AIs learning human values", since we probably won't have those kinds of real-time-brain-scanning technologies, right?

Sorry if I'm misunderstanding.

My take on Vanessa Kosoy's take on AGI safety

Ah, you mean that "alignment" is a different problem than "subhuman and human-imitating training safety"? :P

"Quantilizing from the human policy" is human-imitating in a sense, but also superhuman. At least modestly superhuman - depends on how hard you quantilize. (And maybe very superhuman in speed.)

If you could fork your brain state to create an exact clone, would that clone be "aligned" with you? I think that we should define the word "aligned" such that the answer is "yes". Common sense, right?

Seems to me that if you say "yes it's aligned" to that question, then you should also say "yes it's aligned" to a quantilize-from-the-human-policy agent. It's kinda in the same category, seems to me.

Hmm, Stuart Armstrong suggested here that "alignment is conditional: an AI is aligned with humans in certain circumstances, at certain levels of power." So then maybe as you quantilize harder and harder, you get less and less confident in that system's "alignment"?

(I'm not sure we're disagreeing about anything substantive, just terminology, right? Also, I don't actually personally buy into this quantilization picture, to be clear.)
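For anyone who wants "how hard you quantilize" spelled out, here's a sampling-based sketch. It's an approximation I'm assuming for illustration, not a definitive implementation: draw actions from the human policy, keep the top `q` fraction by utility, and pick one of those uniformly at random. The function name, the `n_samples` parameter, and the toy policy and utility are all hypothetical.

```python
import math
import random
from typing import Callable, Optional, TypeVar

A = TypeVar("A")


def quantilize(sample_human_action: Callable[[random.Random], A],
               utility: Callable[[A], float],
               q: float,
               n_samples: int = 1000,
               rng: Optional[random.Random] = None) -> A:
    """Pick uniformly among the top-q fraction of actions sampled from the human policy."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    samples = [sample_human_action(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)      # best actions first
    top_k = max(1, math.ceil(q * n_samples))     # size of the top-q slice
    return rng.choice(samples[:top_k])


# Toy human policy: picks an integer "action" from 0 to 99 uniformly at random.
def toy_human(rng: random.Random) -> int:
    return rng.randrange(100)


# Toy utility: bigger numbers are better.
def toy_utility(action: int) -> float:
    return float(action)


imitation = quantilize(toy_human, toy_utility, q=1.0, rng=random.Random(0))
mild = quantilize(toy_human, toy_utility, q=0.1, rng=random.Random(0))
```

At `q=1.0` this is plain imitation of the human policy. As `q` shrinks, the output moves toward "the best thing a human would ever plausibly do", and then toward straight maximization over the samples, which is where I'd expect confidence in "alignment" to start dropping.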
