I'm an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms. See https://sjbyrnes.com/agi.html Email: email@example.com
(BTW a lot of my thinking here came straight out of reading your model splintering posts. But maybe I've kinda wandered off in a different direction.)
So then in the scenario you mentioned, let's assume that we've set up the AI such that actions that pattern-match to "push the world into uncharted territory" are treated as unacceptable (which I guess seems like a plausibly good idea). But the AI is also motivated to get something done—say, solve global warming. And it finds a possible course of action which pattern-matches very well to "solve global warming", but alas, it also pattern-matches to "push the world into uncharted territory". The AI could reason that, if it queries the human (by "pressing the button" to send the data-dump), there's at least a chance that the human would edit its systems such that this course of action would no longer be unacceptable. So it would presumably do so.
In other words, this is a situation where the AI's motivational system is sending it mixed signals—it wants to "solve global warming", it doesn't want to "push the world into uncharted territory", and this course of action is both. And let's assume that the AI can't easily come up with an alternative course of action that would solve global warming without any problematic aspects. So the AI asks the human what they think about this plan. Seems reasonable, I guess.
I haven't thought this through very much and look forward to you picking holes in it :)
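The decision logic I have in mind could be sketched roughly like this (a toy illustration; the string labels stand in for whatever pattern-matching machinery the AI actually uses, and the category names are just the examples from above):

```python
def evaluate(matches):
    """Toy decision rule for a plan, given the set of learned
    concepts that the plan pattern-matches to."""
    desirable = "solve global warming" in matches
    unacceptable = "push the world into uncharted territory" in matches
    if desirable and unacceptable:
        # Mixed signals: ask the human, i.e. "press the button".
        return "query the human"
    if unacceptable:
        return "reject"
    return "execute" if desirable else "ignore"
```

So a plan that matches both categories triggers the query, even though a purely-unacceptable plan would just be rejected.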
How about "if I contain two subagents with different goals, they should execute Pareto-improving trades with each other"? This is an aspect of "becoming more rational", but it's not very well described by your maxim, because the maxim includes "your goal" as if that's well defined, right?
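As a toy illustration of what "Pareto-improving" means here (the numbers and setup are mine, purely for illustration):

```python
def pareto_improving(utils_before, utils_after):
    """A trade is Pareto-improving if no subagent is worse off
    and at least one is strictly better off.
    utils_before/after: tuples of each subagent's utility."""
    no_one_worse = all(a >= b for a, b in zip(utils_after, utils_before))
    someone_better = any(a > b for a, b in zip(utils_after, utils_before))
    return no_one_worse and someone_better
```

The point is that this criterion never needs a single unified "your goal"—it only compares each subagent's utility against its own baseline.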
Unrelated topic: Maybe I didn't read carefully enough, but intuitively I treat "making a plan" and "executing a plan" as different, and I normally treat the word "planning" as referring just to the former, not the latter. Is that what you mean? Because executing a plan is obviously necessary too…
Ah, so you are arguing against (3)? (And what's your stance on (1)?)
Let's say you are assigned to be Alice's personal assistant.
But I feel it would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.
Hmm, I'm probably misunderstanding, but I feel like maybe you're making an argument like this:
(My probably-inaccurate elaboration of your argument.) We're making an extremely long list of the things that Alice cares about: "I like having all my teeth, and I like being able to watch football, and I like a pretty view out my window, etc. etc. etc." And each item that we add to the list costs one unit of value-alignment effort. And then "acting conservatively in regards to violating human preferences and norms in general, and in regards to Alice's preferences in particular" requires a very long list, and "synthesizing Alice's utility function" requires an only-slightly-longer list. Therefore we might as well do the latter.
But I don't think it's like that. For example, I think if an AGI watches a bunch of YouTube videos, it will be able to form a decent concept of "doing things that people would widely regard as uncontroversial and compatible with prevailing norms", and we can make it motivated to restrict its actions to that subspace via a constant amount of value-loading effort, i.e. with an amount of value-loading effort that does not scale with how complex those prevailing norms are. (More complex prevailing norms would require having the AGI watch more YouTube videos before it understands the prevailing norms, but it would not require more value-loading effort, i.e. the step where we edit the AGI's motivation such that it wants to follow prevailing norms would not be any harder.)
But I think it would take a lot more value-loading effort than that to really get a particular person's preferences, including all their idiosyncrasies and edge cases.
Here are three things that I believe:
For example, I feel like it should be possible to make an AGI that understands human values and preferences well enough to reliably and conservatively avoid doing things that humans would see as obviously or even borderline unacceptable / problematic. So if you put it in the trolley problem, it says "I don't know, neither of those options seems obviously acceptable, so I am going to default to NOOP and let my supervisor take actions." Meanwhile, the AGI is also motivated to make me a cup of tea. Such an AGI seems pretty good to me. But it's contrary to (3).
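The "default to NOOP" behavior I'm imagining could be sketched like this (a toy illustration; the acceptability scores and threshold are made-up parameters standing in for whatever the AGI's learned norm-model actually outputs):

```python
NOOP = "do nothing; defer to supervisor"

def choose(options, acceptability, threshold=0.9):
    """Pick the highest-scoring clearly-acceptable option;
    if nothing clears the bar, default to NOOP."""
    acceptable = [o for o in options if acceptability(o) >= threshold]
    if not acceptable:
        return NOOP
    return max(acceptable, key=acceptability)
```

In the trolley problem, neither option clears the bar, so it NOOPs; making tea clears the bar easily, so it goes ahead.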
I think this post is mainly arguing in favor of (2), and maybe weakly / implicitly arguing against (1). Is that right? And I'm not sure whether it's for or against (3).
Note that assumption 4 also needs to include a claim that we figure out the "secret sauce" sooner than other paths to AGI, despite lots of effort being put into them already.
Yup, "time until AGI via one particular path" is always an upper bound to "time until AGI". I added a note, thanks.
These seem easily like the load-bearing part of the argument; I agree the stuff you listed follows from these assumptions but why should these assumptions be true?
The only thing I'm arguing in this particular post is "IF assumptions THEN conclusion". This post is not making any argument whatsoever that you should put a high credence on the assumptions being true. :-)
The part I'll agree with is: if we look at a dial, we can ask the question:
If there's an AGI with a safety-capabilities tradeoff dial, to what extent is the dial's setting externally legible / auditable to third parties?
More legible / auditable is better, because it could help enforcement.
I agree with this, and I have just added it to the article. But I disagree with your suggestion that this is counter to what I wrote. In my mind, it's an orthogonal dimension along which dials can vary. I think it's good if the dial is auditable, and I think it's also good if the dial corresponds to a very low alignment tax rate.
I interpret your comment as saying that the alignment tax rate doesn't matter because there will be enforcement, but I disagree with that. I would invoke an analogy to actual taxes. It is already required and enforced that individuals and companies pay (normal) taxes. But everyone knows that a 0.1% tax on Thing X will have a higher compliance rate than an 80% tax on Thing X, other things equal.
After all, everyone is making decisions about whether to pay the tax, versus not pay the tax. Not paying the tax has costs. It's a cost to hire lawyers that can do complicated accounting tricks. It's a cost to run the risk of getting fined or imprisoned. It's a cost to pack up your stuff and move to an anarchic war zone, or to a barge in the middle of the ocean, etc. It's a cost to get pilloried in the media for tax evasion. People will ask themselves: are these costs worth the benefits? If the tax is 0.1%, maybe it's not worth it, maybe it's just way better to avoid all that trouble by paying the tax. If the tax is 80%, maybe it is worth it to engage in tax evasion.
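The comparison above amounts to a toy expected-cost calculation (all numbers made up for illustration):

```python
def complies(tax_rate, income, evasion_cost):
    """An actor pays the tax iff paying is cheaper than the total
    cost of evading (lawyers, risk of fines, moving to a barge, ...)."""
    return tax_rate * income < evasion_cost

income = 100.0
evasion_cost = 20.0   # made-up lump sum for all the costs of evasion

low_tax = complies(0.001, income, evasion_cost)   # 0.1% tax: comply
high_tax = complies(0.80, income, evasion_cost)   # 80% tax: evade
```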
So anyway, I agree that "there will be good enforcement" is plausibly part of the answer. But good enforcement plus low tax will sum up to higher compliance than good enforcement by itself. Unless you think "perfect watertight enforcement" is easy, so that "willingness to comply" becomes completely irrelevant. That strikes me as overly optimistic. Perfect watertight enforcement of anything is practically nonexistent in this world. Perfect watertight enforcement of experimental AGI research would strike me as especially hard. After all, AGI research is feasible to do in a hidden basement / anarchic war zone / barge in the middle of the ocean / secret military base / etc. And there are already several billion GPUs untraceably dispersed all across the surface of Earth.
Great post, thanks for writing it!!
The links to http://modem2021.cs.nuigalway.ie/ are down at the moment, is that temporary, or did the website move or something?
Is it fair to say that all the things you're doing with multi-objective RL could also be called "single-objective RL with a more complicated objective"? Like, if you calculate the vector of values V, and then use a scalarization function S, then I could just say to you "Nope, you're doing normal single-objective RL, using the objective function S(V)." Right?
(Not that there's anything wrong with that, just want to make sure I understand.)
…this pops out at me because the two reasons I personally like multi-objective RL are not like that. Instead they're things that I think you genuinely can't do with one objective function, even a complicated one built out of multiple pieces combined nonlinearly. Namely, (1) transparency/interpretability [because a human can inspect the vector V], and (2) real-time control [because a human can change the scalarization function on the fly]. Incidentally, I think (2) is part of how brains work; an example of the real-time control is that if you're hungry, entertaining a plan that involves eating gets extra points from the brainstem/hypothalamus (positive coefficient), whereas if you're nauseous, it loses points (negative coefficient). That's my model anyway, you can disagree :) As for transparency/interpretability, I've suggested that maybe the vector V should have thousands of entries, like one for every word in the dictionary … or even millions of entries, or infinity, I dunno, can't have too much of a good thing. :-)
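A minimal sketch of what I mean by (2)—the plan names, objectives, and weights here are all made up for illustration, not from any particular multi-objective RL library:

```python
# Each candidate plan gets a *vector* of values, one entry per objective.
V = {
    "make_snack": {"eat": 0.9, "rest": 0.1},
    "take_nap":   {"eat": 0.1, "rest": 0.8},
}

def scalarize(values, weights):
    """Linear scalarization S(V) = sum_i w_i * V_i."""
    return sum(weights[k] * v for k, v in values.items())

def best_plan(V, weights):
    return max(V, key=lambda plan: scalarize(V[plan], weights))

# Real-time control: swap the coefficients without retraining anything.
w_hungry = {"eat": 1.0, "rest": 0.2}    # hungry: eating gets extra points
w_nauseous = {"eat": -1.0, "rest": 0.2}  # nauseous: eating loses points
```

A human (or a brainstem/hypothalamus, in my analogy) can inspect the vector V directly, and flip the sign on "eat" on the fly—neither of which you can do if V was already collapsed into one scalar at training time.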
I was writing a kinda long reply but maybe I should first clarify: what do you mean by "model"? Can you give examples of ways that I could learn something (or otherwise change my synapses within a lifetime) that you wouldn't characterize as "changes to my mental model"? For example, which of the following would be "changes to my mental model"?
FWIW my inclination is to say that 1-4 are all "changes to my mental model". And 5 involves both changes to my mental model (knowing that I'm grumpy), and changes to the inputs to my mental model (I feel different "feelings" than I otherwise would—I think of those as inputs going into the model, just like visual inputs go into the model). Is there anything wrong / missing / suboptimal about that definition?
This one kinda confuses me. I'm of the opinion that the human brain is "constructed with a model explicitly, so that identifying the model is as simple as saying "the model is in this sub-module, the one labelled 'model'"." Of course the contents of the model are learned, but I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer. If that's right, then "it's hard to find the model (if any) in a trained model-free RL agent" is a disanalogy to "AIs learning human values". It would be more analogous to just train a MuZero clone, which has a labeled "model" component, instead of training a model-free RL agent.
And then looking at weights and activations would also be disanalogous to "AIs learning human values", since we probably won't have those kinds of real-time-brain-scanning technologies, right?
Sorry if I'm misunderstanding.
Ah, you mean that "alignment" is a different problem than "subhuman and human-imitating training safety"? :P
"Quantilizing from the human policy" is human-imitating in a sense, but also superhuman. At least modestly superhuman, depending on how hard you quantilize. (And maybe very superhuman in speed.)
If you could fork your brain state to create an exact clone, would that clone be "aligned" with you? I think that we should define the word "aligned" such that the answer is "yes". Common sense, right?
Seems to me that if you say "yes it's aligned" to that question, then you should also say "yes it's aligned" to a quantilize-from-the-human-policy agent. It's kinda in the same category, seems to me.
Hmm, Stuart Armstrong suggested here that "alignment is conditional: an AI is aligned with humans in certain circumstances, at certain levels of power." So then maybe as you quantilize harder and harder, you get less and less confident in that system's "alignment"?
(I'm not sure we're disagreeing about anything substantive, just terminology, right? Also, I don't actually personally buy into this quantilization picture, to be clear.)
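For concreteness, here's a minimal sketch of the quantilization picture being discussed (treating the base "human" policy as uniform over a finite action list; the parameter q and everything else here is just for illustration):

```python
import random

def quantilize(actions, utility, q, rng=random):
    """Sample uniformly from the top-q fraction of actions, ranked by
    utility. Smaller q = quantilizing harder; as q -> 0 this approaches
    pure argmax, which is where the "alignment" worry kicks in."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])
```

At q = 1 this just imitates the base policy; at very small q it's effectively an optimizer—which matches Stuart's point that "alignment" of the system might degrade as you turn that dial.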