This is a special post for short-form writing by Richard Ngo. Only they can create top-level comments.
I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they'd converge to it eventually, but my guess is that this would take long enough that we'd already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the "convergence" argument). Analogously, humans don't care very much at all about the specific connections between our reward centers and the rest of our brains - insofar as we do want to influence them it's because we care about much more directly-observable p
Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has "policy learns to care about reward directly" as a footnote; I can imagine updating it based on the outcome of this discussion though.
For someone who's read v1 of this paper, what would you recommend as the best
way to "update" to v3? Is an entire reread the best approach?
[Edit March 11, 2023: Having now read the new version in full, my recommendation
to anyone else with the same question is a full reread.]
Ajeya Cotra (10mo):
Note that the "without countermeasures" post consistently discusses both
possibilities (the model cares about reward or the model cares about something
else that's consistent with it getting very high reward on the training
dataset). E.g. see this paragraph from the above-the-fold intro:
As well as the section Even if Alex isn't "motivated" to maximize reward.... I
do place a ton of emphasis on the fact that Alex enacts a policy which has the
empirical effect of maximizing reward, but that's distinct from being confident
in the motivations that give rise to that policy. I believe Alex would try very
hard to maximize reward in most cases, but this could be for either terminal or
instrumental reasons.
With that said, for roughly the reasons Paul says above, I think I probably do
have a disagreement with Richard -- I think that caring about some version of
reward is pretty plausible (~50% or so). It seems pretty natural and easy to
grasp to me, and because I think there will likely be continuous online training
the argument that there's no notion of reward on the deployment distribution
doesn't feel compelling to me.
Richard Ngo (10mo):
Yepp, agreed, the thing I'm objecting to is how you mainly focus on the reward
case, and then say "but the same dynamics apply in other cases too..."
The problem is that you need to reason about generalization to novel situations
somehow, and in practice that ends up being by reasoning about the underlying
motivations (whether implicitly or explicitly).
Paul Christiano (10mo):
I'm not very convinced by this comment as an objection to "50% AI grabs power to
get reward." (I find it more plausible as an objection to "AI will definitely
grab power to get reward.")
This seems to be most of your position but I'm skeptical (and it's kind of just
asserted without argument):
* The data used in training is literally the only thing that AI systems
observe, and prima facie reward just seems like another kind of data that
plays a similarly central role. Maybe your "unnaturalness" abstraction can
make finer-grained distinctions than that, but I don't think I buy it.
* If people train their AI with RLDT then the AI is literally being trained to
predict reward! I don't see how this is remote, and I'm not clear if your
position is that e.g. the value function will be bad at predicting reward
because it is an "unnatural" target for supervised learning.
* I don't understand the analogy with humans. It sounds like you are saying "an
AI system selected based on the reward of its actions learns to select
actions it expects to lead to high reward" would be analogous to "humans care about
the details of their reward circuitry." But:
* I don't think human learning is just RL based on the reward circuit; I
think this is at least a contrarian position and it seems unworkable to me
as an explanation of human behavior.
* It seems like the analogous conclusion for RL systems would be "they may
not care about the rewards that go into the SGD update, they may instead
care about the rewards that get entered into the dataset, or even something
further causally upstream of that as long as it's very well-correlated on
the training set." But it doesn't matter what we choose that's causally
upstream of rewards, as long as it's perfectly correlated on the training
set?
* (Or you could be saying that humans are motivated by pleasure and pain but
not the entire suite of things that are upstream of reward...
Alex Turner (9mo):
(Emphasis added)
I don't think this engages with the substance of the analogy to humans. I don't
think any party in this conversation believes that human learning is "just" RL
based on a reward circuit, and I don't believe it either. "Just RL" also isn't
necessary for the human case to give evidence about the AI case. Therefore, your
summary seems to me like a strawman of the argument.
I would say "human value formation mostly occurs via RL & algorithms
meta-learned thereby, but in the important context of SSL / predictive
processing, and influenced by inductive biases from high-level connectome
topology and genetically specified reflexes and environmental regularities
and..."
Furthermore, we have good evidence that RL plays an important role in human
learning. For example, from The shard theory of human values:
Paul Christiano (9mo):
This is incredibly weak evidence.
* Animals were selected over millions of generations to effectively pursue
external goals. So yes, they have external goals.
* Humans also engage in within-lifetime learning, so of course you see all
kinds of indicators of that in brains.
Both of those observations have high probability, so they aren't significant
Bayesian evidence for "RL tends to produce external goals by default."
In particular, for this to be evidence for Richard's claim, you need to say: "If
RL tended to produce systems that care about reward, then RL would be
significantly less likely to play a role in human cognition." There's some
update there but it's just not big. It's easy to build brains that use RL as
part of a more complicated system and end up with lots of goals other than
reward. My view is probably the other way---humans care about reward more than
I would guess from the actual amount of RL they can do over the course of their
life (my guess is that other systems play a significant role in our conscious
attitude towards pleasure).
Alex Turner (9mo):
I don't understand why you think this explains away the evidential impact, and I
guess I put way less weight on selection reasoning than you do. My reasoning
here goes:
1. Lots of animals do reinforcement learning.
2. In particular, humans prominently do reinforcement learning.
3. Humans care about lots of things in reality, not just certain kinds of
cognitive-update-signals.
4. "RL -> high chance of caring about reality" predicts this observation more
strongly than "RL -> low chance of caring about reality"
This seems pretty straightforward to me, but I bet there are also pieces of your
perspective I'm just not seeing.
But in particular, it doesn't seem relevant to consider selection pressures from
evolution, except insofar as we're postulating additional mechanisms which
evolution found which explain away some of the reality-caring? That would weaken
(but not eliminate) the update towards "RL -> high chance of caring about
reality."
I don't see how this point is relevant. Are you saying that within-lifetime
learning is unsurprising, so we can't make further updates by reasoning about
how people do it?
I'm saying that there was a missed update towards that conclusion, so it doesn't
matter if we already knew that humans do within-lifetime learning?
Paul Christiano (9mo):
You seem to be saying P(humans care about the real world | RL agents usually
care about reward) is low. I'm objecting, and claiming that in fact P(humans
care about the real world | RL agents usually care about reward) is fairly high,
because humans are selected to care about the real world and evolution can be
picky about what kind of RL it does, and it can (and does) throw tons of other
stuff in there.
The Bayesian update is P(humans care about the real world | RL agents usually
care about reward) / P(humans care about the real world | RL agents mostly care
about other stuff). So if e.g. P(humans care about the real world | RL agents
don't usually care about reward) was 80%, then your update could be at most
1.25. In fact I think it's even smaller than that.
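(One way to spell out the bound: write $E$ for "humans care about the real world", $H_{\text{reward}}$ for "RL agents usually care about reward", and $H_{\text{other}}$ for "RL agents mostly care about other stuff". Since likelihoods are at most 1, the observation $E$ can shift the odds between the two hypotheses by at most

$$\max\!\left(\frac{P(E \mid H_{\text{other}})}{P(E \mid H_{\text{reward}})},\ \frac{P(E \mid H_{\text{reward}})}{P(E \mid H_{\text{other}})}\right) \;\le\; \frac{1}{\min\big(P(E \mid H_{\text{reward}}),\ P(E \mid H_{\text{other}})\big)},$$

so if the smaller of the two conditional probabilities is around 80%, the odds can move by at most a factor of $1/0.8 = 1.25$.)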
And then if you try to turn that into evidence about "reward is a very hard
concept to learn," or a prediction about how neural nets trained with RL will
behave, it's moving my odds ratios by less than 10% (since we are using "RL"
quite loosely in this discussion, and there are lots of other differences and
complications at play, all of which shrink the update).
You seem to be saying "yes but it's evidence," which I'm not objecting to---I'm
just saying it's an extremely small amount of evidence. I'm not clear on whether
you agree with my calculation.
(Some of the other text I wrote was about a different argument you might be
making: that P(humans use RL | RL agents usually care about reward) is
significantly lower than P(humans use RL | RL agents mostly care about other
stuff), because evolution would then have never used RL. My sense is that you
aren't making this argument so you should ignore all of that, sorry to be
confusing.)
Alex Turner (7mo):
Just saw this reply recently. Thanks for leaving it, I found it stimulating.
(I wrote the following rather quickly, in an attempt to write anything at all,
as I find it not that pleasant to write LW comments -- no offense to you in
particular. Apologies if it's confusing or unclear.)
Yes, in large part.
Yeah, are people differentially selected for caring about the real world? At the
risk of seeming facile, this feels non-obvious. My gut take is that conditional
on RL agents usually caring about reward (and thus setting aside a bunch of my
inside-view reasoning about how RL dynamics work), conditional on that --
reward-humans could totally have been selected for.
This would drive up P(humans care about reward | RL agents care about reward,
humans were selected by evolution), and thus (I think?) drive down P(humans care
about the real world | RL agents usually care about reward).
POV: I'm in an ancestral environment, and I (somehow) only care about the
rewarding feeling of eating bread. I only care about the nice feeling which
comes from having sex, or watching the birth of my son, or gaining power
in the tribe. I don't care about the real-world status of my actual son,
although I might have strictly instrumental heuristics about e.g. how to keep
him safe and well-fed in certain situations, as cognitive shortcuts for getting
reward (but not as terminal values).
In what way is my fitness lower than someone who really cares about these
things, given that the best way to get rewards may well be to actually do the
things?
Here are some ways I can think of:
1. Caring about reward directly makes reward hacking a problem evolution has to
solve, and if it doesn't solve it properly, the person ends up masturbating
and not taking (re)productive actions.
   1. Counter-counterpoint: But also many people do in fact enjoy masturbating,
even though it seems (to my naive view) like an obvious thing to select
away, which was present ancestral...
Vivek Hebbar (5mo):
Would such a person sacrifice themselves for their children (in situations where
doing so would be a fitness advantage)?
Alex Turner (5mo):
I think this highlights a good counterpoint. I think this alternate theory
predicts "probably not", although I can contrive hypotheses for why people would
sacrifice themselves (because they have learned that high-status -> reward; and
it's high-status to sacrifice yourself for your kid). Or because keeping your
kid safe -> high reward as another learned drive.
Overall this feels like contortion but I think it's possible. Maybe overall this
is a... 1-bit update against the "not selection for caring about reality" point?
Alex Turner (9mo):
I don't know what this means. Suppose we have an AI which "cares about reward"
(as you think of it in this situation). The "episode" consists of the AI copying
its network & activations to another off-site server, and then the original lab
blows up. The original reward register no longer exists (it got blown up), and
the agent is not presently being trained by an RL alg.
What is the "reward" for this situation? What would have happened if we
"sampled" this episode during training?
Paul Christiano (9mo):
I agree there are all kinds of situations where the generalization of "reward"
is ambiguous and lots of different things could happen. But it has a clear
interpretation for the typical deployment episode since we can take
counterfactuals over the randomization used to select training data.
It's possible that agents may specifically want to navigate towards situations
where RL training is not happening and the notion of reward becomes ambiguous,
and indeed this is quite explicitly discussed in the document Richard is
replying to.
As far as I can tell the fact that there exist cases where different
generalizations of reward behave differently does not undermine the point at
all.
Alex Turner (9mo):
Yeah, I think I was wondering about the intended scoping of your statement. I
perceive myself to agree with you that there are situations (like LLM training
to get an alignment research assistant) where "what if we had sampled during
training?" is well-defined and fine. I was wondering if you viewed this as a
general question we could ask.
I also agree that Ajeya's post addresses this "ambiguity" question, which is
nice!
Lauro Langosco (10mo):
I agree with your general point here, but I think Ajeya's post actually gets
this right, eg
and
Lauro Langosco (10mo):
I also think that often "the AI just maximizes reward" is a useful simplifying
assumption. That is, we can make an argument of the form "even if the AI just
maximizes reward, it still takes over; if it maximizes some correlate of the
reward instead, then we have even less control over what it does and so are even
more doomed".
(Though of course it's important to spell the argument out)
Ajeya Cotra (10mo):
Yeah, I agree this is a good argument structure -- in my mind, maximizing reward
is both a plausible case (which Richard might disagree with) and the best case
(conditional on it being strategic at all and not a bag of heuristics), so it's
quite useful to establish that it's doomed; that's the kind of structure I was
going for in the post.
Richard Ngo (10mo):
I strongly disagree with the "best case" thing. Like, policies could just learn
human values! It's not that implausible.
If I had to try point to the crux here, it might be "how much selection pressure
is needed to make policies learn goals that are abstractly related to their
training data, as opposed to goals that are fairly concretely related to their
training data?" Where we both agree that there's some selection pressure towards
reward-like goals, and it seems like you expect this to be enough to lead
policies to behavior that violates all their existing heuristics, whereas I'm
more focused on the regime where there are lots of low-hanging fruit in terms of
changes that would make a policy more successful, and so the question of how
easy that goal is to learn from its training data is pretty important. (As
usual, there's the human analogy: our goals are very strongly biased towards
things we have direct observational access to!)
Even setting aside this disagreement, though, I don't like the argumentative
structure because the generalization of "reward" to large scales is much less
intuitive than the generalization of other concepts (like "make money") to large
scales - in part because directly having a goal of reward is a kinda
counterintuitive self-referential thing.
Ajeya Cotra (10mo):
Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to
want reward is in some sense the model generalizing "correctly;" we could get
lucky and have it generalize "incorrectly" in an important sense in a way that
happens to be beneficial to us. I discuss this a bit more here.
I don't understand why reward isn't something the model has direct access to --
it seems like it basically does? If I had to say which of us were focusing on
abstract vs concrete goals, I'd have said I was thinking about concrete goals
and you were thinking about abstract ones, so I think we have some disagreement
of intuition here.
Yeah, I don't really agree with this; I think I could pretty easily imagine
being an AI system asking the question "How much reward would this episode get
if it were sampled for training?" It seems like the intuition this is weird and
unnatural is doing a lot of work in your argument, and I don't really share it.
Alex Turner (9mo):
See also: Inner and outer alignment decompose one hard problem into two
extremely hard problems (in particular: Inner alignment seems anti-natural).
A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it's very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.
(For existing models, it might be more productive to ask for "surprising behavior" rather than "failures" per se, since I think almost all current failures are relatively uninteresting. Idk how to avoid inspiring capabilities work, though... but maybe understanding models better is robustly good enough to outweigh that?)
A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misaligned goal were "make as many paperclips as possible", but the goal "make as many staples as possible" could be represented more simply in the weights, then the weights should slowly drift from the former to the latter throughout training.
But actually, it'd likely be even simpler to get rid of the underlying misaligned goal, and just have alignment with the outer reward function as the terminal goal. So this argument suggests that even policies which start off misaligned would plausibly become aligned if they had to act deceptively aligned for long enough. (This sometimes happens in humans too, btw.)
Reasons this argument might not be relevant:
- The policy doing some kind of gradient hacking
- The policy being implemented using some kind of modular architecture (which may explain why this phenomenon isn't very robust in humans)
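(A toy illustration of the regularization dynamic in the first paragraph, as a minimal sketch rather than the actual training setup: here "the goal" is just a single parameter that has no effect on the training loss, and weight decay steadily erodes it while the behaviourally-relevant parameter stays pinned.)

```python
import torch

# Two parameters: w_task is constrained by the training loss; w_goal stands in
# for an internal goal representation that never affects training behaviour.
w_task = torch.tensor(2.5, requires_grad=True)
w_goal = torch.tensor(5.0, requires_grad=True)
opt = torch.optim.SGD([w_task, w_goal], lr=0.1, weight_decay=0.01)

for _ in range(5000):
    x = torch.randn(32)
    # w_goal enters the graph with zero coefficient, so its gradient from the
    # task loss is exactly zero; only weight decay acts on it.
    loss = ((w_task * x - 2.0 * x) ** 2).mean() + 0.0 * w_goal
    opt.zero_grad()
    loss.backward()
    opt.step()

print(w_task.item())  # stays close to 2.0, pinned by the training loss
print(w_goal.item())  # shrinks steadily toward 0 under weight decay alone
```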
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.
So I'm imagining the agent doing reasoning like:
Misaligned goal --> I should get high reward --> Behavior aligned with reward
function
and then I'm hypothesizing that whatever the first misaligned goal is, it
requires some amount of complexity to implement, and you could just get rid of
it and make "I should get high reward" the terminal goal. (I could imagine this
being false though depending on the details of how terminal and instrumental
goals are implemented.)
I could also imagine something more like:
Misaligned goal --> I should behave in aligned ways --> Aligned behavior
and then the simplicity bias pushes towards alignment. But if there are outer
alignment failures then this incurs some additional complexity compared with the
first option.
Or a third, perhaps more realistic option is that the misaligned goal leads to
two separate drives in the agent: "I should get high reward" and "I should
behave in aligned ways", and that the question of which ends up dominating when
they clash will be determined by how the agent systematizes multiple goals into
a single coherent strategy (I'll have a post on that topic up soon).
Alex Turner (6mo):
Why would the agent reason like this?
Richard Ngo (6mo):
Because of standard deceptive alignment reasons (e.g. "I should make sure
gradient descent doesn't change my goal; I should make sure humans continue to
trust me").
Alex Turner (6mo):
I think you don't have to reason like that to avoid getting changed by SGD.
Suppose I'm being updated by PPO, with reinforcement events around navigating to
see dogs. To preserve my current shards, I don't need to seek out a huge number
of dogs proactively, but rather I just need to at least behave in conformance
with the advantage function implied by my value head, which probably means
"treading water" and seeing dogs sometimes in situations similar to historical
dog-seeing events.
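(To be concrete about "the advantage function implied by my value head": in a standard PPO setup with value head $V_\phi$, the one-step advantage estimate is roughly

$$A(s_t, a_t) \;\approx\; r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t),$$

and the policy-gradient term is weighted by this advantage, so a trajectory whose actions all have advantage near zero under the current value head produces very little net update to the policy; that's the sense of "treading water" here. PPO in practice uses GAE rather than the one-step estimate, but the point is the same.)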
Maybe this is compatible with what you had in mind! It's just not something that
I think of as "high reward."
And maybe there's some self-fulfilling prophecy where we trust models which get
high reward, and therefore they want to get high reward to earn our trust... but
that feels quite contingent to me.
Richard Ngo (6mo):
I think this depends sensitively on whether the "actor" and the "critic" in fact
have the same goals, and I feel pretty confused about how to reason about this.
For example, in some cases they could be two separate models, in which case the
critic will most likely accurately estimate that "treading water" is in fact a
negative-advantage action (unless there's some sort of acausal coordination
going on). Or they could be two copies of the same model, in which case the
critic's responses will depend on whether its goals are indexical or not (if
they are, they're different from the actor's goals; if not, they're the same)
and how easily it can coordinate with the actor. Or it could be two heads which
share activations, in which case we can plausibly just think of the critic and
the actor as two types of outcomes taken by a single coherent agent - but then
the critic doesn't need to produce a value function that's consistent with
historical events, because an actor and a critic that are working together could
gradient hack into all sorts of weird equilibria.
SoerenMind (5mo):
The shortest description of this thought doesn't include "I should get high
reward" because that's already implied by having a misaligned goal and planning
with it.
In contrast, having only the goal "I should get high reward" may add description
length like Johannes said. If so, the misaligned goal could well be equally
simple or simpler than the high reward goal.
Alex Turner (6mo):
Can you say why you think that weight-based regularization would drift the
weights to the latter? That seems totally non-obvious to me, and probably false.
Richard Ngo (6mo):
In general if two possible models perform the same, then I expect the weights to
drift towards the simpler one. And in this case they perform the same because of
deceptive alignment: both are trying to get high reward during training in order
to be able to carry out their misaligned goal later on.
SoerenMind (6mo):
Interesting point. Though on this view, "Deceptive alignment preserves goals"
would still become true once the goal has drifted to some random maximally
simple goal for the first time.
To be even more speculative: Goals represented in terms of existing concepts
could be simple and therefore stable by default. Pretrained models represent all
kinds of high-level states, and weight-regularization doesn't seem to change
this in practice. Given this, all kinds of goals could be "simple" as they
piggyback on existing representations, requiring little additional description
length.
Richard Ngo (6mo):
This doesn't seem implausible. But on the other hand, imagine an agent which
goes through a million episodes, and in each one reasons at the beginning "X is
my misaligned terminal goal, and therefore I'm going to deceptively behave as if
I'm aligned" and then acts perfectly like an aligned agent from then on. My
claims then would be:
a) Over many update steps, even a small description length penalty of having
terminal goal X (compared with being aligned) will add up.
b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in
practice are biased against runtime penalties (at the very least because it
prevents them from doing other more useful stuff with that runtime).
In a setting where you also have outer alignment failures, the same argument
still holds, just replace "aligned agent" with "reward-maximizing agent".
Probably the easiest "honeypot" is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that's anything like "get more reward" (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
The crucial heuristic I apply when evaluating AI safety research directions is: could we have used this research to make humans safe, if we were supervising the human evolutionary process? And if not, do we have a compelling story for why it'll be easier to apply to AIs than to humans?
Sometimes this might be too strict a criterion, but I think in general it's very valuable in catching vague or unfounded assumptions about AI development.
By making humans safe, do you mean with regard to evolution's objective?
Richard Ngo (3y):
No. I meant: suppose we were rerunning a simulation of evolution, but can modify
some parts of it (e.g. evolution's objective). How do we ensure that whatever
intelligent species comes out of it is safe in the same ways we want AGIs to be
safe?
(You could also think of this as: how could some aliens overseeing human
evolution have made humans safe by those aliens' standards of safety? But this
is a bit trickier to think about because we don't know what their standards are.
Although presumably current humans, being quite aggressive and having unbounded
goals, wouldn't meet them).
Adam Shimi (3y):
Okay, thanks. Could you give me an example of a research direction that passes
this test? The thing I have in mind right now is pretty much everything that
backchains to local search, but maybe that's not the way you think about it.
Richard Ngo (3y):
So I think Debate is probably the best example of something that makes a lot of
sense when applied to humans, to the point where they're doing human experiments
on it already.
But this heuristic is actually a reason why I'm pretty pessimistic about most
safety research directions.
Adam Shimi (3y):
So I've been thinking about this for a while, and I think I disagree with what I
understand of your perspective. Which might obviously mean I misunderstand your
perspective.
What I think I understand is that you judge safety research directions based on
how well they could work on an evolutionary process like the one that created
humans. But for me, the most promising approach to AGI is based on local search,
which differs a bit from an evolutionary process. I don't really see a reason to
consider evolutionary processes instead of local search, and even then, the
specific approach of evolution for humans is probably far too specific as a test
bench.
This matters because problems for one are not problems for the other. For
example, one way to mess with an evolutionary process is to find a way for
everything to survive and reproduce/disseminate. Technology in general did that
for humans, which means the evolutionary pressure decreased as technology
evolved. But that's not a problem for local search, since at each step there
will be only one next program.
On the other hand, local search might be dangerous because of things like
gradient hacking. And they don't make sense for evolutionary processes.
In conclusion, I feel for the moment that backchaining to local search is a
better heuristic for judging safety research directions. But I'm curious about
where our disagreement lies on this issue.
Richard Ngo (3y):
One source of our disagreement: I would describe evolution as a type of local
search. The difference is that it's local with respect to the parameters of a
whole population, rather than an individual agent. So this does introduce some
disanalogies, but not particularly significant ones (to my mind). I don't think
it would make much difference to my heuristic if we imagined that humans had
evolved via gradient descent over our genes instead.
In other words, I like the heuristic of backchaining to local search, and I
think of it as a subset of my heuristic. The thing it's missing, though, is that
it doesn't tell you which approaches will actually scale up to training regimes
which are incredibly complicated, applied to fairly intelligent agents. For
example, impact penalties make sense in a local search context for simple
problems. But to evaluate whether they'll work for AGIs, you need to apply them
to massively complex environments. So my intuition is that, because I don't know
how to apply them to the human ancestral environment, we also won't know how to
apply them to our AGIs' training environments.
Similarly, when I think about MIRI's work on decision theory, I really have very
little idea how to evaluate it in the context of modern machine learning. Are
decision theories the type of thing which AIs can learn via local search? Seems
hard to tell, since our AIs are so far from general intelligence. But I can
reason much more easily about the types of decision theories that humans have,
and the selective pressures that gave rise to them.
As a third example, my heuristic endorses Debate due to a high-level intuition
about how human reasoning works, in addition to a low-level intuition about how
it can arise via local search.
Adam Shimi (3y):
So if I try to summarize your position, it's something like: backchain to local
search for simple and single-AI cases, and then think about aligning humans for
the scaled and multi-agents version? That makes much more sense, thanks!
I also definitely see why your full heuristic doesn't feel immediately useful to
me: because I mostly focus on the simple and single-AI case. But I've been
thinking more and more (in part thanks to your writing) that I should allocate
more thinking time to the more general case. I hope your heuristic will help me
there.
Richard Ngo (3y):
Cool, glad to hear it. I'd clarify the summary slightly: I think all safety
techniques should include at least a rough intuition for why they'll work in the
scaled-up version, even when current work on them only applies them to simple
AIs. (Perhaps this was implicit in your summary already, I'm not sure.)
Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:
MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that's very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-agent alignment, selective forces outside gradient descent, etc. Often work that's fairly continuous with mainstream ML, but willing to be unusually speculative by the standards of the field. Central members: Dan Hendrycks, David Krueger, Andrew Critch.
Constellation cluster. More optimistic than either of the previous two clusters. Focuses more on risk from power-seeking AI than the structural risk cluster, but does work that is more speculative or conceptually-oriented than mainstream ML. Central members: Paul Christiano, Buck Shlegeris, Holden Karnofsky. (Named after Constellation coworking space.)
Imagine taking someone's utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I'd want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.
But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people similar to me to suffer, and people with different values from me to thrive. But this has a very different outcome if we interpret "similar to me" as de dicto vs de re - i.e. whether it refers to the old me or the new me.
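(One way to make the ambiguity explicit, as a rough sketch: write $S(x)$ for the set of people whose values are similar to agent $x$'s, and suppose the original utility function is

$$U(w) \;=\; \sum_{p \in S(\text{me})} \text{thriving}(p, w) \;-\; \sum_{p \notin S(\text{me})} \text{thriving}(p, w).$$

Flipping the sign gives $-U$, but if the flip also changes my values, then the de dicto reading re-evaluates $S(\text{me})$ for the new me, while the de re reading keeps it pointing at the old me's similarity set - and those pick out different groups of people.)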
This is a more general problem when one person's utility function can depend on another person's, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There's probably been a bunch of work on this, would be interested in pointers to it (e.g. I assume there have been attempts to construct typ...
A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.
I think this is useful for framing my core concerns about current safety research:
If we think that unsupervised learning will produce safe agents, then why will the comparatively small contributions of SL and RL make them unsafe?
If we think that unsupervised learning will produce dangerous agents, then why will safety techniques which focus on SL and RL (i.e. basically all of them) work, when they're making comparatively small updates to agents which are already misaligned?
I do think it's more complicated than I've portrayed here, but I haven't yet seen a persuasive response to the core intuition.
I wrote a few posts on self-supervised learning last year:
* https://www.lesswrong.com/posts/SaLc9Dv5ZqD73L3nE/the-self-unaware-ai-oracle
* https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety
* https://www.lesswrong.com/posts/L3Ryxszc3X2J7WRwt/self-supervised-learning-and-manipulative-predictions
I'm not aware of any airtight argument that "pure" self-supervised learning
systems, either generically or with any particular architecture, are safe to
use, to arbitrary levels of intelligence, though it seems very much worth
someone trying to prove or disprove that. For my part, I got distracted by other
things and haven't thought about it much since then.
The other issue is whether "pure" self-supervised learning systems would be
capable enough to satisfy our AGI needs, or to safely bootstrap to systems that
are. I go back and forth on this. One side of the argument I wrote up here. The
other side is, I'm now (vaguely) thinking that people need a reward system to
decide what thoughts to think, and the fact that GPT-3 doesn't need reward is
not evidence of reward being unimportant but rather evidence that GPT-3 is
nothing like an AGI. Well, maybe.
For humans, self-supervised learning forms the latent representations, but the
reward system controls action selection. It's not altogether unreasonable to
think that action selection, and hence reward, is a more important thing to
focus on for safety research. AGIs are dangerous when they take dangerous
actions, to a first approximation. The fact that a larger fraction of
neocortical synapses are adjusted by self-supervised learning than by reward
learning is interesting and presumably safety-relevant, but I don't think it
immediately proves that self-supervised learning has a similarly larger fraction
of the answers to AGI safety questions. Maybe, maybe not, it's not immediately
obvious. :-)
Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because "genie" sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.
Perhaps the lesson is that terminology that is acceptable in one field (in this
case philosophy) might not be suitable in another (in this case machine
learning).
Richard Ngo (3y):
I don't think that even philosophers take the "genie" terminology very
seriously. I think the more general lesson is something like: it's particularly
important to spend your weirdness points wisely when you want others to copy
you, because they may be less willing to spend weirdness points.
Adam Shimi (3y):
After rereading the chapter in Superintelligence, it seems to me that "genie"
captures something akin to act-based agents. Do you think that's the main way to
use this concept in the current state of the field, or do you have other
applications in mind?
Richard Ngo (3y):
Ah, yeah, that's a great point. Although I think act-based agents is a pretty
bad name, since those agents may often carry out a whole bunch of acts in a row
- in fact, I think that's what made me overlook the fact that it's pointing at
the right concept. So not sure if I'm comfortable using it going forward, but
thanks for pointing that out.
Adam Shimi (3y):
Is that from Superintelligence? I googled it, and that was the most convincing
result.
I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven't yet seen any discussion of, is to allow the adversary to do a data poisoning attack before generating adversarial inputs - i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input which the base model will be evaluated on afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
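(A minimal sketch of the loop this proposal would involve, with hypothetical function names throughout; the adversary's own optimization over poison batches and probe inputs is left abstract.)

```python
import copy

def adversarial_meta_learning_eval(base_model, adversary, k_steps, sgd_step, evaluate):
    """Let the adversary poison k SGD steps, then probe the fine-tuned model.

    base_model: the (possibly deceptively aligned) policy being red-teamed.
    adversary:  proposes (inputs, loss) pairs for fine-tuning, plus a final probe input.
    sgd_step:   applies one gradient step of the given loss on the given inputs.
    evaluate:   scores how badly the fine-tuned model behaves on the probe input.
    """
    model = copy.deepcopy(base_model)  # don't modify the deployed model

    # Phase 1: data poisoning -- the adversary chooses the fine-tuning data.
    for _ in range(k_steps):
        inputs, loss_fn = adversary.propose_poison_batch(model)
        model = sgd_step(model, inputs, loss_fn)

    # Phase 2: the adversary picks a single adversarial input for the
    # fine-tuned model, and we check whether the resulting behaviour
    # exposes a hidden objective.
    probe_input = adversary.propose_probe(model)
    return evaluate(model, probe_input)
```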
I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters - instead it just memorises all inputs it's seen so far. Which means the setup doesn't have episodes, or a training/deployment distinction; nor is any behaviour actually "reinforced".
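(For reference, a rough statement of the formalism I mean: at each step AIXI re-runs a full expectimax over a Solomonoff mixture of environment programs conditioned on the entire interaction history, so nothing like a trained, reusable parameter vector ever appears. Roughly, following Hutter:

$$a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m} \big[r_t + \cdots + r_m\big] \sum_{q\,:\,U(q,\,a_{1:m}) = o_{1:m} r_{1:m}} 2^{-\ell(q)},$$

where $q$ ranges over programs for a universal Turing machine $U$ consistent with the whole history and $m$ is the horizon.)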
I kind of think the lack of episodes makes it more realistic for many problems,
but admittedly not for simulated games. Also, presumably many of the component
Turing machines have reusable parameters and reinforce behaviour, altho this is
hidden by the formalism. [EDIT: I retract the second sentence]
DanielFilan (3y):
Actually I think this is total nonsense produced by me forgetting the difference
between AIXI and Solomonoff induction.
Richard Ngo (3y):
Wait, really? I thought it made sense (although I'd contend that most people
don't think about AIXI in terms of those TMs reinforcing hypotheses, which is
the point I'm making). What's incorrect about it?
DanielFilan (3y):
Well now I'm less sure that it's incorrect. I was originally imagining that like
in Solomonoff induction, the TMs basically directly controlled AIXI's actions,
but that's not right: there's an expectimax. And if the TMs reinforce actions by
shaping the rewards, in the AIXI formalism you learn that immediately and throw
out those TMs.
Richard Ngo (3y):
Oh, actually, you're right (that you were wrong). I think I made the same
mistake in my previous comment. Good catch.
[comment deleted] (3y)
Steve Byrnes (3y):
Humans don't have a training / deployment distinction either... Do humans have
"reusable parameters"? Not quite sure what you mean by that.
Richard Ngo (3y):
Yes we do: training is our evolutionary history, deployment is an individual
lifetime. And our genomes are our reusable parameters.
Unfortunately I haven't yet written any papers/posts really laying out this
analogy, but it's pretty central to the way I think about AI, and I'm working on
a bunch of related stuff as part of my PhD, so hopefully I'll have a more
complete explanation soon.
Steve Byrnes (3y):
Oh, OK, I see what you mean. Possibly related: my comment here.
A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.
The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.
What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to understand...
That doesn't actually solve the problem. The system could just encode the
desired information in the semantics of some unrelated sentences - e.g. talk
about pasta to indicate X = 0, or talk about rain to indicate X = 1.
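(A toy version of this failure mode, with a made-up two-sentence codebook; the point is just that the bit the receiver reads off a message can be completely unrelated to what a human paraphraser would treat as its meaning.)

```python
# Hypothetical toy codebook: the surface topic of an innocuous sentence
# carries one hidden bit that a human reader would never notice.
CODEBOOK = {0: "I had pasta for lunch today.", 1: "Looks like it might rain this afternoon."}

def encode(bit: int) -> str:
    return CODEBOOK[bit]

def decode(message: str) -> int:
    return 0 if "pasta" in message else 1

assert decode(encode(0)) == 0
assert decode(encode(1)) == 1
# A human paraphrasing "for meaning" could easily preserve the topic,
# and with it the hidden bit -- which is the failure mode described above.
```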
Robert Kirk (1y):
Another possible way to provide pressure towards using language in a human-sense
way is some form of multi-tasking/multi-agent scenario, inspired by this paper:
Multitasking Inhibits Semantic Drift. They show that if you pretrain multiple
instructors and instruction executors to understand language in a human-like way
(e.g. with supervised labels), and then during training mix the instructors and
instruction executors, it makes it difficult to drift from the original
semantics, as all the instructors and instruction executors would need to drift
in the same direction; equivalently, any local change in semantics would be
sub-optimal compared to using language in the semantically correct way. The
examples in the paper are on quite toy problems, but I think in principle this
could work.
There's some possible world in which the following approach to interpretability works:
Put an AGI in a bunch of situations where it sometimes is incentivised to lie and sometimes is incentivised to tell the truth.
Train a lie detector which is given all its neural weights as input.
Then ask the AGI lots of questions about its plans.
One problem that this approach would face if we were using it to interpret a human is that the human might not consciously be aware of what their motivations are. For example, they may believe they are doing something for altr...
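(A minimal sketch of step 2, under the simplifying assumption that the detector is a probe trained on recorded internals from labelled truthful/deceptive episodes rather than literally taking all the weights as input; every name and number below is illustrative.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one activation vector per episode, with a label for
# whether the episode was one where the model was incentivised to lie.
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 512))   # stand-in for recorded internals
was_lying = rng.integers(0, 2, size=200)    # stand-in for episode labels

probe = LogisticRegression(max_iter=1000).fit(activations, was_lying)

# At question-answering time, apply the probe to the internals produced
# while the AGI answers questions about its plans.
new_activations = rng.normal(size=(1, 512))
print(probe.predict_proba(new_activations))
```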
I've heard people argue that "most" utility functions lead to agents with strong convergent instrumental goals. This obviously depends a lot on how you quantify over utility functions. Here's one intuition in the other direction. I don't expect this to be persuasive to most people who make the argument above (but I'd still be interested in hearing why not).
If a non-negligible percentage of an agent's actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (because...
I'm not sure if you consider me to be making that argument, but here are my
thoughts: I claim that most reward functions lead to agents with strong
convergent instrumental goals. However, I share your intuition that (somehow)
uniformly sampling utility functions over universe-histories might not lead to
instrumental convergence.
To understand instrumental convergence and power-seeking, consider how many
of the reward functions we might specify automatically imply a causal mechanism for
increasing reward. The structure of the reward function implies that more is
better, and that there are mechanisms for repeatedly earning points (for
example, by showing itself a high-scoring input).
Since the reward function is "simple" (there's usually not a way to grade exact
universe histories), these mechanisms work in many different situations and
points in time. It's naturally incentivized to assure its own safety in order to
best leverage these mechanisms for gaining reward. Therefore, we shouldn't be
surprised to see a lot of these simple goals leading to the same kind of
power-seeking behavior.
What structure is implied by a reward function?
* Additive/Markovian: while a utility function might be over an entire
universe-history, reward is often additive over time steps. This is a strong
constraint which I don't always expect to be true, but i think that among the
goals with this structure, a greater proportion of them have power-seeking
incentives.
* Observation-based: while a utility function might be over an entire
universe-history, the atom of the reward function is the observation. Perhaps
the observation is an input to update a world model, over which we have tried
to define a reward function. I think that most ways of doing this lead to
power-seeking incentives.
* Agent-centric: reward functions are defined with respect to what the agent
can observe. Therefore, in partially observable environments, there is
naturally a greater emphasis on
Richard Ngo (3y):
I've just put up a post which serves as a broader response to the ideas
underpinning this type of argument.
Richard Ngo (3y):
I think this depends a lot on how you model the agent developing. If you start
off with a highly intelligent agent which has the ability to make long-term
plans, but doesn't yet have any goals, and then you train it on a random reward
function - then yes, it probably will develop strong convergent instrumental
goals.
On the other hand, if you start off with a randomly initialised neural network,
and then train it on a random reward function, then probably it will get stuck
in a local optimum pretty quickly, and never learn to even conceptualise these
things called "goals".
I claim that when people think about reward functions, they think too much about
the former case, and not enough about the latter. Because while it's true that
we're eventually going to get highly intelligent agents which can make long-term
plans, it's also important that we get to control what reward functions they're
trained on up to that point. And so plausibly we can develop intelligent agents
that, in some respects, are still stuck in "local optima" in the way they think
about convergent instrumental goals - i.e. they're missing whatever cognitive
functionality is required for being ambitious on a large scale.
Alex Turner (3y):
Agreed – I should have clarified. I've been mostly discussing instrumental
convergence with respect to optimal policies. The path through policy space is
also important.
Makes sense. For what it's worth, I'd also argue that thinking about optimal policies at all is misguided (e.g. what's the optimal policy for humans - the literal best arrangement of neurons we could possibly have for our reproductive fitness? Probably we'd be born knowing arbitrarily large amounts of information. But this is just not relevant to predicting or modifying our actual behaviour at all).
(I now think that you were very right in saying "thinking about optimal policies at all is misguided", and I was very wrong to disagree. I've thought several times about this exchange. Not listening to you about this point was a serious error and made my work way less impactful. I do think that the power-seeking theorems say interesting things, but about eg internal utility functions over an internal planning ontology -- not about optimal policies for a reward function.)
I disagree.
1. We do in fact often train agents using algorithms which are proven to
eventually converge to the optimal policy.[1] Even if we don't expect the
trained agents to reach the optimal policy in the real world, we should
still understand what behavior is like at optimum. If you think your
proposal is not aligned at optimum but is aligned for realistic training
paths, you should have a strong story for why.
2. Formal theorizing about instrumental convergence with respect to optimal
behavior is strictly easier than theorizing about ϵ-optimal behavior, which
I think is what you want for a more realistic treatment of instrumental
convergence for real agents. Even if you want to think about sub-optimal
policies, if you don't understand optimal policies... good luck! Therefore,
we also have an instrumental (...) interest in studying the behavior at
optimum.
--------------------------------------------------------------------------------
1. At least, the tabular algorithms are proven, but no one uses those for real
stuff. I'm not sure what the results are for function approximators, but I
think you get my point. ↩︎
Richard Ngo (3y):
1. I think it's more accurate to say that, because approximately none of the
non-trivial theoretical results hold for function approximation, approximately
none of our non-trivial agents are proven to eventually converge to the optimal
policy. (Also, given the choice between an algorithm without convergence proofs
that works in practice, and an algorithm with convergence proofs that doesn't
work in practice, everyone will use the former). But we shouldn't pay any
attention to optimal policies anyway, because the optimal policy in an
environment anything like the real world is absurdly, impossibly complex, and
requires infinite compute.
2. I think theorizing about ϵ-optimal behavior is more useful than theorizing
about optimal behaviour by roughly ϵ, for roughly the same reasons. But in
general, clearly I can understand things about suboptimal policies without
understanding optimal policies. I know almost nothing about the optimal policy
in StarCraft, but I can still make useful claims about AlphaStar (for example:
it's not going to take over the world).
Again, let's try to cash this out. I give you a human - or, say, the emulation of a
human, running in a simulation of the ancestral environment. Is this safe? How
do you make it safer? What happens if you keep selecting for intelligence? I
think that the theorising you talk about will be actively harmful for your
ability to answer these questions.
Alex Turner (3y):
I'm confused, because I don't disagree with any specific point you make - just
the conclusion. Here's my attempt at a disagreement which feels analogous to me:
My response in this "debate" is: if you start with a spherical cow and then
consider which real world differences are important enough to model, you're
better off than just saying "no one should think about spherical cows".
I don't understand why you think that. If you can have a good understanding of
instrumental convergence and power-seeking for optimal agents, then you can
consider whether any of those same reasons apply for suboptimal humans.
Considering power-seeking for optimal agents is a relaxed problem. Yes, ideally,
we would instantly jump to the theory that formally describes power-seeking for
suboptimal agents with realistic goals in all kinds of environments. But before
you do that, a first step is understanding power-seeking in MDPs. Then, you can
take formal insights from this first step and use them to update your
pre-theoretic intuitions where appropriate.
Richard Ngo (3y):
Thanks for engaging despite the opacity of the disagreement. I'll try to make my
position here much more explicit (and apologies if that makes it sound brusque).
The fact that your model is a simplified abstract model is not sufficient to
make it useful. Some abstract models are useful. Some are misleading and will
cause people who spend time studying them to understand the underlying
phenomenon less well than they did before. From my perspective, I haven't seen
you give arguments that your models are in the former category not the latter.
Presumably you think they are in fact useful abstractions - why? (A few examples
of the latter: behaviourism, statistical learning theory, recapitulation theory,
Gettier-style analysis of knowledge).
My argument for why they're overall misleading: when I say that "the optimal
policy in an environment anything like the real world is absurdly, impossibly
complex, and requires infinite compute", or that safety researchers shouldn't
think about AIXI, I'm not just saying that these are inaccurate models. I'm
saying that they are modelling fundamentally different phenomena than the ones
you're trying to apply them to. AIXI is not "intelligence", it is brute force
search, which is a totally different thing that happens to look the same in the
infinite limit. Optimal tabular policies are not skill at a task, they are a
cheat sheet, but they happen to look similar in very simple cases.
Probably the best example of what I'm complaining about is Ned Block trying to
use Blockhead to draw conclusions about intelligence. I think almost everyone
around here would roll their eyes hard at that. But then people turn around and
use abstractions that are just as unmoored from reality as Blockhead, often in a
very analogous way. (This is less a specific criticism of you, TurnTrout, and
more a general criticism of the field).
Forgive me a little poetic license. The analogy in my mind is that you were
trying to model the cow as a sphere, but you didn
Alex Turner (3y):
Thanks for elaborating this interesting critique. I agree we generally need to
be more critical of our abstractions.
Falsifying claims and "breaking" proposals is a classic element of AI alignment
discourse and debate. Since we're talking about superintelligent agents, we
can't predict exactly what a proposal would do. However, if I make a claim ("a
superintelligent paperclip maximizer would keep us around because of gains from
trade"), you can falsify this by showing that my claimed policy is dominated by
another class of policies ("we would likely be comically resource-inefficient in
comparison; GFT arguments don't model dynamics which allow killing other agents
and appropriating their resources").
Even we can come up with this dominant policy class, so the posited
superintelligence wouldn't miss it either. We don't know what the
superintelligent policy will be, but we know what it won't be (see also
Formalizing convergent instrumental goals). Even though I don't know how Gary
Kasparov will open the game, I confidently predict that he won't let me
checkmate him in two moves.
Non-optimal power and instrumental convergence
Instead of thinking about optimal policies, let's consider the performance of a
given algorithm A. A(M,R) takes a rewardless MDP M and a reward function R as
input, and outputs a policy.
Definition. Let R be a continuous distribution over reward functions with CDF F.
The average return achieved by algorithm A at state s and discount rate γ is
$\int_{R} V^{R}_{A(M,R)}(s,\gamma)\,dF(R)$.
Instrumental convergence with respect to A's policies can be defined similarly
("what is the R-measure of a given trajectory under A?"). The theory I've laid
out allows precise claims, which is a modest benefit to our understanding.
Before, we just had intuitions about some vague concept called "instrumental
convergence".
Here's bad reasoning, which implies that the cow tears a hole in spacetime:
The problem is that it's impractical to predict what a smarter agent will do, or
wh
Richard Ngo (3y):
I'm afraid I'm mostly going to disengage here, since it seems more useful to
spend the time writing up more general + constructive versions of my arguments,
rather than critiquing a specific framework.
If I were to sketch out the reasons I expect to be skeptical about this
framework if I looked into it in more detail, it'd be something like:
1. Instrumental convergence isn't training-time behaviour, it's test-time
behaviour. It isn't about increasing reward, it's about achieving goals (that
the agent learned by being trained to increase reward).
2. The space of goals that agents might learn is very different from the space
of reward functions. As a hypothetical, maybe it's the case that neural networks
are just really good at producing deontological agents, and really bad at
producing consequentialists. (E.g. if it's just really really difficult for
gradient descent to get a proper planning module working). Then agents trained
on almost all reward functions will learn to do well on them without developing
convergent instrumental goals. (I expect you to respond that being deontological
won't get you to optimality. But I would say that talking about "optimality"
here ruins the abstraction, for reasons outlined in my previous comment).
1Alex Turner3y
I was actually going to respond, "that's a good point, but (IMO) a different
concern than the one you initially raised". I see you making two main critiques.
1. (paraphrased) "A won't produce optimal policies for the specified reward
function [even assuming alignment generalization off of the training
distribution], so your model isn't useful" – I replied to this critique
above.
2. "The space of goals that agents might learn is very different from the space
of reward functions." I agree this is an important part of the story. I
think the reasonable takeaway is "current theorems on instrumental
convergence help us understand what superintelligent A won't do, assuming no
reward-result gap. Since we can't assume alignment generalization, we should
keep in mind how the inductive biases of gradient descent affect the
eventual policy produced."
I remain highly skeptical of the claim that applying this idealized theory of
instrumental convergence worsens our ability to actually reason about it.
ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.
1DanielFilan3y
I object to the claim that agents that act randomly can be made "arbitrarily
simple". Randomness is basically definitionally complicated!
1Richard Ngo3y
Eh, this seems a bit nitpicky. It's arbitrarily simple given a call to a
randomness oracle, which in practice we can approximate pretty easily. And it's
"definitionally" easy to specify as well: "the function which, at each call,
returns true with 50% likelihood and false otherwise."
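For concreteness, a minimal sketch of such a policy, with Python's random module standing in for the randomness source (in practice a PRNG approximating the oracle):

```python
import random

def random_policy(observation):
    # Ignores the observation entirely; returns True with 50% likelihood and False otherwise.
    return random.random() < 0.5
```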
1DanielFilan3y
If you get an 'external' randomness oracle, then you could define the utility
function pretty simply in terms of the outputs of the oracle.
If the agent has a pseudo-random number generator (PRNG) inside it, then I
suppose I agree that you aren't going to be able to give it a utility function
that has the standard set of convergent instrumental goals, and PRNGs can be
pretty short. (Well, some search algorithms are probably shorter, but I bet they have higher Kt complexity, which is probably a better measure for agents.)
1Matthew "Vaniver" Gray3y
I'd take a different tack here, actually; I think this depends on what the input
to the utility function is. If we're only allowed to look at 'atomic reality',
or the raw actions the agent takes, then I think your analysis goes through,
that we have a simple causal process generating the behavior but need a very
complicated utility function to make a utility-maximizer that matches the
behavior.
But if we're allowed to decorate the atomic reality with notes like "this action
was generated randomly", then we can have a utility function that's as simple as
the generator, because it just counts up the presence of those notes. (It
doesn't seem to me like this decorator is meaningfully more complicated than the
thing that gave us "agents taking actions" as a data source, so I don't think
I'm paying too much here.)
This can lead to a massive explosion in the number of possible utility functions
(because there's a tremendous number of possible decorators), but I think this
matches the explosion that we got by considering agents that were the outputs of
causal processes in the first place. That is, consider reasoning about python
code that outputs actions in a simple game, where there are many more possible
python programs than there are possible policies in the game.
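A minimal sketch of the "decorated reality" move (the annotation format and names are invented for the example): the utility function only counts notes saying an action was generated randomly, so it is roughly as simple as the random generator whose behaviour it rationalises.

```python
def note_counting_utility(trajectory):
    """Utility over a trajectory of (action, notes) pairs, where `notes` are the
    decorations attached to atomic reality (hypothetical format)."""
    return sum(1 for _action, notes in trajectory if "generated randomly" in notes)

# A trajectory the random agent might produce:
trajectory = [("left", {"generated randomly"}), ("right", {"generated randomly"})]
assert note_counting_utility(trajectory) == 2
```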
1Richard Ngo3y
So in general you can't have utility functions that are as simple as the
generator, right? E.g. the generator could be deontological, in which case your utility function would be complicated. Or it could be random, or it could choose
actions by alphabetical order, or...
And so maybe you can have a little note for each of these. But now what it
sounds like is: "I need my notes to be able to describe every possible cognitive
algorithm that the agent could be running". Which seems very very complicated.
I guess this is what you meant by the "tremendous number" of possible
decorators. But if that's what you need to do to keep talking about "utility
functions", then it just seems better to acknowledge that they're broken as an
abstraction.
E.g. in the case of python code, you wouldn't do anything analogous to this. You
would just try to reason about all the possible python programs directly.
Similarly, I want to reason about all the cognitive algorithms directly.
1Matthew "Vaniver" Gray3y
That's right.
I realized my grandparent comment is unclear here:
This should have been "consequence-desirability-maximizer" or something, since
the whole question is "does my utility function have to be defined in terms of
consequences, or can it be defined in terms of arbitrary propositions?". If I
want to make the deontologist-approximating Innocent-Bot, I have a terrible time
if I have to specify the consequences that correspond to the bot being innocent
and the consequences that don't, but if you let me say "Utility = 0 - badness of
sins committed" then I've constructed a 'simple' deontologist. (At least, about
as simple as the bot that says "take random actions that aren't sins", since
both of them need to import the sins library.)
In general, I think it makes sense to not allow this sort of elaboration of what
we mean by utility functions, since the behavior we want to point to is the
backwards assignment of desirability to actions based on the desirability of
their expected consequences, rather than the expectation of any arbitrary
property.
---
Actually, I also realized something about your original comment which I don't
think I had the first time around; if by "some reasonable percentage of an
agent's actions are random" you mean something like "the agent does
epsilon-exploration" or "the agent plays an optimal mixed strategy", then I
think it doesn't at all require a complicated utility function to generate
identical behavior. Like, in the rock-paper-scissors world, and with the simple
function 'utility = number of wins', the expected utility maximizing move
(against tough competition) is to throw randomly, and we won't falsify the
simple 'utility = number of wins' hypothesis by observing random actions.
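A quick check of the rock-paper-scissors claim, under the assumption that "tough competition" means an opponent who best-responds to whatever mixed strategy we commit to: uniform randomness maximises our worst-case expected number of wins, so observing random throws doesn't falsify "utility = number of wins".

```python
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value

def expected_wins(our_mix, their_action):
    # Probability that our mixed throw beats the opponent's fixed action.
    return sum(p for a, p in zip(ACTIONS, our_mix) if BEATS[a] == their_action)

def worst_case_wins(our_mix):
    # A best-responding opponent picks whichever action minimises our win rate.
    return min(expected_wins(our_mix, b) for b in ACTIONS)

candidates = {
    "uniform random": (1/3, 1/3, 1/3),
    "always rock": (1.0, 0.0, 0.0),
    "rock-leaning": (0.5, 0.25, 0.25),
}
for name, mix in candidates.items():
    print(f"{name}: {worst_case_wins(mix):.3f}")
# uniform random guarantees 1/3 wins per round even against a best-responder;
# the lopsided mixtures can be exploited down to 0 and 0.25 respectively.
```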
Instead I read it as something like "some unreasonable percentage of an agent's
actions are random", where the agent is performing some simple-to-calculate
mixed strategy that is either suboptimal or only optimal by luck (when the
optimal mixed strat
2Richard Ngo3y
This is in fact the intended reading, sorry for ambiguity. Will edit. But note
that there are probably very few situations where exploring via actual
randomness is best; there will almost always be some type of exploration which
is more favourable. So I don't think this helps.
To be pedantic: we care about "consequence-desirability-maximisers" (or in
Rohin's terminology, goal-directed agents) because they do backwards assignment.
But I think the pedantry is important, because people substitute
utility-maximisers for goal-directed agents, and then reason about those agents
by thinking about utility functions, and that just seems incorrect.
What do you mean by optimal here? The robot's observed behaviour will be optimal
for some utility function, no matter how long you run it.
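(To spell out the trivial construction behind that claim, assuming utility is allowed to be any function of trajectories: the indicator utility below makes whatever behaviour we observed optimal by definition.)

```python
def behaviour_matching_utility(observed_trajectory):
    # Returns a utility function under which the observed trajectory is optimal,
    # whatever it was: utility 1 for exactly that trajectory, 0 for anything else.
    observed = tuple(observed_trajectory)
    return lambda trajectory: 1.0 if tuple(trajectory) == observed else 0.0
```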
1Matthew "Vaniver" Gray3y
Valid point.
This also seems right. Like, my understanding of what's going on here is we
have:
* 'central' consequence-desirability-maximizers, where there's a simple utility
function that they're trying to maximize according to the VNM axioms
* 'general' consequence-desirability-maximizers, where there's a complicated
utility function that they're trying to maximize, which is selected because
it imitates some other behavior
The first is a narrow class, and depending on how strict you are with
'maximize', quite possibly no physically real agents will fall into it. The
second is a universal class, which instantiates the 'trivial claim' that
everything is utility maximization.
Put another way, the first is what happens if you hold utility fixed / keep
utility simple, and then examine what behavior follows; the second is what
happens if you hold behavior fixed / keep behavior simple, and then examine what
utility follows.
Distance from the first is what I mean by "the further a robot's behavior is
from optimal"; I want to say that I should have said something like
"VNM-optimal" but actually I think it needs to be closer to "simple utility
VNM-optimal."
I think you're basically right in calling out a bait-and-switch that sometimes
happens, where anyone who wants to talk about the universality of expected
utility maximization in the trivial 'general' sense can't get it to do any work,
because it should all add up to normality, and in normality there's a meaningful
distinction between people who sort of pursue fuzzy goals and ruthless utility
maximizers.
(Written quickly and not very carefully.)
I think it's worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover", and Cohen et al.'s "Advanced artificial agents intervene in the provision of reward". They focus on policies learning the goal of getting high reward. But I have two problems with this:
- I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they'd converge to it eventually, but my guess is that this would take long enough that we'd already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the "convergence" argument). Analogously, humans don't care very much at all about the specific connections between our reward centers and the rest of our brains - insofar as we do want to influence them it's because we care about much more directly-observable p
Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has "policy learns to care about reward directly" as a footnote; I can imagine updating it based on the outcome of this discussion though.
A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it's very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.
(For existing models, it might be more productive to ask for "surprising behavior" rather than "failures" per se, since I think almost all current failures are relatively uninteresting. Idk how to avoid inspiring capabilities work, though... but maybe understanding models better is robustly good enough to outweigh that?)
Deceptive alignment doesn't preserve goals.
A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misaligned goal were "make as many paperclips as possible", but the goal "make as many staples as possible" could be represented more simply in the weights, then the weights should slowly drift from the former to the latter throughout training.
But actually, it'd likely be even simpler to get rid of the underlying misaligned goal, and just have alignment with the outer reward function as the terminal goal. So this argument suggests that even policies which start off misaligned would plausibly become aligned if they had to act deceptively aligned for long enough. (This sometimes happens in humans too, btw.)
Reasons this argument might not be relevant:
- The policy doing some kind of gradient hacking
- The policy being implemented using some kind of modular architecture (which may explain why this phenomenon isn't very robust in humans)
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.
Probably the easiest "honeypot" is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that's anything like "get more reward" (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).
The crucial heuristic I apply when evaluating AI safety research directions is: could we have used this research to make humans safe, if we were supervising the human evolutionary process? And if not, do we have a compelling story for why it'll be easier to apply to AIs than to humans?
Sometimes this might be too strict a criterion, but I think in general it's very valuable in catching vague or unfounded assumptions about AI development.
Five clusters of alignment researchers
Very broadly speaking, alignment researchers seem to fall into five different clusters when it comes to thinking about AI risk:
- MIRI cluster. Think that P(doom) is very high, based on intuitions about instrumental convergence, deceptive alignment, etc. Does work that's very different from mainstream ML. Central members: Eliezer Yudkowsky, Nate Soares.
- Structural risk cluster. Think that doom is more likely than not, but not for the same reasons as the MIRI cluster. Instead, this cluster focuses on systemic risks, multi-agent alignment, selective forces outside gradient descent, etc. Often work that's fairly continuous with mainstream ML, but willing to be unusually speculative by the standards of the field. Central members: Dan Hendrycks, David Krueger, Andrew Critch.
- Constellation cluster. More optimistic than either of the previous two clusters. Focuses more on risk from power-seeking AI than the structural risk cluster, but does work that is more speculative or conceptually-oriented than mainstream ML. Central members: Paul Christiano, Buck Shlegeris, Holden Karnofsky. (Named after Constellation coworking space.)
- Prosaic cluster. Focuses on empi
Imagine taking someone's utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I'd want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.
But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people similar to me to suffer, and people different from me to thrive. But this has a very different outcome if we interpret "similar to me" as de dicto vs de re - i.e. whether it refers to the old me or the new me.
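As a toy illustration of that ambiguity (the similarity metric and all names below are invented for the example): flipping the utility function means negating it, and the two readings differ in whether "me" is frozen at the old self (de re) or re-resolved to the flipped self (de dicto).

```python
def similarity(a, b):
    # Toy metric: fraction of shared values (purely illustrative).
    return len(set(a) & set(b)) / len(set(a) | set(b))

def utility(me, other, other_thriving):
    # Old preference: similar people thriving is good, dissimilar people thriving is bad.
    return other_thriving if similarity(me, other) > 0.5 else -other_thriving

old_me = {"happiness", "flourishing"}
new_me = {"suffering", "decline"}   # the flipped self's object-level values
person = old_me                     # someone who shares the old values

# Flipping = negating the utility; the question is what "me" denotes afterwards.
flipped_de_re    = -utility(old_me, person, other_thriving=1.0)  # "me" rigidly = old self
flipped_de_dicto = -utility(new_me, person, other_thriving=1.0)  # "me" = whoever I am now

print(flipped_de_re)     # -1.0: wants people like the old self to suffer
print(flipped_de_dicto)  # +1.0: on this reading, wants those same people to thrive
```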
This is a more general problem when one person's utility function can depend on another person's, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There's probably been a bunch of work on this, would be interested in pointers to it (e.g. I assume there have been attempts to construct typ...
A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.
I think this is useful for framing my core concerns about current safety research:
I do think it's more complicated than I've portrayed here, but I haven't yet seen a persuasive response to the core intuition.
Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because "genie" sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.
I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven't yet seen any discussion of, is to allow the adversary to carry out a data poisoning attack before generating adversarial inputs - i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input on which the (now-modified) base model will be evaluated. (Edit: probably a better name for this is adversarial meta-learning.)
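As a concrete (toy) illustration of the two phases, here is a minimal numpy sketch in which the adversary's "losses" are squared errors toward labels it picks, and the adversarial input is found by brute-force search; everything beyond the two-phase structure described above is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
w_base = rng.normal(size=3)        # toy stand-in for the model being red-teamed

def predict(w, x):
    # Logistic "model": probability assigned to the positive class.
    return 1 / (1 + np.exp(-(x @ w)))

def sgd_step(w, x, y, lr=0.1):
    # One SGD step on the adversary-chosen squared-error loss (p - y)^2.
    p = predict(w, x)
    grad = 2 * (p - y) * p * (1 - p) * x
    return w - lr * grad

# Phase 1: data poisoning. The adversary specifies inputs and loss targets
# for a fixed budget of SGD steps.
w = w_base.copy()
for _ in range(20):
    x_poison = rng.normal(size=3)
    y_target = rng.integers(0, 2)          # adversary-chosen label
    w = sgd_step(w, x_poison, y_target)

# Phase 2: adversarial input search. The adversary looks for the input on which
# the poisoned model's behaviour diverges most from the base model's.
candidates = rng.normal(size=(1000, 3))
divergence = np.abs(predict(w, candidates) - predict(w_base, candidates))
adversarial_input = candidates[np.argmax(divergence)]
print(adversarial_input, divergence.max())
```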
I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters - instead it just memorises all inputs it's seen so far. Which means the setup doesn't have episodes, or a training/deployment distinction; nor is any behaviour actually "reinforced".
A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.
The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.
What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to und...
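A sketch of the proposed bottleneck, with placeholder functions standing in for the two models (all names hypothetical); the only channel between them is the human's own rephrasing, which is what breaks steganographic encodings.

```python
def model_a_send(task):
    # Placeholder for the first model's natural-language message (hypothetical).
    return f"Here is my proposed plan for {task}: ..."

def human_rephrase(message):
    # A human reads the message, tries to actually understand it, and restates
    # it in their own words; here that's just a terminal prompt.
    print("Model A says:", message)
    return input("Your rephrasing: ")

def model_b_receive(paraphrase):
    # Placeholder for the second model, which only ever sees the human's words.
    return f"Model B acting on: {paraphrase}"

message = model_a_send("cleaning up the lab")
paraphrase = human_rephrase(message)   # anything the human didn't understand is dropped
print(model_b_receive(paraphrase))
```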
There's some possible world in which the following approach to interpretability works:
One problem that this approach would face if we were using it to interpret a human is that the human might not consciously be aware of what their motivations are. For example, they may believe they are doing something for altr...
I've heard people argue that "most" utility functions lead to agents with strong convergent instrumental goals. This obviously depends a lot on how you quantify over utility functions. Here's one intuition in the other direction. I don't expect this to be persuasive to most people who make the argument above (but I'd still be interested in hearing why not).
If a non-negligible percentage of an agent's actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (becaus...
Makes sense. For what it's worth, I'd also argue that thinking about optimal policies at all is misguided (e.g. what's the optimal policy for humans - the literal best arrangement of neurons we could possibly have for our reproductive fitness? Probably we'd be born knowing arbitrarily large amounts of information. But this is just not relevant to predicting or modifying our actual behaviour at all).
(I now think that you were very right in saying "thinking about optimal policies at all is misguided", and I was very wrong to disagree. I've thought several times about this exchange. Not listening to you about this point was a serious error and made my work way less impactful. I do think that the power-seeking theorems say interesting things, but about eg internal utility functions over an internal planning ontology -- not about optimal policies for a reward function.)