Steve Byrnes

Working on AGI safety via a deep-dive into brain algorithms, see


Reward Is Not Enough


My current working theory of human social interactions does not involve multiple reward signals. Instead it's a bunch of rules like "If you're in state X, and you empathetically simulate someone in state Y, then send reward R and switch to state Z". See my post "Little glimpses of empathy" as the foundation of social emotions. These rules would be implemented in the hypothalamus and/or brainstem.

(Plus some involvement from brainstem sensory-processing circuits that can run hardcoded classifiers that return information about things like whether a person is present right now, and maybe some aspects of their tone of voice and facial expressions, etc. Then those data can also be inputs to the "bunch of rules".)

I haven't thought it through in any level of detail or read the literature (except superficially). Maybe ask me again in a few months… :-)

Reward Is Not Enough

On your equivalence to an AI with an interpretability/oversight module. Data shouldn't be flowing back from the oversight into the AI. 

Sure. I wrote "similar to (or even isomorphic to)". We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better setup.

I would be potentially concerned that this is a trick that evolution can use, but human AI designers can't use safely. 

Sure, that's possible.

My "negative" response is: There's no royal road to safe AGI, at least not that anyone knows of so far. In particular, if we talk specifically about "subagent"-type situations where there are mutually-contradictory goals within the AGI, I think that this is simply a situation we have to deal with, whether we like it or not. And if there's no way to safely deal with that kind of situation, then I think we're doomed. Why do I think that?

  • For one thing, as I wrote in the text, it's arbitrary where we draw the line between "the AGI" and "other algorithms interacting with and trying to influence the AGI". If we draw a box around the AGI to also include things like gradient updates, or online feedback from humans, then we're definitely in that situation, because these are subsystems that are manipulating the AGI and don't share the AGI's (current) goals.
  • For another thing: it's a complicated world and the AGI is not omniscient. If you think about logical induction, the upshot is that when venturing into a complicated domain with unknown unknowns, you shouldn't expect nice well-formed self-consistent hypotheses attached to probabilities, you should expect a pile of partial patterns (i.e. hypotheses which make predictions about some things but are agnostic about others), supported by limited evidence. Then you can get situations where those partial patterns push in different directions, and "bid against each other". Now just apply exactly that same reasoning to "having desires about the state of the (complicated) world", and you wind up concluding that "subagents working against each other" is a default expectation and maybe even inevitable.

My "positive" response is: I certainly wouldn't propose to set up a promising-sounding reward system and then crack a beer and declare that we solved AGI safety. First we need a plan that might work (and we don't even have that yet, IMO!) and then we think about how it might fail, and how to modify the plan so that we can reason more rigorously about how it would work, and add in extra layers of safety (like testing, transparency, conservatism, boxing) in case even our seemingly-rigorous reasoning missed something, and so on.

Solving the whole AGI control problem, version 0.0001

Ben Goertzel comments on this post via Twitter:

1) Nice post ... IMO the "Human-Like Social Instincts" direction has best odds of success; the notion of making AGIs focused on compassion and unconditional love (understanding these are complex messy human concept-plexes) appears to fall into this category as u loosely define it

2) Of course to make compassionate/loving AGI actually work, one needs a reasonable amount of corrigibility in one's AGI cognitive architecture, many aspects of which seem independent of whether compassion/love or something quite different is the top-level motivation/inspiration

Reward Is Not Enough

how does it avoid wireheading

Um, unreliably, at least by default. Like, some humans are hedonists, others aren't.

I think there's a "hardcoded" credit assignment algorithm. When there's a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I'm not sure of the gory details here.)
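
Here is a minimal sketch of the kind of credit assignment rule described above, written only to make the idea concrete. The function name, the hard half-second window, the learning rate, and the small "background" weight are all illustrative assumptions, not a claim about what the brain actually computes.

```python
def assign_credit(values, activation_times, rpe_time, rpe,
                  lr=0.5, window=0.5, background=0.1):
    """Return an updated copy of `values` after one reward prediction error.

    values:           dict mapping concept -> current reward-prediction / value
    activation_times: dict mapping concept -> time (seconds) it became active
    rpe_time, rpe:    when the reward prediction error happened, and its size
    All parameter defaults are hypothetical, chosen for illustration only.
    """
    updated = dict(values)
    for concept, t_on in activation_times.items():
        lag = rpe_time - t_on
        if lag < 0:
            continue  # became active after the error, so it gets no credit
        # Concepts that switched on shortly before the error get full credit;
        # anything else that was in mind gets a much smaller share.
        weight = 1.0 if lag <= window else background
        updated[concept] = updated.get(concept, 0.0) + lr * weight * rpe
    return updated


values = {"saw candy": 0.0, "thinking about homework": 0.0}
activation_times = {"saw candy": 9.6, "thinking about homework": 5.0}
new_values = assign_credit(values, activation_times, rpe_time=10.0, rpe=1.0)
```

In this toy run, "saw candy" became active 0.4 s before the error and picks up most of the credit, while the older thought picks up a small fraction of it.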

Anyway, insofar as "the reward signal itself" is part of the world-model, it's possible that reward-prediction / value will wind up attached to that concept. And then that's a desire to wirehead. But it's not inevitable. Some of the relevant dynamics are:

  • Timing—if credit goes mainly to signals that slightly precede the reward prediction error, then the reward signal itself is not a great fit.
  • Explaining away—once you have a way to accurately predict some set of reward signals, it makes the reward prediction errors go away, so the credit assignment algorithm stops running for those signals. So the first good reward-predicting model gets to stick around by default. Example: we learn early in life that the "eating candy" concept predicts certain reward signals, and then we get older and learn that the "certain neural signals in my brain" concept predicts those same reward signals too. But just learning that fact doesn't automatically translate into "I really want those certain neural signals in my brain". Only the credit assignment algorithm can make a thought appealing, and if the rewards are already being predicted then the credit assignment algorithm is inactive. (This is kinda like the behaviorism concept of blocking.)
  • There may be some kind of bias to assign credit to predictive models that are simple functions of sensory inputs, when such a model exists, other things equal. (I'm thinking here of the relation between amygdala predictions, which I think are restricted to relatively simple functions of sensory input, versus mPFC predictions, which I think can involve more abstract situational knowledge. I'm still kinda confused about how this works though.)
  • There's a difference between hedonism-lite ("I want to feel good, although it's not the only thing I care about") and hedonism-level-10 ("I care about nothing whatsoever except feeling good"). My model would suggest that hedonism-lite is widespread, but hedonism-level-10 is vanishingly rare or nonexistent, because it requires that somehow all value gets removed from absolutely everything in the world-model except that one concept of the reward signal.
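
The "explaining away" bullet can be illustrated with a toy simulation. This sketch uses a Rescorla-Wagner-style update with a shared prediction error, which is my stand-in rather than anything specified above; the concept names, learning rate, and step counts are assumptions chosen for illustration. It does not model the other dynamics in the list.

```python
def train(weights, active, reward, lr=0.3, steps=50):
    """Rescorla-Wagner-style learning with a shared prediction error.

    Every active concept is credited in proportion to the same reward
    prediction error, so once the error is near zero, nothing learns.
    """
    for _ in range(steps):
        prediction = sum(weights[c] for c in active)
        rpe = reward - prediction  # reward prediction error
        for c in active:
            weights[c] += lr * rpe
    return weights


weights = {"eating candy": 0.0, "neural signals": 0.0}

# Phase 1 (early in life): only the "eating candy" concept is available.
train(weights, ["eating candy"], reward=1.0)

# Phase 2 (later): the "neural signals" concept is now also active,
# but the reward is already well predicted, so the error is tiny.
train(weights, ["eating candy", "neural signals"], reward=1.0)
```

After phase 2, the value attached to "neural signals" stays very close to zero even though it predicts the reward just as well, because the first model already explained the reward away.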

For AGIs we would probably want to do other things too, like (somehow) use transparency to find "the reward signal itself" in the world-model and manually fix its reward-prediction / value at zero, or whatever else we can think of. Also, I think the more likely failure mode is "wireheading-lite", where the desire to wirehead is trading off against other things it cares about, and then hopefully conservatism (section 2 here) can help prevent catastrophe.

Reward Is Not Enough


I had totally forgotten about your subagents post.

this post doesn't cleanly distinguish between reward-maximization and utility-maximization

I've been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize "the reward as currently understood by my predictive model". Then "the reward as currently understood by my predictive model" is basically a utility function. But at the same time, there's a separate subroutine that edits the reward prediction model (≈ utility function) to ever more closely approximate the true reward function (by some learning algorithm, presumably involving reward prediction errors).

In other words: At any given time, the part of the agent that's making plans and taking actions looks like a utility maximizer. But if you lump together that part plus the subroutine that keeps editing the reward prediction model to better approximate the real reward signal, then that whole system is a reward-maximizing RL agent.
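
To make the two-part picture concrete, here is a rough sketch under several assumptions of mine: a tabular reward model, three hypothetical plans with made-up true rewards, and optimistic initial predictions so that a greedy planner tries each plan at least once. None of these details come from the discussion above.

```python
class RewardModel:
    """Predictive model of the reward; acts as the planner's utility function."""

    def __init__(self, lr=0.5, initial=2.0):
        self.predicted = {}
        self.lr = lr
        self.initial = initial  # optimistic default (an assumption)

    def predict(self, plan):
        return self.predicted.get(plan, self.initial)

    def update(self, plan, true_reward):
        # Separate subroutine: edit the model using the reward prediction error.
        rpe = true_reward - self.predict(plan)
        self.predicted[plan] = self.predict(plan) + self.lr * rpe
        return rpe


def choose_plan(model, plans):
    # The planner looks like a utility maximizer: it only consults the model.
    return max(plans, key=model.predict)


def true_reward(plan):
    # Hypothetical ground-truth reward function, unknown to the planner.
    return {"A": 0.2, "B": 1.0, "C": 0.5}[plan]


model = RewardModel()
plans = ["A", "B", "C"]
for _ in range(30):
    plan = choose_plan(model, plans)        # utility-maximizing step
    model.update(plan, true_reward(plan))   # reward-model editing step
```

`choose_plan` on its own never sees the true reward, only the model's current predictions. It is the loop as a whole, planner plus updater, that behaves like a reward-maximizing RL agent.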

Please tell me if that makes any sense or not; I've been planning to write pretty much exactly this comment (but with a diagram) into a short post.

Reward Is Not Enough

I'm all for doing lots of testing in simulated environments, but the real world is a whole lot bigger and more open and different than any simulation. Goals / motivations developed in a simulated environment might or might not transfer to the real world in the way you, the designer, were expecting.

So, maybe, but for now I would call that "an intriguing research direction" rather than "a solution".

Reward Is Not Enough

Right, the word "feasibly" is referring to the bullet point that starts "Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?”". Here's a little toy example we can run with: teaching an AGI "don't kill all humans". So there are three approaches to reward design that I can think of, and none of them seem to offer a feasible way to do this (at least, not with currently-known techniques):

  1. The agent learns by experiencing the reward. This doesn't work for "don't kill all humans" because when the reward happens it's too late.
  2. The reward calculator is sophisticated enough to understand what the agent is thinking, and issue rewards proportionate to the probability that the current thoughts and plans will eventually lead to the result-in-question happening. So the AGI thinks "hmm, maybe I'll blow up the sun", and the reward calculator recognizes that merely thinking that thought just now incrementally increased the probability that the AGI will kill all humans, and so it issues a negative reward. This is tricky because the reward calculator needs to have an intelligent understanding of the world, and of the AGI's thoughts. So basically the reward calculator is itself an AGI, and now we need to figure out its rewards. I'm personally quite pessimistic about approaches that involve towers-of-AGIs-supervising-other-AGIs, for reasons in section 3.2 here, although other people would disagree with me on that (partly because they are assuming different AGI development paths and architectures than I am).
  3. Same as above, but instead of a separate reward calculator estimating the probability that a thought or plan will lead to the result-in-question, we allow the AGI itself to do that estimation, by flagging a concept in its world-model called "I will kill all humans", and marking it as "very bad and important" somehow. (The inspiration here is a human who somehow winds up with the strong desire "I want to get out of debt". Having assigned value to that abstract concept, the human can assess for themselves the probabilities that different thoughts will increase or decrease the probability of that thing happening, and sorta issue themselves a reward accordingly.) The tricky part is (A) making sure that the AGI does in fact have that concept in its world-model (I think that's a reasonable assumption, at least after some training), (B) finding that concept in the massive complicated opaque world-model, in order to flag it. So this is the symbol-grounding problem I mentioned in the text. I can imagine solving it if we had really good interpretability techniques (techniques that don't currently exist), or maybe there are other methods, but it's an unsolved problem as of now.

Looking Deeper at Deconfusion

Is there any good AI alignment research that you don't classify as deconfusion? If so, can you give some examples?

The Credit Assignment Problem

a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile

I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y'know, having read the sequences… :-P

I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn't need to share my goals.

The way I think about it is:

  • Early in training, the AGI is too stupid to formulate and execute a plan to hack into its epistemic level.
  • Late in training, we can hopefully get to the place where the AGI's values, like mine, involve a concept of "there is a real world independent of my beliefs", and its preferences involve the state of that world, and therefore "get accurate beliefs" becomes instrumentally useful and endorsed.
  • In between … well … in between, we're navigating treacherous waters …

Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics! We risk corruption!

I mean, if the instrumental level has any way whatsoever to influence the epistemic level, it will be able to corrupt it with false beliefs if it's hell-bent on doing so, and if it's sufficiently intelligent and self-aware. But remember we're not protecting against a superintelligent adversary; we're just trying to "navigate the treacherous waters" I mentioned above. So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system. I think the things that human brains do for that are:

  1. The instrumental level gets some influence over what to look at, where to go, what to read, who to talk to, etc.
  2. There's a trick (involving acetylcholine) where the instrumental level has some influence over a multiplier on the epistemic level's gradients (a.k.a. learning rate). So the epistemic level always updates towards "more accurate predictions on this frame", but it updates infinitesimally in situations where prediction accuracy is instrumentally useless, and it updates strongly in situations where prediction accuracy is instrumentally important.
  3. There's a different mechanism that creates the same end result as #2: namely, the instrumental level has some influence over what memories get replayed more or less often.
  4. For #2 and #3, the instrumental level has some influence but not complete influence. There are other hardcoded algorithms running in parallel and flagging certain things as important, and the instrumental level has no straightforward way to prevent that from happening. 
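
Items 2 and 4 above can be sketched as a gated learning rate. This is a toy under stated assumptions: the function name, the scalar prediction, the clipping to [0, 1], and the way the hardcoded flag acts as a floor on the gate are all my own illustrative choices, not a description of the actual acetylcholine mechanism.

```python
def epistemic_update(prediction, observation, instrumental_importance,
                     hardcoded_importance=0.0, base_lr=0.5):
    """One update of a scalar prediction with a gated learning rate.

    The instrumental level only sets a multiplier in [0, 1]; it cannot
    choose the direction of the update, which always reduces the error.
    A hardcoded importance flag (hypothetical) acts as a floor that the
    instrumental level cannot override.
    """
    requested = min(max(instrumental_importance, 0.0), 1.0)
    gate = max(requested, hardcoded_importance)
    error = observation - prediction
    return prediction + base_lr * gate * error


strong = epistemic_update(0.0, 1.0, instrumental_importance=1.0)
weak = epistemic_update(0.0, 1.0, instrumental_importance=0.001)
flagged = epistemic_update(0.0, 1.0, instrumental_importance=0.0,
                           hardcoded_importance=0.8)
```

The design point is that the gate scales how far the prediction moves toward the observation, but no setting of `instrumental_importance` can move it away from the observation.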

Big picture of phasic dopamine

I'm proposing that (1) the hypothalamus has an input slot for "flinch now", (2) VTA has an output signal for "should have flinched", (3) there is a bundle of partially-redundant side-by-side loops (see the "probability distribution" comment) that connect specifically to both (1) and (2), by a genetically-hardcoded mechanism.

I take your comment to be saying: Wouldn't it be hard for the brain to orchestrate such a specific pair of connections across a considerable distance?

Well, I'm very much not an expert on how the brain wires itself up. But I think there's gotta be some way that it can do things like that. I feel like those kinds of feats of wiring are absolutely required for all kinds of reasons. Like, I think motor cortex connects directly to spinal hand-control nerves, but not foot-control nerves. How do the output neurons aim their paths so accurately, such that they don't miss and connect to the foot nerves by mistake? Um, I don't know, but it's clearly possible. "Molecular signaling" or something, I guess?

Alternatively we might imagine some separate mechanism for priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals, and the brainstem could learn what its outputs correspond to and then signal them appropriately.

Hmm, one reasonable (to me) possibility along these lines would be something like: "VTA has 20 dopamine output signals, and they're guided to wind up spread out across the amygdala, but not with surgical precision. Meanwhile the corresponding amygdala loops terminate in an "input zone" of the lateral hypothalamus, but not to any particular spot, instead they float around unsure of exactly what hypothalamus "entry point" to connect to. And there are 20 of these intended "entry points" (collections of neurons for flinching, scowling, etc.). OK, then during embryonic development, the entry-point neurons are firing randomly, and that signal goes around the loop—within the hypothalamus and to VTA, then up to the amygdala, then back down to that floating neuron. Then Hebbian learning—i.e. matching the random code—helps the right loop neuron find its way to the matching hypothalamus entry point."

I'm not sure if that's exactly what you're proposing, but that seems like a perfectly plausible way for the brain to orchestrate these connections during embryonic development. I do have a hunch that this isn't what happens, that the real mechanism is "molecular signaling" instead. But like I said, I'm not an expert, and I certainly wouldn't be shocked to learn that the brain's embryonic wiring mechanism involves this kind of thing where it closes a loop by sending a random code around the loop and Hebbian-learning the final connection.
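
For what it's worth, the "send a random code around the loop and Hebbian-learn the final connection" idea is easy to simulate as a toy. Everything here is an assumption for illustration: binary firing, a hidden random permutation standing in for the imprecise VTA-to-amygdala routing, and a simple agree/disagree correlation rule standing in for Hebbian learning.

```python
import random


def wire_loops(n=20, n_steps=200, seed=0):
    """Toy model: floating loop neurons find their matching entry points."""
    rng = random.Random(seed)

    # Hidden ground truth: loop neuron j carries the signal that
    # originated at entry point source[j]. The wiring rule never reads this.
    source = list(range(n))
    rng.shuffle(source)

    # weights[j][e]: connection strength from loop neuron j to entry point e.
    weights = [[0] * n for _ in range(n)]

    for _ in range(n_steps):
        # Entry-point neurons fire randomly during development.
        firing = [rng.random() < 0.5 for _ in range(n)]
        # The signal travels around the loop and arrives at the loop neurons.
        loop_activity = [firing[source[j]] for j in range(n)]
        # Correlation-based (Hebbian-style) rule: strengthen on agreement,
        # weaken on disagreement.
        for j in range(n):
            for e in range(n):
                weights[j][e] += 1 if loop_activity[j] == firing[e] else -1

    # Each loop neuron connects to its most strongly correlated entry point.
    wiring = [max(range(n), key=lambda e: weights[j][e]) for j in range(n)]
    return source, wiring


source, wiring = wire_loops()
```

Only the matching entry point agrees with a loop neuron on every step, so its weight grows steadily while the others drift around zero, and the learned wiring recovers the hidden correspondence.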
