Working on AGI safety via a deep-dive into brain algorithms, see https://sjbyrnes.com/agi.html
My current working theory of human social interactions does not involve multiple reward signals. Instead it's a bunch of rules like "If you're in state X, and you empathetically simulate someone in state Y, then send reward R and switch to state Z". See my post "Little glimpses of empathy" as the foundation of social emotions. These rules would be implemented in the hypothalamus and/or brainstem.
(Plus some involvement from brainstem sensory-processing circuits that can run hardcoded classifiers that return information about things like whether a person is present right now, and maybe some aspects of their tone of voice and facial expressions, etc. Then those data can also be inputs to the "bunch of rules".)
I haven't thought it through in any level of detail or read the literature (except superficially). Maybe ask me again in a few months… :-)
On your proposed equivalence to an AI with an interpretability/oversight module: data shouldn't be flowing back from the oversight module into the AI.
Sure. I wrote "similar to (or even isomorphic to)". We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better setup.
I would be potentially concerned that this is a trick that evolution can use, but that human AI designers can't use safely.
Sure, that's possible.
My "negative" response is: There's no royal road to safe AGI, at least not that anyone knows of so far. In particular, if we talk specifically about "subagent"-type situations where there are mutually-contradictory goals within the AGI, I think that this is simply a situation we have to deal with, whether we like it or not. And if there's no way to safely deal with that kind of situation, then I think we're doomed. Why do I think that?

For one thing, as I wrote in the text, it's arbitrary where we draw the line between "the AGI" and "other algorithms interacting with and trying to influence the AGI". If we draw a box around the AGI that also includes things like gradient updates, or online feedback from humans, then we're definitely in that situation, because these are subsystems that are manipulating the AGI and don't share the AGI's (current) goals.

For another thing: it's a complicated world and the AGI is not omniscient. If you think about logical induction, the upshot is that when venturing into a complicated domain with unknown unknowns, you shouldn't expect nice well-formed self-consistent hypotheses attached to probabilities; you should expect a pile of partial patterns (i.e., hypotheses which make predictions about some things but are agnostic about others), supported by limited evidence. Then you can get situations where those partial patterns push in different directions and "bid against each other". Now apply exactly that same reasoning to "having desires about the state of the (complicated) world", and you wind up concluding that "subagents working against each other" is a default expectation, and maybe even inevitable.
My "positive" response is: I certainly wouldn't propose to set up a promising-sounding reward system and then crack a beer and declare that we solved AGI safety. First we need a plan that might work (and we don't even have that yet, IMO!) and then we think about how it might fail, and how to modify the plan so that we can reason more rigorously about how it would work, and add in extra layers of safety (like testing, transparency, conservatism, boxing) in case even our seemingly-rigorous reasoning missed something, and so on.
Ben Goertzel comments on this post via Twitter:
1) Nice post ... IMO the "Human-Like Social Instincts" direction has best odds of success; the notion of making AGIs focused on compassion and unconditional love (understanding these are complex messy human concept-plexes) appears to fall into this category as u loosely define it
2) Of course to make compassionate/loving AGI actually work, one needs a reasonable amount of corrigibility in one's AGI cognitive architecture, many aspects of which seem independent of whether compassion/love or something quite different is the top-level motivation/inspiration
How does it avoid wireheading?
Um, unreliably, at least by default. Like, some humans are hedonists, others aren't.
I think there's a "hardcoded" credit assignment algorithm. When there's a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I'm not sure of the gory details here.)
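To make that concrete, here's a minimal sketch of the credit-assignment rule I just described. This is my own toy formalization: the two learning rates, the concept names, and the dictionary representation are all made up for illustration, and the real "gory details" are exactly what I said I'm unsure of.

```python
# Toy sketch of hardcoded credit assignment: on a reward prediction error,
# primarily bump the value of concepts that became newly active just
# beforehand, and (with a smaller rate) anything else being thought about.

LR_NEW, LR_OTHER = 0.5, 0.1  # assumed learning rates (made up)

def credit_assignment(values, newly_active, also_active, rpe):
    """Update value estimates in place, given a reward prediction error."""
    for concept in newly_active:   # stuff that became active ~0.5 sec earlier
        values[concept] = values.get(concept, 0.0) + LR_NEW * rpe
    for concept in also_active:    # anything else active at the time
        values[concept] = values.get(concept, 0.0) + LR_OTHER * rpe
    return values

values = {}
credit_assignment(values, newly_active={"candy"}, also_active={"store"}, rpe=1.0)
# "candy" gets most of the credit; "store" gets a little
```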
Anyway, insofar as "the reward signal itself" is part of the world-model, it's possible that reward-prediction / value will wind up attached to that concept. And then that's a desire to wirehead. But it's not inevitable. Some of the relevant dynamics are:
For AGIs we would probably want to do other things too, like (somehow) use transparency to find "the reward signal itself" in the world-model and manually fix its reward-prediction / value at zero, or whatever else we can think of. Also, I think the more likely failure mode is "wireheading-lite", where the desire to wirehead is trading off against other things it cares about, and then hopefully conservatism (section 2 here) can help prevent catastrophe.
I had totally forgotten about your subagents post.
this post doesn't cleanly distinguish between reward-maximization and utility-maximization
I've been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize "the reward as currently understood by my predictive model". Then "the reward as currently understood by my predictive model" is basically a utility function. But at the same time, there's a separate subroutine that edits the reward prediction model (≈ utility function) to ever more closely approximate the true reward function (by some learning algorithm, presumably involving reward prediction errors).
In other words: At any given time, the part of the agent that's making plans and taking actions looks like a utility maximizer. But if you lump together that part plus the subroutine that keeps editing the reward prediction model to better approximate the real reward signal, then that whole system is a reward-maximizing RL agent.
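Here's roughly the picture in code (a toy sketch of my own; the actions, reward values, learning rate, and exploration schedule are all invented for illustration). The `plan` function alone looks like a utility maximizer over the learned reward model; the loop around it is the subroutine that edits that model toward the true reward signal.

```python
# Toy model-based RL agent: planner = utility maximizer over the learned
# reward model; a separate subroutine edits the model via prediction errors.

TRUE_REWARD = {"A": 0.0, "B": 1.0}   # assumed ground-truth reward signal
reward_model = {"A": 0.5, "B": 0.0}  # learned reward model ≈ utility function
LR = 0.3                             # learning rate for model edits
actions = list(TRUE_REWARD)

def plan(model):
    """The planner: pick the action the current reward model values most."""
    return max(model, key=model.get)

for t in range(100):
    # Mostly act like a utility maximizer on the current model; occasionally
    # explore, so the model-editing subroutine observes every action's reward.
    action = actions[(t // 10) % len(actions)] if t % 10 == 0 else plan(reward_model)
    rpe = TRUE_REWARD[action] - reward_model[action]  # reward prediction error
    reward_model[action] += LR * rpe                  # edit the "utility function"
```

At any instant, `plan` is maximizing a fixed utility function; the whole loop, taken together, behaves like a reward-maximizing RL agent whose model converges on the true reward.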
Please tell me if that makes any sense or not; I've been planning to write pretty much exactly this comment (but with a diagram) into a short post.
I'm all for doing lots of testing in simulated environments, but the real world is a whole lot bigger and more open and different than any simulation. Goals / motivations developed in a simulated environment might or might not transfer to the real world in the way you, the designer, were expecting.
So, maybe, but for now I would call that "an intriguing research direction" rather than "a solution".
Right, the word "feasibly" is referring to the bullet point that starts "Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?”". Here's a little toy example we can run with: teaching an AGI "don't kill all humans". So there are three approaches to reward design that I can think of, and none of them seem to offer a feasible way to do this (at least, not with currently-known techniques):
Is there any good AI alignment research that you don't classify as deconfusion? If so, can you give some examples?
a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile
I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y'know, having read the sequences… :-P
I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn't need to share my goals.
The way I think about it is:
Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need the epistemic system to focus on predicting important things; we need to control the amount of processing power it spends; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics, lest we risk corruption!
I mean, if the instrumental level has any way whatsoever to influence the epistemic level, it will be able to corrupt it with false beliefs if it's hell-bent on doing so, and if it's sufficiently intelligent and self-aware. But remember we're not protecting against a superintelligent adversary; we're just trying to "navigate the treacherous waters" I mentioned above. So the goal is to allow what instrumental influence we can on the epistemic system, while making it hard and complicated to outright corrupt the epistemic system. I think the things that human brains do for that are:
I'm proposing that (1) the hypothalamus has an input slot for "flinch now", (2) VTA has an output signal for "should have flinched", (3) there is a bundle of partially-redundant side-by-side loops (see the "probability distribution" comment) that connect specifically to both (1) and (2), by a genetically-hardcoded mechanism.
I take your comment to be saying: Wouldn't it be hard for the brain to orchestrate such a specific pair of connections across a considerable distance?
Well, I'm very much not an expert on how the brain wires itself up. But I think there's gotta be some way that it can do things like that. I feel like those kinds of feats of wiring are absolutely required for all kinds of reasons. Like, I think motor cortex connects directly to spinal hand-control nerves, but not foot-control nerves. How do the output neurons aim their paths so accurately, such that they don't miss and connect to the foot nerves by mistake? Um, I don't know, but it's clearly possible. "Molecular signaling" or something, I guess?
Alternatively, we might imagine some separate mechanism for priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals; the brainstem could then learn what its outputs correspond to and signal them appropriately.
Hmm, one reasonable (to me) possibility along these lines would be something like: "VTA has 20 dopamine output signals, and they're guided to wind up spread out across the amygdala, but not with surgical precision. Meanwhile, the corresponding amygdala loops terminate in an 'input zone' of the lateral hypothalamus, but not at any particular spot; instead they float around, unsure of exactly which hypothalamus 'entry point' to connect to. And there are 20 of these intended 'entry points' (collections of neurons for flinching, scowling, etc.). OK, then during embryonic development, the entry-point neurons fire randomly, and that signal goes around the loop—within the hypothalamus and to VTA, then up to the amygdala, then back down to that floating neuron. Then Hebbian learning—i.e., matching the random code—helps the right loop neuron find its way to the matching hypothalamus entry point."
I'm not sure if that's exactly what you're proposing, but that seems like a perfectly plausible way for the brain to orchestrate these connections during embryonic development. I do have a hunch that this isn't what happens, that the real mechanism is "molecular signaling" instead. But like I said, I'm not an expert, and I certainly wouldn't be shocked to learn that the brain embryonic wiring mechanism involves this kind of thing where it closes a loop by sending a random code around the loop and Hebbian-learning the final connection.
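The loop-closing idea above can be sketched as a toy simulation (my own toy model; the 20 loops, the code length, and the agreement-based matching rule are all made up for illustration): each entry point fires a random spike train, the train travels around its loop, and each floating neuron wires up to whichever entry point's firing pattern matches the signal it received.

```python
import random

# Toy model of closing embryonic loops by Hebbian-matching a random code.
random.seed(0)
N_LOOPS, CODE_LEN = 20, 64

# Each hypothalamus "entry point" fires a random spike train during development.
codes = [[random.randint(0, 1) for _ in range(CODE_LEN)] for _ in range(N_LOOPS)]

def agreement(a, b):
    """Fraction of time steps on which two spike trains agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Floating neuron i receives entry point i's code around its loop; Hebbian
# matching connects it to the entry point whose firing best matches.
wiring = {i: max(range(N_LOOPS), key=lambda j: agreement(codes[i], codes[j]))
          for i in range(N_LOOPS)}
```

With long-enough random codes, unrelated trains agree only about half the time, so with overwhelming probability every floating neuron finds its own matching entry point.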