Steve Byrnes

I'm Steve Byrnes, a professional physicist in the Boston area. I have a summary of my AGI safety research interests at:


Birds, Planes, Brains, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Moreover, we probably won’t figure out how to make AIs that are as data-efficient as humans for a long time--decades at least.

I know you weren't endorsing this claim as definitely true, but FYI my take is that other families of learning algorithms besides deep neural networks are in fact as data-efficient as humans, particularly those related to probabilistic programming and analysis-by-synthesis, see examples here.

Multi-dimensional rewards for AGI interpretability and control

Yeah, sure. I should have said: there's no plausibly useful & practical scheme that is fundamentally different from scalars, for this, at least none that I can think of. :-)

Defusing AGI Danger

That's fair, I should properly write out the brain-like AGI danger scenario(s) that have been in my head, one of these days. :-)

Hierarchical planning: context agents

I think maybe a more powerful framework than discrete contexts is that there's a giant soup of models, and the models have arrows pointing at other models, and multiple models can be active simultaneously, and the models can span different time scales. So you can have a "I am in the store" model, and it's active for the whole time you're shopping, and meanwhile there are faster models like "I am looking for noodles", and slower models like "go shopping then take the bus home". And anything can point to anything else. So then if you have a group of models that mainly point to each other, and less to other stuff, it's a bit of an island in the graph, and you can call it a "context". Like everything I know about chess strategy is mostly isolated from the rest of my universe of knowledge and ideas, so I could say I have a "chess strategy context". But that's an emergent property, not part of the data structure.
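To make the "island in the graph" idea concrete, here's a toy sketch (all model names and the scoring function are my own illustrative inventions, not part of any actual proposal): models are nodes in a directed graph, and a "context" is an emergent cluster whose edges mostly stay inside the cluster.

```python
# Toy sketch of the "soup of models" picture: models point at other models,
# and a "context" is just a cluster whose edges mostly stay internal.

def internal_edge_fraction(edges, group):
    """Fraction of edges touching `group` that stay entirely inside it."""
    touching = [(a, b) for a, b in edges if a in group or b in group]
    if not touching:
        return 0.0
    internal = [(a, b) for a, b in touching if a in group and b in group]
    return len(internal) / len(touching)

edges = [
    ("control the center", "develop the knights"),    # chess-strategy models
    ("develop the knights", "castle early"),
    ("castle early", "control the center"),
    ("control the center", "plan ahead"),             # lone link to the rest
    ("I am in the store", "I am looking for noodles"),
    ("go shopping then take the bus home", "I am in the store"),
]

chess = {"control the center", "develop the knights", "castle early"}
print(internal_edge_fraction(edges, chess))  # 0.75 -> mostly self-contained
```

The chess models score high because they mostly point at each other, so calling them a "chess strategy context" is a fact about the graph's shape, not about the data structure.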

My impression is that the Goodhart's law thing at the end is a bit like saying "Don't think creatively"... Thinking creatively is making new connections that don't immediately pop into your head. Is that reasonable? Sorry if I'm misunderstanding. :)

Homogeneity vs. heterogeneity in AI takeoff scenarios


It's unlikely for there to exist both aligned and misaligned AI systems at the same time—either all of the different AIs will be aligned to approximately the same degree or they will all be misaligned to approximately the same degree.

Is there an argument that it's impossible to fine-tune an aligned system into a misaligned one? Or just that everyone fine-tuning these systems will be smart and careful and read the manual etc. so that they do it right? Or something else?

I'd very much like to see more discussion of the extent to which different people expect homogenous vs. heterogenous takeoff scenarios

Thinking about it right now, I'd say "homogeneous learning algorithms, heterogeneous trained models" (in a multipolar type scenario at least). I guess my intuitions are (1) No matter how expensive "training from scratch" is, it's bound to happen a second time if people see that it worked the first time. (2) I'm more inclined to think that fine-tuning can make it into "more-or-less a different model", rather than necessarily "more-or-less the same model". I dunno.

Inner Alignment in Salt-Starved Rats

Thanks for your comment! I think that you're implicitly relying on a different flavor of "inner alignment" than the one I have in mind.

(And confusingly, the brain can be described using either version of "inner alignment"! With different resulting mental pictures in the two cases!!)

See my post Mesa-Optimizers vs "Steered Optimizers" for details on those two flavors of inner alignment.

I'll summarize here for convenience.

I think you're imagining that the AGI programmer will set up SGD (or equivalent) and the thing SGD does is analogous to evolution acting on the entire brain. In that case I would agree with your perspective.

I'm imagining something different:

The first background step is: I argue (for example here) that one part of the brain (I'm calling it the "neocortex subsystem") is effectively implementing a relatively simple, quasi-general-purpose learning-and-acting algorithm. This algorithm is capable of foresighted goal seeking, predictive-world-model-building, inventing new concepts, etc. etc. I don't think this algorithm looks much like a deep neural net trained by SGD; I think it looks like a learning algorithm that no human has invented yet, one which is more closely related to learning probabilistic graphical models than to deep neural nets. So that's one part of the brain, comprising maybe 80% of the weight of a human brain. Then there are other parts of the brain (brainstem, amygdala, etc.) that are not part of this subsystem. Instead, one thing they do is run calculations and interact with the "neocortex subsystem" in a way that tends to "steer" that "neocortex subsystem" towards behaving in ways that are evolutionarily adaptive. I think there are many different "steering" brain circuits, and they are designed to steer the neocortex subsystem towards seeking a variety of goals using a variety of mechanisms.

So that's the background picture in my head—and if you don't buy into that, nothing else I say will make sense.

Then the second step is: I'm imagining that the AGI programmers will build a learning-and-acting algorithm that resembles the "neocortex subsystem"'s learning-and-acting algorithm—not by some blind search over algorithm space, but by directly programming it, just as people have directly programmed AlphaGo and many other learning algorithms in the past. (They will do this either by studying how the neocortex works, or by reinventing the same ideas.) Once those programmers succeed, then OK, now these programmers will have in their hands a powerful quasi-general-purpose learning-and-acting (and foresighted goal-seeking, concept-inventing, etc.) algorithm. And then the programmer will be in a position analogous to the position of the genes wiring up those other brain modules (brainstem, etc.): the programmers will be writing code that tries to get this neocortex-like algorithm to do the things they want it to do. Let's say the code they write is a "steering subsystem".

The simplest possible "steering subsystem" is ridiculously simple and obvious: just a reward calculator that sends rewards for the exact thing that the programmer wants it to do. (Note: the "neocortex subsystem" algorithm has an input for reward signals, and these signals are involved in creating and modifying its internal goals.) And if the programmer unthinkingly builds that kind of simple "steering subsystem", it would kinda work, but not reliably, for the usual reasons like ambiguity in extrapolating out-of-distribution, the neocortex-like algorithm sabotaging the steering subsystem, etc. But, we can hope that there are more complicated possible "steering subsystems" that would work better.
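The two-subsystem wiring above can be sketched schematically. This is purely illustrative: the "neocortex subsystem" is stubbed as a trivial bandit-style learner with a reward input, and the "steering subsystem" is the simplest possible reward calculator; every class name and signature here is my own hypothetical.

```python
# Schematic sketch: a learner with a reward input (standing in for the
# "neocortex subsystem") wired to the simplest possible steering subsystem,
# i.e. a reward calculator for the exact behavior the programmer wants.
import random

class NeocortexLikeLearner:
    """Stub learner: picks actions, adjusts internal values from reward."""
    def __init__(self, actions, lr=0.3):
        self.values = {a: 0.0 for a in actions}
        self.lr = lr

    def act(self, epsilon=0.1):
        if random.random() < epsilon:          # occasional exploration
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def receive_reward(self, action, reward):
        self.values[action] += self.lr * (reward - self.values[action])

def steering_subsystem(action):
    """Simplest possible steering: reward exactly the desired behavior."""
    return 1.0 if action == "do the task" else 0.0

random.seed(0)
learner = NeocortexLikeLearner(["do the task", "do something else"])
for _ in range(200):
    a = learner.act()
    learner.receive_reward(a, steering_subsystem(a))

print(max(learner.values, key=learner.values.get))  # "do the task"
```

This "kinda works" in-distribution, as the comment says; the failure modes (out-of-distribution extrapolation, the learner sabotaging its reward calculator) are exactly the parts this toy loop cannot capture.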

So then this article is part of a research program of trying to understand the space of possibilities for the "steering subsystem", and figuring out which of them (if any!!) would work well enough to keep arbitrarily powerful AGIs (of this basic architecture) doing what we want them to do.

Finally, if you can load that whole perspective into your head, I think from that perspective it's appropriate to say "the genome is “trying” to install that goal in the rat’s brain", just as you can say that a particular gene is "trying" to do a certain step of assembling a white blood cell or whatever. (The "trying" is metaphorical but sometimes helpful.) I suppose I should have said "rat's neocortex subsystem" instead of "rat's brain". Sorry about that.

Does that help? Sorry if I'm misunderstanding you. :-)

ricraz's Shortform

I wrote a few posts on self-supervised learning last year:

I'm not aware of any airtight argument that "pure" self-supervised learning systems, either generically or with any particular architecture, are safe to use, to arbitrary levels of intelligence, though it seems very much worth someone trying to prove or disprove that. For my part, I got distracted by other things and haven't thought about it much since then.

The other issue is whether "pure" self-supervised learning systems would be capable enough to satisfy our AGI needs, or to safely bootstrap to systems that are. I go back and forth on this. One side of the argument I wrote up here. The other side is, I'm now (vaguely) thinking that people need a reward system to decide what thoughts to think, and the fact that GPT-3 doesn't need reward is not evidence of reward being unimportant but rather evidence that GPT-3 is nothing like an AGI. Well, maybe.

For humans, self-supervised learning forms the latent representations, but the reward system controls action selection. It's not altogether unreasonable to think that action selection, and hence reward, is a more important thing to focus on for safety research. AGIs are dangerous when they take dangerous actions, to a first approximation. The fact that a larger fraction of neocortical synapses are adjusted by self-supervised learning than by reward learning is interesting and presumably safety-relevant, but I don't think it immediately proves that self-supervised learning has a similarly larger fraction of the answers to AGI safety questions. Maybe, maybe not, it's not immediately obvious. :-)
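The division of labor described here (self-supervised learning builds the representations, reward controls action selection) can be illustrated with a toy agent. Everything below is schematic and of my own invention, not anyone's actual proposal: the "world model" is just memorized transitions, and the "reward system" is a separate preference table that alone drives action choice.

```python
# Toy split: self-supervised updates improve a next-observation predictor
# (the latent representations), while a separate reward signal shapes
# action preferences; only the latter drives action selection.

class ToyAgent:
    def __init__(self, actions):
        self.transition = {}                          # self-supervised model
        self.preference = {a: 0.0 for a in actions}   # reward-shaped values

    def self_supervised_update(self, obs, next_obs):
        # "Predict the next observation"; here, simply memorize it.
        self.transition[obs] = next_obs

    def reward_update(self, action, reward, lr=0.5):
        self.preference[action] += lr * (reward - self.preference[action])

    def select_action(self):
        # Action selection depends on reward history, not the world model.
        return max(self.preference, key=self.preference.get)

agent = ToyAgent(["explore", "exploit"])
agent.self_supervised_update("hungry", "see food")   # model learning
agent.reward_update("exploit", 1.0)                  # reward learning
print(agent.select_action())  # "exploit"
```

The point of the sketch: however rich the self-supervised model gets, what the agent *does* (and hence how dangerous it is) flows through the reward-shaped pathway.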
