Steve Byrnes

I'm an AGI safety researcher in Boston with a particular focus on brain algorithms. See Email: Twitter: @steve47285


Intro to Brain-Like-AGI Safety



Rant on Problem Factorization for Alignment

I’d love to test this theory, please give feedback in the comments about your own work experience and thoughts on problem factorization.

Yes I too have a rant along those lines from a post a while back, here it is:

I’m generally skeptical that anything in the vicinity of factored cognition will achieve both sufficient safety and sufficient capability simultaneously, for reasons similar to Eliezer’s here. For example, I’ll grant that a team of 10 people can design a better and more complex widget than any one of them could by themselves. But my experience (from having been on many such teams) is that the 10 people all need to be explaining things to each other constantly, such that they wind up with heavily-overlapping understandings of the task, because all abstractions are leaky. And you can’t just replace the 10 people with 100 people spending 10× less time, or the project will absolutely collapse, crushed under the weight of leaky abstractions and unwise-in-retrospect task-splittings and task-definitions, with no one understanding what they’re supposed to be doing well enough to actually do it. In fact, at my last job, it was not at all unusual for me to find myself sketching out the algorithms on a project and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain.

Principles of Privacy for Alignment Research

at that point they're selected pretty heavily for also finding lots of stuff about alignment.

An annoying thing is, just as I sometimes read Yann LeCun or Steven Pinker or Jeff Hawkins, and I extract some bits of insight from them while ignoring all the stupid things they say about the alignment problem, by the same token I imagine other people might read my posts, and extract some bits of insight from me while ignoring all the wise things I say about the alignment problem. :-P

That said, I do definitely put some nonzero weight on those kinds of considerations. :)

Principles of Privacy for Alignment Research

I think my threat model is a bit different. I don’t particularly care about the zillions of mediocre ML practitioners who follow things that are hot and/or immediately useful. I do care about the pioneers, who are way ahead of the curve, working to develop the next big idea in AI long before it arrives. These people are not only very insightful themselves, but also can recognize an important insight when they see it, and they’re out hunting for those insights, and they’re not looking in the same places as most people, and in particular they’re not looking at whatever is trending on Twitter or immediately useful.

Let’s try this analogy, maybe: “most impressive AI” ↔ “fastest man-made object”. Let’s say that the current record-holder for fastest man-made object is a train. And right now a competitor is building a better train, that uses new train-track technology. It’s all very exciting, and lots of people are following it in the newspapers. Meanwhile, a pioneer has the idea of building the first-ever rocket ship, but the pioneer is stuck because they need better heat-resistant tiles in order for the rocket-ship prototype to actually work. This pioneer is probably not going to be following the fastest-train news; instead, they’re going to be poring over the obscure literature on heat-resistant tiles. (Sorry for lack of historical or engineering accuracy in the above.) This isn’t a perfect analogy for many reasons, ignore it if you like.

So my ideal model is (1) figure out the whole R&D path(s) to building AGI, (2) don’t tell anyone (or even write it down!), (3) now you know exactly what not to publish, i.e. everything on that path, and it doesn’t matter whether those things would be immediately useful or not, because the pioneers who are already setting out down that path will seek out and find what you’re publishing, even if it’s obscure, because they already have a pretty good idea of what they’re looking for. Of course, that’s easier said than done, especially step (1) :-P

Reward is not the optimization target

Sure, other things equal. But other things aren’t necessarily equal. For example, regularization could stack the deck in favor of one policy over another, even if the latter has been systematically producing slightly higher reward. There are lots of things like that; the details depend on the exact RL algorithm. In the context of brains, I have discussion and examples in §9.3.3 here.
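Here is a toy illustration of the regularization point, as a sketch under made-up assumptions rather than a claim about any particular RL algorithm. It assumes a two-armed bandit, a KL penalty toward a prior policy, and hypothetical numbers for the rewards and prior. The function `kl_regularized_policy` is my own name; it uses the standard closed form for maximizing expected reward minus `strength * KL(policy || prior)`.

```python
import math

def kl_regularized_policy(rewards, prior, strength):
    """Policy maximizing E_pi[reward] - strength * KL(pi || prior)."""
    # Closed form: pi(a) is proportional to prior(a) * exp(reward(a) / strength).
    log_weights = [math.log(p) + r / strength for r, p in zip(rewards, prior)]
    top = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical numbers: arm 1 pays slightly more, but the prior favors arm 0.
rewards = [1.00, 1.05]
prior = [0.9, 0.1]

strong = kl_regularized_policy(rewards, prior, strength=1.0)   # heavy regularization
weak = kl_regularized_policy(rewards, prior, strength=0.01)    # almost none
```

With the heavy penalty, `strong` puts most of its probability on arm 0 even though arm 1 has been paying slightly more; with the penalty nearly switched off, `weak` mostly picks arm 1.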

Reward is not the optimization target

It seems to me that incomplete exploration doesn't plausibly cause you to learn "task completion" instead of "reward" unless the reward function is perfectly aligned with task completion in practice. That's an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.

Let’s say, in the first few actually-encountered examples, reward is in fact strongly correlated with task completion. Reward is also of course 100% correlated with reward itself.

Then (at least under many plausible RL algorithms), the agent-in-training, having encountered those first few examples, might wind up wanting / liking the idea of task completion, OR wanting / liking the idea of reward, OR wanting / liking both of those things at once (perhaps to different extents). (I think it’s generally complicated and a bit fraught to predict which of these three possibilities would happen.)

But let’s consider the case where the RL agent-in-training winds up mostly or entirely wanting / liking the idea of task completion. And suppose further that the agent-in-training is by now pretty smart and self-aware and in control of its situation. Then the agent may deliberately avoid encountering edge-case situations where reward would come apart from task completion. (In the same way that I deliberately avoid taking highly-addictive drugs.)

Why? Because of instrumental convergence goal-preservation drive. After all, encountering those situations would lead to its no longer valuing task completion.

So, deliberately-imperfect exploration is a mechanism that allows the RL agent to (perhaps) stably value something other than reward, even in the absence of perfect correlation between reward and that thing.
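A toy sketch of that mechanism, under loudly hypothetical assumptions: two actions (`do_task` and `hack_reward`), made-up reward numbers, simple running-average value learning, and a `veto_value_changing` flag that stands in for the agent's foresight (a real agent would need a world model to predict which experiences would rewrite its values). It is an illustration of the idea, not an implementation of any specific algorithm.

```python
import random

REWARD = {"do_task": 1.0, "hack_reward": 10.0}  # hypothetical: the edge case pays more

def train(veto_value_changing, steps=200, epsilon=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Early training has left the agent liking task completion more.
    value = {"do_task": 0.5, "hack_reward": 0.0}
    counts = {"do_task": 0, "hack_reward": 0}
    for _ in range(steps):
        # Foresight stand-in: the agent refuses to even try an action that it
        # predicts would change what it values.
        allowed = [a for a in value
                   if not (veto_value_changing and a == "hack_reward")]
        if rng.random() < epsilon:
            action = rng.choice(allowed)            # exploration
        else:
            action = max(allowed, key=value.get)    # exploit current values
        counts[action] += 1
        # Reward reinforces whichever action was actually taken.
        value[action] += lr * (REWARD[action] - value[action])
    return value, counts

forced_value, forced_counts = train(veto_value_changing=False)
careful_value, careful_counts = train(veto_value_changing=True)
```

In this toy, the agent with forced exploration eventually stumbles on `hack_reward` and ends up valuing it more than the task, while the agent that declines to explore it never updates in that direction and keeps valuing `do_task`.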

(By the way, in my mind, nothing here should be interpreted as a safety proposal or argument against x-risk. Just a discussion of algorithms! As it happens, I think wireheading is bad and I am very happy for RL agents to have a chance at permanently avoiding it. But I am very unhappy with the possibility of RL agents deciding to lock in their values before those values are exactly what the programmers want them to be. I think of this as sorta in the same category as gradient hacking.)

Reward is not the optimization target

I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.

I do think the human (within-lifetime) reward function has an outsized impact on what goals humans end up pursuing, although I acknowledge that it’s not literally the only thing that matters.

(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)

I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.

And another thing that perfect exploration would entail is trying every addictive drug (let’s say cocaine), lots of times, in which case reinforcement learning would lead to addiction.

So, just as the RL agent would (presumably) be designed to be able to make a foresighted decision not to try dropping an anvil on its head, that same design would also incidentally enable it to make a foresighted decision not to try taking lots of cocaine and getting addicted. (We expect it to make the latter decision because of instrumental convergence goal-preservation drive.) So it might wind up never wireheading, and if so, that would be intimately related to its incomplete exploration.

Reward is not the optimization target

Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don't pursue reward doesn't seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori---evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?

If you write code for a model-based RL agent, there might be a model that’s updated by self-supervised learning, and actor-critic parts that involve TD learning, and there’s stuff in the code that calculates the reward function, and other odds and ends like initializing the neural architecture and setting the hyperparameters and shuttling information around between different memory locations and so on.

  • On the one hand, “there is a lot of stuff going on” in this codebase.
  • On the other hand, I would say that this codebase is for “an RL agent”.
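To make the "lots of stuff, but still an RL agent" point concrete, here is a minimal sketch of such a codebase as a toy tabular agent. Everything in it is a hypothetical illustration: the 5-state corridor, the names `Agent`, `reward_function`, `env_step`, and `run`, and the hyperparameter values. The world model is learned by predicting the next state (self-supervised), the critic is learned by TD(0), and the actor plans one step ahead through the model.

```python
import random

N_STATES, ACTIONS = 5, (-1, +1)   # made-up corridor; actions step left / right

def reward_function(state):
    # The part of the code that calculates reward: 1 at the right-hand end.
    return 1.0 if state == N_STATES - 1 else 0.0

def env_step(state, action):
    return min(max(state + action, 0), N_STATES - 1)

class Agent:
    def __init__(self, lr=0.1, gamma=0.9, epsilon=0.1, seed=0):  # hyperparameters
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.rng = random.Random(seed)
        self.model = {}                  # world model: (state, action) -> next state
        self.critic = [0.0] * N_STATES   # state-value estimates

    def act(self, state):
        # Actor: one-step lookahead through the learned model, scored by
        # predicted reward plus the critic's discounted value estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        def score(action):
            predicted = self.model.get((state, action), state)
            return reward_function(predicted) + self.gamma * self.critic[predicted]
        best = max(score(a) for a in ACTIONS)
        return self.rng.choice([a for a in ACTIONS if score(a) == best])

    def learn(self, state, action, next_state):
        # Self-supervised model update: the "label" is the observed next state.
        self.model[(state, action)] = next_state
        # TD(0) critic update.
        target = reward_function(next_state) + self.gamma * self.critic[next_state]
        self.critic[state] += self.lr * (target - self.critic[state])

def run(episodes=200, max_steps=20):
    agent = Agent()
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = agent.act(state)
            next_state = env_step(state, action)
            agent.learn(state, action, next_state)
            state = next_state
            if state == N_STATES - 1:    # episode ends at the rewarded state
                break
    return agent

agent = run()
```

Only the two lines of the TD update are "RL" in the narrow sense; the model update, the planning step, the reward calculation, and the bookkeeping are all other stuff, yet the whole thing is naturally described as one RL agent.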

You use the word “pure” (“Humans do not appear to be purely RL agents…”), but I don’t know what that means. If a model-based RL agent involves self-supervised learning within the model, is it “impure”?? :-P

The thing I describe above is very roughly how I propose the human brain works—see Posts #2–#7 here. Yes it’s absolutely a “conjecture”—for example, I’m quite sure Steven Pinker would strongly object to it. Whether it’s “surprising a priori” or not goes back to whether that proposal is “entirely described by RL” or not. I guess you would probably say “no that proposal is not entirely described by RL”. For example, I believe there is circuitry in the brainstem that regulates your heart-rate, and I believe that this circuitry is specified in detail by the genome, not learned within a lifetime by a learning algorithm. (Otherwise you would die.) This kind of thing is absolutely part of my proposal, but probably not what you would describe as “pure RL”.

Brainstorm of things that could force an AI team to burn their lead

Every time this post says “To accomplish X, the code must be refactored”, I would say more pessimistically “To accomplish X, maybe the code must be refactored, OR, even worse, maybe nobody on the team has any viable plan for how to accomplish X at all, and the team burns its lead doing a series of brainstorming sessions or whatever.”

Reward is not the optimization target

I like this post, and basically agree, but it comes across somewhat more broad and confident than I am, at least in certain places.

I’m currently thinking about RL along the lines of Nostalgebraist here:

“Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer.

What’s more, even calling it a problem statement is misleading, because it’s (almost) the most general problem statement possible for any arbitrary task.

(Nostalgebraist, 2020)

If that’s right, then I am very reluctant to say anything whatsoever about “RL agents in general”. They’re too diverse.

Much of the post, especially the early part, reads (to me) like confident claims about all possible RL agents. For example, the excerpt “…reward is the antecedent-computation-reinforcer. Reward reinforces those computations which produced it.” sounds like a confident claim about all RL agents, maybe even by definition of “RL”. (If so, I think I disagree.)

But other parts of the post aren’t like that—for example, the “Does the choice of RL algorithm matter?” part seems more reasonable and hedged, and likewise there’s a mention of “real-world general RL agents” somewhere which maybe implies that the post is really only about that particular subset of RL agents, as opposed to all RL agents. (Right?)

For what it’s worth, I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI, because it seems like generally the best and only type of technique that can solve the technical problem that it solves. But that’s a tricky thing to be super-duper-confident about, especially in the big space of all possible RL algorithms.

Another example spot where I want to make a weaker statement than you: where you say “Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal”. I would instead say “Deep reinforcement learning agents will not NECESSARILY come to intrinsically and primarily value their reward signal”. Do you have an argument that categorically rules out this possibility? I don’t see it.

Response to Blake Richards: AGI, generality, alignment, & loss functions

Thanks for your comment! I think it’s slightly missing the point though. Let me explain.

One silly argument would be: “GPT-3 is pretty ‘general’, so we should call it ‘AGI’. And GPT-3 is not dangerous. Ergo ‘AGI’ is not dangerous”.

This is a silly argument because it’s just semantics. Agent-y-John-von-Neumann-AGI is possible, and it’s dangerous (i.e. prone to catastrophic out-of-control-misaligned-AGI accidents), and by default sooner or later somebody is going to build it (because it’s scientifically exciting, and there are many actors all over the world who can do so, etc.). That’s a real problem. Whether or not GPT-3 qualifies as “general” has nothing to do with that problem!

In right-column-vs-left-column terms, I claim there are systems (e.g. agent-y-John-von-Neumann-AGI) that are definitely firmly 100% in the right column in every respect, and I claim that such systems are super-dangerous, and that people will nevertheless presumably start messing around with them anyway at some point. Meanwhile, in other news, we can also imagine systems that are both safe and arguably have certain right-column aspects. Maybe language models are an example. OK sure, that’s possible. But those aren’t the systems I want to talk about here.

OK, then a more sophisticated argument would be: “Future language models will be both safe and super-duper-powerful, indeed so powerful that they will change the world, and indeed they’ll change it so much that it’s no use thinking ahead further than that step. Instead, we can basically delegate the problem of ‘what is to be done about people making dangerous agent-y-John-von-Neumann-AGI’ to our AI-empowered descendants [or AI-empowered future selves, depending on your preferred timelines]. Let them figure it out!”

A priori, this could be true, but I happen to think it’s false, for reasons that I won’t get into here. Instead, I think future language models will be moderately useful for future humans—just as computers and zoom and arxiv and github and so on are moderately useful for current humans. (Language models might be useful for AGI safety research even today, for all I know. I personally found GPT-3-assisted-brainstorming to be unhelpful when I tried it, but I didn’t try very hard, and that was a whole year ago, i.e. ancient history by language model standards.) I don’t think future language models will be so radically transformative as to significantly change our overall situation with respect to the problem of future people building agent-y-John-von-Neumann-AGIs.

(Or if they do get that radically transformative, I think it would be because future programmers, with new insights, found a way to turn language models into something more like an agent-y-John-von-Neumann-AGI—and in particular, something comparably dangerous to agent-y-John-von-Neumann-AGI.)
