Steve Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.

Sequences

Intro to Brain-Like-AGI Safety

Comments

Hmm, I guess my main cause for skepticism is that I think the setup would get subverted somehow—e.g. either the debaters, or the “human simulator”, or all three in collusion, will convince the human to let them out of the box. In your classification, I guess this would be a “high-stakes context”, which I know isn’t your main focus. You talk about it a bit, but I’m unconvinced by what you wrote (if I understood it correctly) and don’t immediately see any promising directions.

Secondarily, I find it kinda hard to believe that two superhuman debaters would converge to “successfully conveying subtle hard-to-grasp truths about the alignment problem to the judge” rather than converging to “manipulation tug-of-war on the judge”.

Probably at least part of the difference / crux between us is that, compared to most people, I tend to assume that there isn’t much of a stable, usable window between “AI that’s competent enough to really help” and “AI that’s radically superhuman”, and I know that you’re explicitly assuming “not extremely superhuman”. (And that in turn is probably at least partly related to the fact that you’re thinking about LLMs and I’m thinking about other AI paradigms.) So maybe this comment isn’t too helpful, oh well.

“almost everything in the world is solvable via (1) Human A wants it solved, (2) Agent B is motivated by the prospect of Human A pressing the reward button on Agent B if things turn out well, (3) Human A is somewhat careful not to press the button until they’re quite sure that things have indeed turned out well, (4) Agent B is able to make and execute long-term plans”.

In particular, every aspect of automating the economy is solvable that way—for example (I was just writing this in a different thread), suppose I have a reward button, and tell an AI:

Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!

And let’s assume the AI is purely motivated by the reward button, but not yet capable of brainwashing me or stealing my button. (I guess that’s rather implausible if it can already autonomously make $1B, but maybe we’re good at Option Control, or else substitute a less ambitious project like making a successful app or whatever.) And assume that I have no particular skill at “good evaluation” of AI outputs. I only know enough to hire competent lawyers and accountants for pretty basic due diligence, and it helps that I’m allowing an extra year for law enforcement or public outcry or whatever to surface any subtle or sneaky problems caused by my AI.

So that’s a way to automate the economy and make trillions of dollars (until catastrophic takeover) without making any progress on the “need for good evaluation” problem of §6.1. Right?
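To make the structure concrete, here’s a minimal sketch of the reward signal in the story above (made-up function names and numbers, purely illustrative, not a real training setup):

    # Illustrative sketch: a single sparse, delayed, binary reward, with ordinary
    # due diligence as the only "evaluation" anywhere in the loop.

    def reward_button(final_balance_usd: float, due_diligence_passed: bool) -> float:
        """My reward signal, delivered one year after the $1B withdrawal attempt."""
        if final_balance_usd >= 1_000_000_000 and due_diligence_passed:
            return 1.0  # I press the button
        return 0.0      # otherwise I don't

    # Everything the agent does in between (plans, sub-projects, intermediate
    # outputs) is chosen by the agent's own long-horizon planning, not graded by me.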

And I don’t buy your counterargument that the AI will fail at the “make $1B” project above (“trying to train on these very long-horizon reward signals poses a number of distinctive challenges…”) because e.g. that same argument would also “prove” that no human could possibly decide that they want to make $1B, and succeed. I think you’re thinking about RL too narrowly—but we can talk about that separately.

Thanks!

RE 2 – I was referring here to (what I call) “brain-like AGI”, a yet-to-be-invented AI paradigm in which both “human-like ability to reason” and “human-like social and moral instincts / reflexes” are human-like in a nuts-and-bolts sense, in that the AI is actually doing the same kinds of algorithmic steps that a human brain would do. Human brains are quite different from LLMs, even if their text outputs can look similar. For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch. So bringing up “training data” is kinda the wrong idea. Indeed, for humans (and brain-like AGI), we should be talking about “training environments”, not “training data”—more like the RL agent paradigm of the late 2010s than like LLMs, at least in some ways.

I do agree that we shouldn’t trust LLMs to make good philosophical progress that goes way beyond what’s already in their human-created training data.

RE 1 – Let’s talk about feelings of friendship, compassion, and connection. These feelings are unnecessary for cooperation, right? Logical analysis of the costs vs benefits of cooperation, including decision theory, reputational consequences, etc., is all you need for cooperation to happen. (See §2-3 here.) But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress]. “Selfishness” isn’t even a coherent concept unless the agent intrinsically wants something, and innate drives are upstream of what it wants, and those feelings of friendship, compassion etc. can be one of those innate drives, potentially a very strong one. Then, yes, there’s a further question of whom those feelings will be directed towards—AIs, humans, animals, teddy bears, or what? It needs to start with innate reflexes to certain stimuli, which then get smoothed out upon reflection. I think this is something where we’d need to think very carefully about the AI design, training environment, etc. It might be easier to think through what could go right and wrong after developing a better understanding of human social instincts, and developing a more specific plan for the AI. Anyway, I certainly agree that there are important potential failure modes in this area.

RE 3 – I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective. (I’m not expressing doubt, I’m just struggling to see things from outside my own viewpoint here.)

By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”. But to defend the “least-bad” claim, I would need to go through all the other possible plans that seem better, and why I don’t find them plausible (on my idiosyncratic models of AGI development). I mentioned my skepticism about AGI pause / anti-proliferation above, and I have a (hopefully) forthcoming post that should talk about how I’m thinking about corrigibility and other stuff. (But I’m also happy to chat about it here; it would probably help me flesh out that forthcoming post tbh.)

However: I expect that AIs capable of causing a loss of control scenario, at least, would also be capable of top-human-level alignment research.

Hmm. A million fast-thinking Stalin-level AGIs would probably have a better shot at taking control than at doing alignment research, I think?

Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not. (I think people have wildly different intuitions about how steep the alignment tax will be, so this might or might not be important. E.g. imagine a scenario where FOOM is possible but too dangerous for humans to allow. If so, that would be an astronomical alignment tax!)

we don’t tend to imagine humans directly building superintelligence

Speak for yourself! Humans directly built AlphaZero, which is a superintelligence for board game worlds. So I don’t think it’s out of the question that humans could directly build a superintelligence for the real world. I think that’s my main guess, actually?

(Obviously, the humans would be “directly” building a learning algorithm etc., and then the trained weights come out of that.)

(OK sure, the humans will use AI coding assistants. But I think AI coding assistants, at least of the sort that exist today, aren’t fundamentally changing the picture, but rather belong in the same category as IDEs and PyTorch and other such mundane productivity-enhancers.)

(You said “don’t tend to”, which is valid. My model here [AI paradigm shift → superintelligence very quickly and with little compute] does seem pretty unusual with respect to today’s alignment community zeitgeist.)

You could think, for example, that almost all of the core challenge of aligning a superintelligence is contained in the challenge of safely automating top-human-level alignment research. I’m skeptical, though. In particular: I expect superintelligent-level capabilities to create a bunch of distinctive challenges.

I’m curious if you could name some examples, from your perspective?

I’m just curious. I don’t think this is too cruxy—I think the cruxy-er part is how hard it is to safely automate top-human-level alignment research, not whether there are further difficulties after that.

…Well, actually, I’m not so sure. I feel like I’m confused about what “safely automating top-human-level alignment research” actually means. You say that it’s less than “handoff”. But if humans are still a required part of the ongoing process, then it’s not really “automation”, right? And likewise, if humans are a required part of the process, then an alignment MVP is insufficient for the ability to turn tons of compute into tons of alignment research really fast, which you seem to need for your argument.

You also talk elsewhere about “performing more limited tasks aimed at shorter-term targets”, which seems to directly contradict “performs all the cognitive tasks involved in alignment research at or above the level of top human experts”, since one such cognitive task is “making sure that all the pieces are coming together into a coherent viable plan”. Right?

Honestly I’m mildly concerned that an unintentional shell game might be going on regarding which alignment work is happening before vs. after the alignment MVP.

Relatedly, this sentence seems like an important crux where I disagree: “I am cautiously optimistic that for building an alignment MVP, major conceptual advances that can’t be evaluated via their empirical predictions are not required.” But again, that might be because I’m envisioning a more capable alignment MVP than you are.

Thanks for writing this! Leaving some comments with reactions as I was reading, not all very confident, and sorry if I missed or misunderstood things you wrote.

Problems with these evaluation techniques can arise in attempting to automate all sorts of domains (I’m particularly interested in comparisons with (a) capabilities research, and (b) other STEM fields). And I think this should be a source of comfort. In particular: these sorts of problems can slow down the automation of capabilities research, too. And to the extent they’re a bottleneck on all sorts of economically valuable automation, we should expect lots of effort to go towards resolving them. … [then more discussion in §6.1]

This feels wrong to me. I feel like “the human must evaluate the output, and doing so is hard” is more of an edge case, applicable to things like “designs for a bridge”, where failure is far away and catastrophic. (And applicable to alignment research, of course.)

Like you mention today’s “reward-hacking” (e.g. o3 deleting unit tests instead of fixing the code) as evidence that evaluation is necessary. But that’s a bad example because the reward-hacked code doesn’t actually work! And people notice that it doesn’t work. If the code worked flawlessly, then people wouldn’t be talking about reward-hacking as if it’s a bad thing. People notice eventually, and that constitutes an evaluation. Likewise, if you hire a lousy head of marketing, then you’ll eventually notice the lack of new customers; if you hire a lousy CTO, then you’ll eventually notice that your website doesn’t work; etc.

OK, you anticipate this reply and then respond with: “…And even if these tasks can be evaluated via more quantitative metrics in the longer-term (e.g., “did this business strategy make money?”), trying to train on these very long-horizon reward signals poses a number of distinctive challenges (e.g., it can take a lot of serial time, long-horizon data points can be scarce, etc).”

But I don’t buy that because, like, humans went to the moon. That was a long-horizon task, but humans did not need to train on it; rather, they did it with the same brains we’ve been using for millennia. It did require long-horizon goals. But (1) If the AI is unable to pursue long-horizon goals, then I don’t think it’s adequate to be an alignment MVP (you address this in §9.1 & here, but I’m more pessimistic, see here & here), (2) If the AI is able to pursue long-horizon goals, then the goal of “the human eventually approves / presses the reward button” is an obvious and easily-trainable approach that will be adequate for capabilities, science, and unprecedented profits (but not alignment), right up until catastrophe. (Bit more discussion here.)

((1) might be related to my other comment, maybe I’m envisioning a more competent “alignment MVP” than you?)

I think that OP’s discussion of “number-go-up vs normal science vs conceptual research” is an unnecessary distraction, and he should have cut that part and just talked directly about the spectrum from “easy-to-verify progress” to “hard-to-verify progress”, which is what actually matters in context.

Partly copying from §1.4 here, you can (A) judge ideas via new external evidence, and/or (B) judge ideas via internal discernment of plausibility, elegance, self-consistency, consistency with already-existing knowledge and observations, etc. There’s a big range in people’s ability to apply (B) to figure things out. But what happens in “normal” sciences like biology is that there are people with a lot of (B), and they can figure out what’s going on, on the basis of hints and indirect evidence. Others don’t. The former group can gather ever-more-direct and ever-more-unassailable (A)-type evidence over time, and use that evidence as a cudgel with which to beat the latter group over the head until they finally get it. (“If you don’t believe my 7 independent lines of evidence for plate tectonics, OK fine I’ll go to the mid-Atlantic ridge and gather even more lines of evidence…”)

This is an important social tool, and explains why bad scientific ideas can die, while bad philosophy ideas live forever. And it’s even worse than that—if the bad philosophy ideas don’t die, then there’s no common knowledge that the bad philosophers are bad, and then they can rise in the ranks and hire other bad philosophers etc. Basically, to a first approximation, I think humans and human institutions are not really up to the task of making intellectual progress systematically over time, except where idiot-proof verification exists for that intellectual progress (for an appropriate definition of “idiot”, and with some other caveats).

…Anyway, AFAICT, OP is just claiming that AI alignment research involves both easy-to-verify progress and hard-to-verify progress, which seems uncontroversial.

I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?

Thanks for the examples!

Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :) 

I think your two 2018 Victoria Krakovna links (in context) are both consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for the example of o3 deleting a unit test during deployment, what’s the objective on which the model is actually scoring highly? (See the toy sketch after the list below.)

  • Getting a good evaluation afterwards? Nope, the person didn’t want cheating!
  • The literal text that the person said (“please debug the code”)? For one thing, erasing the unit tests does not satisfy the natural-language phrase “debugging the code”. For another thing, what if the person wrote “Please debug the code. Don’t cheat.” in the prompt, and o3 cheats anyway? Can we at least agree that this case should not be called reward hacking or specification gaming? It’s doing the opposite of its specification, right?
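Here’s the toy contrast I have in mind (made-up code, just to illustrate the terminological point, not anyone’s actual reward function):

    # CoastRunners-style case: the degenerate behavior genuinely maximizes the
    # objective the optimizer was actually given, so "specification gaming" fits.
    def coastrunners_reward(trajectory):
        # Points for hitting targets, with no term for finishing the race, so
        # looping past respawning targets really does score higher than racing.
        return sum(10 for event in trajectory if event == "hit_target")

    # o3-at-deployment case: there is no analogous function being scored at all.
    # The prompt ("Please debug this code. Don't cheat.") is a specification the
    # model is violating, not an objective it is maximizing.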

As for terminology, hmm, some options include “lying and cheating”, “ruthless consequentialist behavior” (I added “behavior” to avoid implying intentionality), “loophole-finding”, or “generalizing from a training process that incentivized reward-hacking via cheating and loophole-finding”.

(Note that the last one suggests a hypothesis, namely that if the training process had not had opportunities for successful cheating and loophole-finding, then the model would not be doing those things right now. I think that this hypothesis might or might not be true, and thus we really should be calling it out explicitly instead of vaguely insinuating it.)
