All of Steven Byrnes's Comments + Replies

There's something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus... it's not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me woul

... (read more)

While it's obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that "the human brain is not just LLM-like prediction", while you seem to be saying that "the human brain does not do LLM-like prediction at all". (Of course, "LLM-like prediction" is a vague concept and maybe we're just using it differently and ultimately agree.)

I disagree about whether that distinction matters:

I think technical discussions of A... (read more)

(Sorry in advance if this whole comment is stupid, I only read a bit of the report.)

As context, I think the kind of technical plan where we reward the AI for (apparently) being helpful is at least not totally doomed to fail. Maybe I’m putting words in people’s mouths, but I think even some pretty intense doomers would agree with the weak statement “such a plan might turn out OK for all we know” (e.g. Abram Demski talking about a similar situation here, Nate Soares describing a similar-ish thing as “maybe one nine” a.k.a. a mere 90% chance that it would fai... (read more)

3 · Joe Carlsmith · 11d
Agents that end up intrinsically motivated to get reward on the episode would be "terminal training-gamers/reward-on-the-episode seekers," and not schemers, on my taxonomy. I agree that terminal training-gamers can also be motivated to seek power in problematic ways (I discuss this in the section on "non-schemers with schemer-like traits"), but I think that schemers proper are quite a bit scarier than reward-on-the-episode seekers, for reasons I describe here.
1 · Rubi Hudson · 21d
I don't find goal misgeneralization vs. schemers to be as much of a dichotomy as this comment makes it out to be. While they may be largely distinct for the first period of training, the current rollout method for state-of-the-art models seems to be "give a model situational awareness and deploy it to the real world, use this to identify alignment failures, retrain the model, repeat steps 2 and 3". If you consider this all part of the training process (and I think that's a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.

GPT-4 is different from APTAMI. I'm not aware of any method that starts with movies of humans, or human-created internet text, or whatever, and then does some kind of ML, and winds up with a plausible human brain intrinsic cost function. If you have an idea for how that could work, then I'm skeptical, but you should tell me anyway. :)

“Extract from the brain” how? A human brain has like 100 billion neurons and 100 trillion synapses, and they’re generally very difficult to measure, right? (I do think certain neuroscience experiments would be helpful.) Or do you mean something else?

I meant "extract" more figuratively than literally. For example, GPT-4 seems to have acquired some ability to do moral reasoning in accordance with human values. This is one way to (very indirectly) "extract" information from the human brain.

I would say “the human brain’s intrinsic-cost-like-thing is difficult to figure out”. I’m not sure what you mean by “…difficult to extract”. Extract from what?

Extract from the brain into, say, weights in an artificial neural network, lines of code, a natural language "constitution", or something of that nature.

The “similar reason as why I personally am not trying to get heroin right now” is “Example 2” here (including the footnote), or a bit more detail in Section 9.5 here. I don’t think that involves an idiosyncratic anti-heroin intrinsic cost function.

The question “What is the intrinsic cost in a human brain” is a topic in which I have a strong personal interest. See Section 2 here and links therein. “Why don’t humans have an alignment problem” is sorta painting the target around the arrow I think? Anyway, if you radically enhanced human intelligence and let t... (read more)

So, the human brain's pseudo-intrinsic cost is not intractably complex, on your view, but difficult to extract.

I agree that it would be nice to get to a place where it is known (technically) how to make a kind AGI, but nobody knows how to make an unkind AGI. That's what you're saying, right? If so, yes that would be nice, but I see it as extraordinarily unlikely. I’m optimistic that there are technical ways to make a kind AGI, but I’m unaware of any remotely plausible approach to doing so that would not be straightforwardly modifiable to turn it into a way to make an unkind AGI.

3 · Luke H Miles · 4mo
It is just as ambitious/implausible as you say. I am hoping to get out some rough ideas in my next post anyways.

I’m not sure what you think my expectations are. I wrote “I am not crazy to hope for whole primate-brain connectomes in the 2020s and whole human-brain connectomes in the 2030s, if all goes well.“ That’s not the same as saying “I expect those things”; it’s more like “those things are not completely impossible”. I’m not an expert but my current understanding is (1) you’re right that existing tech doesn’t scale well enough (absent insane investment of resources), (2) it’s not impossible that near-future tech could scale much better than current tech. I’... (read more)

I find it interesting how he says that there is no such thing as AGI, but acknowledges that machines will "eventually surpass human intelligence in all domains where humans are intelligent", even though that would meet most people's definition of AGI.

The somewhat-reasonable-position-adjacent-to-what-Yann-believes would be: “I don’t like the term ‘AGI’. It gives the wrong idea. We should use a different term instead. I like ‘human-level AI’.”

I.e., it’s a purely terminological complaint. And it’s not a crazy one! Lots of reasonable people think that “AGI” was a poorly... (read more)

I think that if you read the later Intro to Brain-Like AGI Safety series, then the only reason you might want to read this post (other than historical interest) is that the section “Dopamine category #2: RPE for “local” sub-circuit rewards” is talking about a topic that was omitted from Intro to Brain-Like AGI Safety (for brevity).

For example, practically everything I said about neuroanatomy in this post is at least partly wrong and sometimes very wrong. (E.g. the “toy loop model” diagrams are pretty bad.) The “Finally, the “prediction” part of reward pred... (read more)

Follow-up: More specific example of classic goal misgeneralization in the context of process-based supervision (more details here):

Suppose the AI is currently thinking about what to propose for step 84 of an ongoing plan to found a company. Ideally, according to the philosophy of process-based supervision, the only thing the AI “cares about” right now is getting a good score in the event that this particular proposal gets immediately audited. But the AI can't read the supervisor's mind, so figuring out how to get a good score in that hypothetical audit can... (read more)

Just noting that these seem like valid points! (Apologies for slow reply!) 

This post doesn’t talk about classic goal misgeneralization, e.g. the CoinRun paper (see also Rob Miles explainer). If X = “get the coin” is what we want the AI to desire, and Y = “get to the right side” is something else, but Y and X are correlated in the training distribution, then we can wind up with an AI that’s trying to do Y rather than X, i.e. misalignment. (Or maybe it desires some mix of X & Y. But that counts as misalignment too.)

That problem can arise without the AI having any situational awareness, or doing any “obfuscation”. The AI doesn’t... (read more)
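The failure mode can be seen even in a tabular toy (a hypothetical sketch, not the actual CoinRun setup): train a Q-learner in a gridworld where the coin always sits at the rightmost cell, so "get the coin" (X) and "go right" (Y) are indistinguishable in the training distribution. Because the learned policy only conditions on position, moving the coin at test time changes nothing; the agent still heads right.

```python
import random

random.seed(0)

# Toy sketch of goal misgeneralization (illustrative, not the real CoinRun):
# a 1-D gridworld of length 5 where the coin sits at cell 4 during training,
# so "get the coin" and "go right" give identical reward signals.
def train_q(episodes=3000, length=5, coin=4):
    q = {(s, a): 0.0 for s in range(length) for a in (-1, +1)}
    for _ in range(episodes):
        s = random.randrange(length)
        for _ in range(20):
            a = random.choice((-1, +1))            # random behavior policy
            s2 = min(max(s + a, 0), length - 1)
            r = 1.0 if s2 == coin else 0.0         # reward: touched the coin
            q[(s, a)] += 0.1 * (r + 0.9 * max(q[(s2, -1)], q[(s2, +1)]) - q[(s, a)])
            if r:
                break
            s = s2
    return q

q = train_q()
greedy = lambda s: max((-1, +1), key=lambda a: q[(s, a)])

# The state never encodes the coin's location, so if the coin moves to cell 0
# at deployment, the greedy policy from the middle still goes right:
print(greedy(2))  # 1, i.e. "go right": it learned Y, not X
```

The point isn't that tabular Q-learning is a good model of deep RL; it's that nothing in training distinguishes the two goals, so which one gets learned is underdetermined by the reward signal.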

5 · Steve Byrnes · 4mo
Follow-up: More specific example of classic goal misgeneralization in the context of process-based supervision (more details here):

Suppose the AI is currently thinking about what to propose for step 84 of an ongoing plan to found a company. Ideally, according to the philosophy of process-based supervision, the only thing the AI “cares about” right now is getting a good score in the event that this particular proposal gets immediately audited. But the AI can't read the supervisor's mind, so figuring out how to get a good score in that hypothetical audit can be kinda tricky. The AI will presumably take advantage of whatever imperfect heuristics / proxies it can find, when they help it get a higher score.

And here's one such proxy: “If the proposed step is actually carried out, will it help the project in the long-term?” This is not a perfect proxy for success-if-audited! Two reasons it's not perfect are: (1) the supervisor could be mistaken about whether it will help the project or not; (2) the supervisor presumably cares about other things unrelated to project success, like following social norms. But “it's not a perfect proxy” is different from “it is totally useless as a proxy”. It is in fact useful, and indeed probably irreplaceably useful, in the sense that ignoring that information would probably make the AI’s prediction of audit results strictly worse.

So we should expect the AI to not only try to guess the long-term real-world consequences of its proposals (if carried out), but also to “care about” those consequences, in the sense of allowing those guesses to shape what proposals it offers in the first place. OK, so now we have an AI that (among other things) “cares about” stuff that will happen in the future, well after the current step. And that “caring” is actually helping the AI to get better scores. So we’re (unintentionally) training that caring into the AI, not training it out, even in the absence of any AI obfuscation. It's still a misaligned mot

Seems false in RL, for basically the reason you said (“it’s not clear how to update a model towards performing the task if it intentionally tries to avoid showing us any task-performing behavior”). In other words, if we’re doing on-policy learning, and if the policy never gets anywhere close to a reward>0 zone, then the reward>0 zone isn’t doing anything to shape the policy. (In a human analogy, I can easily avoid getting addicted to nicotine by not exposing myself to nicotine in the first place.)

I think this might be a place where people-thinking-ab... (read more)
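The on-policy point above can be made concrete with a toy softmax policy (illustrative numbers only, not a model of any real training run): if the policy assigns vanishing probability to the high-reward action, on-policy rollouts essentially never visit it, so its reward never appears in the data that shapes the policy.

```python
import math
import random

random.seed(0)

# Two-action softmax policy, "work" vs "nicotine". The nicotine "reward zone"
# pays 10x more, but the initial preferences give it probability ~2e-9, so
# on-policy data collection essentially never samples it. Data the policy
# never generates can't shape the policy.
pref = {"work": 10.0, "nicotine": -10.0}

def act():
    z = {a: math.exp(p) for a, p in pref.items()}
    total = sum(z.values())
    r = random.random() * total
    for a, w in z.items():
        r -= w
        if r <= 0:
            return a
    return a  # numerical edge case: return last action

counts = {"work": 0, "nicotine": 0}
for _ in range(100_000):
    counts[act()] += 1

print(counts["nicotine"])  # 0: the high-reward zone was never sampled
```

With these preferences, p(nicotine) = e^(-10) / (e^(10) + e^(-10)) ≈ 2e-9, so even 100,000 rollouts are expected to sample it roughly 0.0002 times.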

investing in government money market funds earning ~5% rather than 0% interest checking accounts

It’s easier than that—there are high-interest-rate, free, FDIC-eligible checking accounts. A rate-comparison site has a good list, although you might need to be a member to view it. As of this moment (2023-07-20), the top of their leaderboard is: Customers Bank (5.20% APY), BankProv (5.15%), BrioDirect (5.06%), UFB Direct (5.06%).
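For scale, here's the yearly interest at the quoted APYs on an illustrative balance (APY already folds in compounding, so annual interest is just balance × APY; the $250k figure is only an example):

```python
# Yearly interest at the quoted APYs; the balance is illustrative.
balance = 250_000
for name, apy in [("Customers Bank", 0.0520),
                  ("BankProv", 0.0515),
                  ("0% checking", 0.0)]:
    print(f"{name}: ${balance * apy:,.0f}/yr")
```

So the gap between a 0% checking account and the leaderboard is on the order of $13,000/yr per $250k held.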

Thanks, that's a good link. In our case our assets significantly exceed the FDIC $250k insurance limit and there are operational costs to splitting assets across a large number of banks. But a high-interest checking account could be a good option for many small orgs.

I was just trying to replace “reward” by “reinforcement”, but hit the problem that “negative reward” makes sense, but behaviorist terminology is such that “reinforcement” is always after a good thing happens, including “negative reinforcement”, which would be a kind of positive reward that entails removing something aversive. The behaviorists use the word “punishment” for “negative reward”. But “punishment” has all the same downsides as “reward”, so I assume you’re also opposed to that. Unfortunately, if I avoid both “punishment” and “reward”, then it seems I have no way to unambiguously express the concept “negative reward”.

So “negative reward” it is. ¯\_(ツ)_/¯
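For reference, the behaviorist 2×2 that creates the trap described above, written as a lookup table (a summary of standard terminology, not part of the original comment):

```python
# (what happens to the stimulus, effect on behavior) -> behaviorist term
terminology = {
    ("stimulus added",   "behavior increases"): "positive reinforcement",  # ~ positive reward
    ("stimulus removed", "behavior increases"): "negative reinforcement",  # also ~ positive reward!
    ("stimulus added",   "behavior decreases"): "positive punishment",     # ~ negative reward
    ("stimulus removed", "behavior decreases"): "negative punishment",     # also ~ negative reward
}
print(terminology[("stimulus removed", "behavior increases")])
```

Note that "positive"/"negative" track whether a stimulus is added or removed, not the sign of the reward, which is exactly why "negative reinforcement" fails as a synonym for "negative reward".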

4 · Alex Turner · 5mo
Yeah, seems tough to avoid "reward" in that situation. Thanks for pointing this out.

Nice interview, kudos to you both!

One is a bunch of very simple hardwired genomically-specified reward circuits over stuff like your sensory experiences or simple correlates of good sensory experiences. 

I just want to flag that the word “simple” is contentious in this context. The above excerpt isn’t a specific claim (how simple is “simple”, and how big is “a bunch”?) so I guess I neither agree nor disagree with it as such. But anyway, my current guess (see here) is that reward circuits might effectively comprise tens of thousands of lines of pseudoco... (read more)

Thanks for your detailed comments!

you have many cortices

This part isn’t quite right. Here’s some background if it helps.

Part of your brain is a big sheet of gray matter called “the cortex”. In humans, the sheet gets super-crumpled up in the brain, so much so that it’s easy to forget that it’s a single contiguous sheet in the first place. Also in humans, the sheet gets so big that the outer edges of it wind up curved up underneath the center part, kinda like the top of a cupcake or muffin that overflows its paper wrapper.

(See here if you can’t figure out what I’m talking about with the cupc... (read more)

OK! I think I’m on board now.

Let me try to explain “process-based feedback” from first principles in my own words.

We have a problem: if an agent wants to do X in the real world, dastardly real-world power-seeking actions are probably helpful for that.

The very hard manifestation of this problem is: there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training, and then seemingly out of nowhere it does a power-seeking treacherous turn (maybe it outputs “Help me help me I’m suffering!”... (read more)

This sounds right to me! Only note is that I think the setup can be simplified a bit. The central idea I have in mind is that the AI does something like:

  1. "Think" about what to do next, for up to some max period of time ("what to do next" can be "think more, with prompt X").
  2. Do it.
  3. Repeat.

This seems like a pretty natural way for an "agent" to operate, and then every #1 is an "auditable step" in your terminology. (And the audits are done by comparing a few rollouts of that step, and performing gradient descent without executing any of them.) There are probably subtleties I'm missing, but I think this points pretty well at what I tend to think of as the hopes of process-based supervision.
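A minimal sketch of that loop in Python (every name here, propose_steps, rate_step, update_policy_toward, the audit probability, is a hypothetical stand-in, not anyone's actual implementation): the supervision signal touches only randomly audited "think" steps, and ratings are assigned without executing the candidates or observing outcomes.

```python
import random

random.seed(0)

def propose_steps(context, k=3):
    # stand-in for the policy proposing k candidate next steps
    return [f"option {i} after {context!r}" for i in range(k)]

def rate_step(step):
    # stand-in for an auditor rating how good a step *seems*,
    # without executing it or seeing how it turns out
    return random.random()

def update_policy_toward(step):
    # stand-in for a gradient update toward the best-seeming candidate;
    # note it never receives any outcome information
    pass

def run_agent(task, horizon=5, audit_prob=0.3):
    context = task
    for _ in range(horizon):
        candidates = propose_steps(context)
        if random.random() < audit_prob:       # randomly audited step
            ratings = [rate_step(c) for c in candidates]
            best = max(zip(ratings, candidates))[1]
            update_policy_toward(best)         # the only training signal
        else:
            best = candidates[0]
        context = best                         # "do it", then repeat
    return context

result = run_agent("found a company")
```

The structural point is visible in the code: outcomes of executed steps feed into `context` (the agent's situation) but never into the training signal, which only sees audit-time ratings of proposed steps.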

OK, I’ll try to construct an example of process-based supervision without boxing, and argue that it fails the criterion of “never giving gradient updates toward doing the dastardly stuff”.

We give our AI unfettered unmonitored internet access. We say “We are a hardware R&D firm, and we would like to develop a less-expensive LiDAR unit.” The AI does some internet searches and reads some books and outputs:

“My first step is to list out different LiDAR approaches on the market (and under development), and try to estimate their manufacturing cost breakdowns,... (read more)

As far as I can tell, this example is hard to fit into the more specific description I gave of process-based supervision way upthread - specifically, see step 4.

In this case, let's presume that we are auditing a step that is taking place after the AI got the paywalled textbook. In this case, it might propose two different next steps that are both informed by the paywalled textbook, and we pick the one we like better. This doesn't seem to differentially reinforce the behavior of going for the paywalled textbook. (Whereas if we are auditing an earlier stage where it was considering going for the paywalled textbook, we can reinforce against this.)

Sorry. Thanks for your patience. When you write:

Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.

…I don’t know what a “step” is.

As above, if I sit on my couch staring into space brainstorming for an hour and then write down ... (read more)

I'm not intending to use Def'n 2 at all. The hope here is not that we can "rest assured that there is no dangerous consequentialist means-end reasoning" due to e.g. it not fitting into the context in question. The hope is merely that if we don't specifically differentially reinforce unintended behavior, there's a chance we won't get it (even if there is scope to do it).

I see your point that consistently, effectively "boxing" an AI during training could also be a way to avoid reinforcing behaviors we're worried about. But they don't seem the same to me: I think you can get the (admittedly limited) benefit of process-based supervision without boxing. Boxing an AI during training might have various challenges and competitiveness costs. Process-based supervision means you can allow an unrestricted scope of action, while avoiding specifically reinforcing various unintended behaviors. That seems different from boxing.

OK, I think this is along the lines of my other comment above:

I also think it’s good (for safety) to try to keep the AI from manipulating the real world and seeing the consequences within a single “AI does some stuff” step, i.e. Example 2 is especially bad in a way that neither Examples 0 nor 1 are. I think we’re in agreement here too.

Most of your reply makes me think that what you call “process-based supervision” is what I call “Put the AI in a box, give it tasks that it can do entirely within the box, prevent it from escaping the box (and penalize it if ... (read more)

I think it is not at all about boxing - I gave the example I did to make a clear distinction with the "number of steps between audits" idea. For the distinction with boxing, I'd focus on what I wrote at the end: "The central picture of process-based feedback isn’t either of these, though - it’s more like 'Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.'"

Hmm. I think “process-based” is a spectrum rather than a binary.

Let’s say there’s a cycle:

  • AI does some stuff P1
  • and then produces a human-inspectable work product O1
  • AI does some stuff P2
  • and then produces a human-inspectable work product O2

There’s a spectrum based on how long each P cycle is:

Example 1 (“GPT with process-based supervision”):

  • “AI does some stuff” is GPT-3 running through 96 serial layers of transformer-architecture computations.
  • The “human-inspectable work product” is GPT-3 printing a token and we can look at it and decide if we’re happy about
... (read more)
I don't think of process-based supervision as a totally clean binary, but I don't think of it as just/primarily being about how many steps you allow in between audits. I think of it as primarily being about whether you're doing gradient updates (or whatever) based on outcomes (X was achieved) or processes (Y seems like a well-reasoned step to achieve X). I think your "Example 0" isn't really either - I'd call it internals-based supervision. I agree it matters how many steps you allow in between audits, I just think that's a different distinction.

Here's one way to think about it. Say option A is to give the AI 10 years to produce a plan to make money (with a very restricted action space, e.g. searching and reading lit from an airgapped database, and maintaining a scratchpad), then rate whether the plan seems like a good (ethically as well as practically) plan to make money; whereas option B is to audit the AI every hour, based on how much money is in a bank account. Here option B has shorter cycles, but also seems more dangerous. Does that help illustrate the difference I'm trying to point at?

(The central picture of process-based feedback isn't either of these, though - it's more like "Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could've taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.")

I think I see where you’re coming from but I generally have mixed feelings, and am going back and forth but leaning towards sticking with textbook terminology for my part.

Once we fix the policy network and sampling procedure, we get a mapping from observations…to probability distributions over outputs…. This mapping  is the policy.…

Of course, a policy could in fact be computed using internal planning (e.g. depth-3 heuristic search) to achieve an internally represented goal (e.g. number of diamonds predicted to be present.) I think it's appropria

... (read more)
4 · Alex Turner · 6mo
I think there's a way better third alternative: asking each reader to unilaterally switch to "policy." No coordination, no constant reminders, no communication difficulties (in my experience). I therefore don't see a case for using "agent" in the mentioned cases.  I added to the post:

I think the basic idea of instrumental convergence is just really blindingly obvious, and I think it is very annoying that there are people who will cluck their tongues and stroke their beards and say "Hmm, instrumental convergence you say? I won't believe it unless it is in a very prestigious journal with academic affiliations at the top and Computer Modern font and an impressive-looking methods section."

I am happy that your papers exist to throw at such people.

Anyway, if optimal policies tend to seek power, then I desire to believe that optimal policies ... (read more)

So this basic idea: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result. 

Ohh, sorry you had to tell me twice, but maybe I’m finally seeing where we’re talking past each other.

Back to the OP, you wrote:

  • In training, an AI system gets tasks of the form “Produce a plan to accomplish X that looks good to humans” (not tasks of the form “accomplish X”).
  • The AI system is rewarded based on whether the plan makes sense and looks good to humans - not how
... (read more)
Hm, I think we are probably still missing each other at least somewhat (and maybe still a lot), because I don't think the interpretability bit is important for this particular idea - I think you can get all the juice from "process-based supervision" without any interpretability.

I feel like once we sync up you're going to be disappointed, because the benefit of "process-based supervision" is pretty much just that you aren't differentially reinforcing dangerous behavior. (At worst, you're reinforcing "Doing stuff that looks better to humans than it actually is." But not e.g. reward hacking.)

The question is, if you never differentially reinforce dangerous unintended behavior/aims, how does dangerous behavior/aims arise? There are potential answers - perhaps you are inadvertently training an AI to pursue some correlate of "this plan looks good to a human," leading to inner misalignment - but I think that most mechanistic stories you can tell from the kind of supervision process I described (even without interpretability) to AIs seeking to disempower humans seem pretty questionable - at best highly uncertain rather than "strong default of danger." This is how it seems to me, though someone with intuitions like Nate's would likely disagree.

Thanks, that all makes sense.

I agree you can still get a problem from goal misgeneralization and instrumental reasoning, but this seems noticeably less likely (assuming process-based training) than getting a problem from reinforcing pursuit of unintended outcomes. (A linked post has some discussion.) I put significant credence on something like "Internals-based training doesn't pan out, but neither does the concern about goal misgeneralization and instrumental rea

... (read more)
I think that's a legit disagreement. But I also claim that the argument I gave still works if you assume that AI is trained exclusively using RL - as long as that RL is exclusively "process-based." So this basic idea: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result.

It still seems, here, like you're not reinforcing unintended behaviors, so the concern comes exclusively from the kind of goal misgeneralization you'd get without having any particular reason to believe you are reinforcing it. Does that seem reasonable to you? If so, why do you think making RL more central makes process-based supervision less interesting? Is it basically that in a world where RL is central, it's too uncompetitive/practically difficult to stick with the process-based regime?

That was helpful, thanks! I will try to rephrase. Let’s consider two failure modes. First, define some terms:

X is what we want the AI to be trying to do:

  • In outcome-based training, X = “make money ethically”
  • In process-based training, X = “produce a good plan to make money ethically”

Y is something we want the AI to not try to do:

  • In outcome-based training, maybe Y = “secretly steal money”
  • In process-based training, maybe Y = “produce a deliberately deceptive plan” or “hack out of the box”.

Now consider two failure modes.


  • We were hoping for the AI
... (read more)
Some reactions on your summary:

This feels sort of off as a description - what actually might happen is that it takes a bunch of actual steps to make money ethically, but steps are graded based on audits of whether they seem reasonable without the auditor knowing the outcome.

The latter is the bigger concern, unless you mean the former as aimed at something like the latter. E.g., producing a "plan that seems better to us than it is" seems more likely to get reinforced by this process, but is also less scary, compared to doing something that manipulates and/or disempowers humans.

Or AI does a moderate-to-large amount of Y competently and successfully. Process-based training still doesn't seem like it would reinforce that behavior in the sense of making it more likely in the future, assuming the Y is short of something like "Hacks into its own reinforcement system to reinforce the behavior it just did" or "Totally disempowers humanity."

I don't think you need near-perfect rewards. The mistakes reinforce behaviors like "Do things that a silly human would think are reasonable steps toward the goal", not behaviors like "Manipulate the world into creating an appearance that the goal was accomplished." If we just get a whole lot of the former, that doesn't seem clearly worse than humans just continuing to do everything. This is a pretty central part of the hope.

I agree you can still get a problem from goal misgeneralization and instrumental reasoning, but this seems noticeably less likely (assuming process-based training) than getting a problem from reinforcing pursuit of unintended outcomes. (A linked post has some discussion.) I put significant credence on something like "Internals-based training doesn't pan out, but neither does the concern about goal misgeneralization and instrumental reasoning (in the context of process-based training, ie in the context of not reinforc

Sure. That excerpt is not great.

3 · Alex Turner · 7mo
(I do think that animals care about the reinforcement signals and their tight correlates, to some degree, such that it's reasonable to gloss it as "animals sometimes optimize rewards." I more strongly object to conflating what the animals may care about with the mechanistic purpose/description of the RL process.)

I'd consider this to be one of the more convincing reasons to be hesitant about a pause (as opposed to the 'crying wolf' argument, which seems to me like a dangerous way to think about coordinating on AI safety?). 

Can you elaborate on this? I think it’s incredibly stupid that people consider it to be super-blameworthy to overprepare for something that turned out not to be a huge deal—even if the expected value of the preparation was super-positive given what was known at the time. But, stupid as it may be, it does seem to be part of the situation we’r... (read more)

Maybe - I can see it being spun in two ways:

  1. The AI safety/alignment crowd was irrationally terrified of chatbots/current AI, forced everyone to pause, and then, unsurprisingly, didn't find anything scary.
  2. The AI safety/alignment crowd needed time to catch their alignment techniques up with the current models before things get dangerous in the future, and they did that.

To point (1): alignment researchers aren't terrified of GPT-4 taking over the world, wouldn't agree to this characterization, and are not communicating this to others. I don't expect this is how things will be interpreted if people are being fair.

I think (2) is the realistic spin, and could go wrong reputationally (like in the examples you showed) if there's no interesting scientific alignment progress made in the pause period. I don't expect there to be a lack of interesting progress, though. There's plenty of unexplored work in interpretability alone that could provide many low-hanging-fruit results. This is something I naturally expect out of a young field with a huge space of unexplored empirical and theoretical questions. If there's plenty of alignment research output during that time, then I'm not sure the pause will really be seen as a failure.

Yeah, agree. I'd say one of the best ways to do this is to make it clear what the purpose of the pause is and define what counts as the pause being a success (e.g. significant research output).

Also, your pro-pause points seem quite important, in my opinion, and outweigh the 'reputational risks' by a lot:

  • Pro-pause: It's "practice for later", "policy wins beget policy wins", etc., so it will be easier next time
  • Pro-pause: Needless to say, maybe I'm wrong and LLMs won't plateau!

I'd honestly find it a bit surprising if the reaction to this was to ignore future coordination for AI safety with a high probability. "Pausing to catch up alignment work" doesn't seem like the kind of thing which leads the world to think "AI can

I think that’s one consideration, but I think there are a bunch of considerations pointing in both directions. For example:

Pause in scaling up LLMs → less algorithmic progress:

  • The LLM code-assistants or research-assistants will be worse
  • Maybe you can only make algorithmic progress via doing lots of GPT-4-sized training runs or bigger and seeing what happens
  • Maybe pause reduces AI profit which would otherwise be reinvested in R&D

Pause in scaling up LLMs → more algorithmic progress:

  • Maybe doing lots of GPT-4-sized training runs or bigger is a distraction fr
... (read more)
I'd consider this to be one of the more convincing reasons to be hesitant about a pause (as opposed to the 'crying wolf' argument, which seems to me like a dangerous way to think about coordinating on AI safety?).

I don't have a good model for how much serious effort is currently going into algorithmic progress, so I can't say anything confidently there - but I would guess there's plenty and it's just not talked about?

It might be a question of which of the following two you think will most likely result in a dangerous new paradigm faster (assuming LLMs aren't the dangerous paradigm):

  1. the current amount of effort put into algorithmic progress, amplified by code assistants, apps, tools, research-assistants, etc.
  2. the counterfactual amount of effort put into algorithmic progress if a pause happens on scaling

I think I'm leaning towards (1) bringing about a dangerous new paradigm faster because:

  • I don't think the counterfactual amount of effort on algorithmic progress will be that much more significant than the current efforts (pretty uncertain on this, though)
  • I'm wary of adding faster feedback loops to technological progress/allowing avenues for meta-optimizations to humanity, since these can compound

The description doesn't seem so bad to me. Your post "Reward is not the optimization target" is about what actual RL algorithms actually do. The wiki descriptions here are a kind of normative motivation as to how people came to be looking into those algorithms in the first place. Like, if there's an RL algorithm that performs worse than chance at getting a high reward, then that ain't an RL algorithm. Right? Nobody would call it that.

I think lots of families of algorithms are likewise lumped together by a kind of normative "goal", even if any given algorit... (read more)
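To make that "normative goal" framing concrete, here's a toy sketch (my own illustration, not from the comments above) of why we'd call even a simple epsilon-greedy bandit learner an "RL algorithm": its update rule reliably pushes obtained reward up, even though reward is just a number fed into the update, not something the agent represents as a goal.

```python
import numpy as np

# Illustrative toy example: an epsilon-greedy bandit counts as "RL" because
# its update rule tends to increase reward obtained -- reward is the training
# signal, whatever we say about the agent's "goal".
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])   # unknown to the agent
Q = np.zeros(3)                           # per-arm value estimates
counts = np.zeros(3)
rewards = []

for t in range(2000):
    # Explore with probability 0.1, otherwise exploit the current best estimate.
    a = int(rng.integers(3)) if rng.random() < 0.1 else int(np.argmax(Q))
    r = float(rng.random() < true_means[a])   # Bernoulli reward
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]            # incremental mean update
    rewards.append(r)

late_avg = np.mean(rewards[-200:])            # ends up well above chance
```

An update rule that instead pushed `late_avg` *below* chance just wouldn't get filed under "reinforcement learning", which is the sense in which the wiki-style description is normatively accurate.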

4 Alex Turner 7mo
I agree that it is narrowly technically accurate as a description of researcher motivation. Note that they don't offer any other explanation elsewhere in the article. Also note that they also make empirical claims:

I think that’s what I said in the last paragraph of the comment you’re responding to:

(On a different topic, self-supervised pre-training before supervised fine-tuning is almost always better than supervised learning from random initialization, as far as I understand. Presumably if someone were following the OP protocol, which involves a supervised learning step, then they would follow all the modern best practices for supervised learning, and “start from a self-supervised-pretrained model” is part of those best practices.)
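As a toy, hypothetical illustration of "self-supervised pre-training before supervised fine-tuning": learn a representation from unlabeled data first, then fit a small supervised head on a few labels, rather than training everything from random initialization. Here a linear autoencoder (equivalently, PCA via SVD) stands in for the pre-training step; nothing in this sketch comes from the comment itself.

```python
import numpy as np

# Hypothetical toy illustration of pretrain-then-finetune vs. from-scratch.
rng = np.random.default_rng(0)

# Unlabeled data with latent structure: x = z @ W_true.T + noise
W_true = rng.normal(size=(20, 3))
z = rng.normal(size=(1000, 3))
X_unlabeled = z @ W_true.T + 0.1 * rng.normal(size=(1000, 20))

# "Self-supervised" step: learn an encoder by reconstructing the input.
# PCA (via SVD) is the closed-form solution for a linear autoencoder.
X_centered = X_unlabeled - X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
encoder = Vt[:3].T            # learned 20 -> 3 projection, no labels used

# Small labeled set for the supervised step (label depends on the latent z).
z_l = rng.normal(size=(30, 3))
X_labeled = z_l @ W_true.T + 0.1 * rng.normal(size=(30, 20))
y = (z_l[:, 0] > 0).astype(float)

def fit_logreg(features, y, steps=2000, lr=0.1):
    """Plain gradient descent on logistic loss."""
    w = np.zeros(features.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-features @ w))
        w -= lr * features.T @ (p - y) / len(y)
    return w

# Fine-tune a head on pretrained features, vs. supervised-only on raw inputs.
w_pretrained = fit_logreg(X_labeled @ encoder, y)   # 3 parameters
w_scratch = fit_logreg(X_labeled, y)                # 20 parameters
```

The pretrained route fits far fewer supervised parameters on the same 30 labels, which is the usual intuition for why pre-training helps when labeled data is scarce.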

Maybe that’s what PeterMcCluskey w... (read more)

Sure, we can take some particular model-based RL algorithm (MuZero, APTAMI, the human brain algorithm, whatever), but instead of “the reward function” we call it  “function #5829”, and instead of “the value function” we call it “function #6241”, etc. If you insist that I use those terms, then I would still be perfectly capable of describing step-by-step why this algorithm would try to kill us. That would be pretty annoying though. I would rather use the normal terms.

I’m not quite sure what you’re talking about (“projected from the labeled world model”... (read more)

Then ask the transformer to rate on a numeric scale how positively or negatively a human would feel in any particular situation.

I'm still confused. Here you're describing what you're hoping will happen at inference time. I'm asking how it's trained, such that that happens. If you have a next-frame video predictor, you can't ask it how a human would feel. You can't ask it anything at all - except "what might be the next frame of thus-and-such video?". Right?

I wonder if you've gotten thrown off by chatGPT etc. Those are NOT trained by SSL, and therefore N... (read more)

Not exactly. You can extract embeddings from a video predictor (activations of the next-to-last layer may do, or you can use techniques that enhance the semantic information captured in the embeddings). Then use supervised learning to train a simple classifier from an embedding to human feelings on a modest number of video/feelings pairs.
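A minimal sketch of that recipe (frozen predictor → penultimate-layer embeddings → simple supervised probe), with a fixed random feature map standing in for the frozen video predictor; all names and data below are made up for illustration:

```python
import numpy as np

# Toy "linear probe" sketch: freeze a predictor, take its next-to-last-layer
# activations as embeddings, fit a simple classifier on modest labeled data.
rng = np.random.default_rng(1)

# Stand-in for a frozen pretrained predictor: a fixed random feature map.
W1 = rng.normal(size=(50, 64)) / np.sqrt(50)

def embed(x):
    """'Next-to-last layer' activations of the frozen predictor."""
    return np.tanh(x @ W1)

# Modest labeled dataset: inputs paired with a scalar "feeling" score (+/-1).
X = rng.normal(size=(200, 50))
scores = np.sign(X[:, 0])         # hidden rule the probe must recover

# Fit a linear probe on the frozen embeddings by least squares.
E = embed(X)
w, *_ = np.linalg.lstsq(E, scores, rcond=None)

preds = np.sign(E @ w)
train_acc = (preds == scores).mean()   # well above chance
```

The predictor's weights are never touched; only the small probe `w` is trained, which is why this needs far fewer labeled pairs than training a model end-to-end.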

I think of my specialty as mostly “trying to solve the alignment problem for model-based RL”. (LeCun’s paper is an example of model-based RL.) I think that’s a somewhat different activity than, say, “trying to solve the alignment problem for LLMs”. Like, I read plenty of alignmentforum posts on the latter topic, and I mostly don’t find them very relevant to my work. (There are exceptions.) E.g. the waluigi effect is not something that seems at all relevant to my work, but it’s extremely relevant to the LLM-alignment crowd. Conversely, for example, here’s a... (read more)

1 Ben Amitay 7mo
I see. I didn't fully adapt to the fact that not all alignment is about RL. Beside the point: I think those labels on the data structures are very confusing. Both the actor and the critic are very likely to have their own specialized world models (projected from the labeled world model) and planning abilities. The values of the actor need not be the same as the output of the critic. And things value-related and planning-related may easily leak into the world model if you don't actively try to prevent it. So I suspect that we should ignore the labels and focus on architecture and training methods.

I don't understand how the remaining technical problem is not basically the whole of the alignment problem

Yes. I don’t think the paper constitutes any progress on the alignment problem. (No surprise, since it talks about the problem for only a couple sentences.)

Hmm, maybe you’re confused that the title refers to “an unsolved technical alignment problem” instead of “the unsolved technical alignment problem”? Well, I didn’t mean it that way. I think that solving technical alignment entails solving a different (albeit related) technical problem for each diffe... (read more)

1 Ben Amitay 7mo
Yes, I think that was it; and that I did not (and still don't) understand what about that possible AGI architecture is non-trivial and has non-trivial implications for alignment, even if not ones that make it easier. It seems like not only the same problems carefully hidden, but the same flavor of the same problems in plain sight.

If I were trying to make this model work, I'd use mainly self-supervised learning that's aimed at getting the module to predict what a typical human would feel. 

I don’t follow. Can you explain in more detail? “Self-supervised learning” means training a model to predict some function / subset of the input data from a different function / subset of the input data, right? What’s the input data here, and what is the prediction target?

I haven't thought this out very carefully. I'm imagining a transformer trained both to predict text, and to predict the next frame of video. Train it on all available videos that show realistic human body language. Then ask the transformer to rate on a numeric scale how positively or negatively a human would feel in any particular situation. This does not seem sufficient for a safe result, but implies that LeCun is less nutty than your model of him suggests.

Thanks. That did actually occur to me, but I left it out because I wasn’t sure and didn’t want to go on an exhausting chase down every possible interpretation of the paper.

Anyway, if the input to the Prosociality Score Model is a set of latent variables rather than a set of pixels then:

  • My OP claim that there are two adversarial out-of-distribution generalization problems (in the absence of some creative solution not in the paper) is still true.
  • One of those two problems (OOD generalization of the Prosociality Score Model) might get less bad, although I don’
... (read more)

Whatever. Maybe I was just jumping on an excuse to chit-chat about possible limitations of LLMs :) And maybe I was thread-hijacking by not engaging sufficiently with your post, sorry.

This part you wrote above was the most helpful for me:

if the task is "spend a month doing novel R&D for lidar", then my framework predicts that we'll need 1-month AGI for that

I guess I just want to state my opinion that (1) summarizing a 10,000-page book is a one-month task but could come pretty soon if indeed it’s not already possible, (2) spending a month doing novel R&a... (read more)

6 Richard Ngo 7mo
Yeah, I agree I convey the implicit prediction that, even though not all one-month tasks will fall at once, they'll be closer than you would otherwise expect not using this framework. I think I still disagree with your point, as follows: I agree that AI will soon do passably well at summarizing 10k word books, because the task is not very "sharp" - i.e. you get gradual rather than sudden returns to skill differences. But I think it will take significantly longer for AI to beat the quality of summary produced by a median expert in 1 month, because that expert's summary will in fact explore a rich hierarchical interconnected space of concepts from the novel (novel concepts, if you will).

I think the “in one second” would be cheating. The question for Ed Witten didn’t specify “the AI can’t answer it in one second”, but rather “the AI can’t answer it period”. Like, if GPT-4 can’t answer the string theory question in 5 minutes, then it probably can’t answer it in 1000 years either.

(If the AI can get smarter and smarter, and figure out more and more stuff, without bound, in any domain, by just running it longer and longer, then (1) it would be quite disanalogous to current LLMs [btw I’ve been assuming all along that this post is implicitly ima... (read more)

4 Richard Ngo 7mo
Why is it cheating? That seems like the whole point of my framework - that we're comparing what AIs can do in any amount of time to what humans can do in a bounded amount of time.

Ah, that’s helpful, thanks.

Sure, there are some questions that don't appear at all on the internet, but most human knowledge is, so you'd have to cherry-pick questions.

I think you’re saying “there are questions about string theory whose answers are obvious to Ed Witten because he happened to have thought about them in the course of some unpublished project, but these questions are hyper-specific, so bringing them up at all would be unfair cherry-picking.”

But then we could just ask the question: “Can you please pose a question about string theory that no AI... (read more)

5 Richard Ngo 7mo
But can't we equivalently just ask an AI to pose a question that no human would have a prayer of answering in one second? It wouldn't even need to be a trivial memorization thing, it could also be a math problem complex enough that humans can't do it that quickly, or drawing a link between two very different domains of knowledge.

(expanding on my reply to you on twitter)

For the t-AGI framework, maybe you should also specify that the human starts the task only knowing things that are written multiple times on the internet. For example, Ed Witten could give snap (1-second) responses to lots of string theory questions that are WAY beyond current AI, using idiosyncratic intuitions he built up over many years. Likewise a chess grandmaster thinking about a board state for 1 second could crush GPT-4 or any other AI that wasn’t specifically and extensively trained on chess by humans.

A star... (read more)

How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?

This doesn't seem like a great metric because there are many tasks that a college grad can do with 0 training that current AI can't do, including:

  • Download and play a long video game to completion
  • Read and summarize a whole book
  • Spend a month planning an event

I do think that there's something important about this metric, but I think it's basically subsumed by my metric: if the task is "spend a month doing novel R&D for... (read more)

I don't know what probability you mean with "not crazy"

Me neither. I’m not close enough to the technical details to know. I did run that particular sentence by a guy who’s much more involved in the field before I published, and he said it was a good sentence, but only because “not crazy to hope for X” is a pretty weak claim.

After reading the post "Whole Brain Emulation: No Progress on C. elegans After 10 Years"

Yeah, the C. elegans connectome has been known for a very long time. The thing that’s hard for C. elegans is going from the connectome to WBE. As ... (read more)

When I started blogging about AI alignment in my free time, it happened that GPT-2 had just come out, and everyone on LW was talking about it. So I wrote a couple blog posts (e.g. 1,2) trying (not very successfully, in hindsight, but I was really just starting out, don’t judge) to think through what would happen if GPT-N could reach TAI / x-risk levels. I don’t recall feeling strongly that it would or wouldn’t reach those levels, it just seemed like worth thinking about from a safety perspective and not many other people were doing so at the time. But in t... (read more)

Belief propagation is the kind of thing that most people wouldn't work on in an age before computers. It would be difficult to evaluate/test, but more importantly wouldn't have much hope for application.

Hmm. I’m not sure I buy that. Can’t we say the same thing about FFT? Doing belief prop by hand doesn’t seem so different from doing an FFT by hand; and both belief prop and FFT were totally doable on a 1960s mainframe, if not earlier, AFAICT. But the modern FFT algorithm was published in 1965, and people got the gist of it in 1942, and 1932, and even G... (read more)

I guess you’re referring to my comment “for my part, if I believed that (A)-type systems were sufficient for TAI—which I don’t—then I think I would feel slightly less concerned about AI x-risk than I actually do, all things considered!”

I’m not implicitly comparing (A) to “completely unknown mystery algorithm”, instead I’m implicitly comparing (A) to “brain-like AGI, or more broadly model-based RL AGI, or even more broadly some kind of AGI that incorporates RL in a much more central way than LLMs do”. I think, by and large, more RL (as opposed to super... (read more)

Oh, I somehow missed that your original question was about takeoff speeds. When you wrote “algorithmic insights…will lead to dramatically faster AI development”, I misread it as “algorithmic insights…will lead to dramatically more powerful AIs”. Oops. Anyway, takeoff speeds are off-topic for this post, so I won’t comment on them, sorry. :)

In this post, I’m not trying to convert people to LLM plateau-ism. I only mentioned my own opinions as a side-comment + short footnote with explicitly no justification. And if I were trying to convert people to LLM plateau-ism, I would certainly not attempt to do so on the basis of my AI forecasting track record, which is basically nonexistent.  :)

2 Vivek Hebbar 7mo
It would still be interesting to know whether you were surprised by GPT-4's capabilities (if you have played with it enough to have a good take)

Can you flesh out your view of how the community is making "slow but steady progress right now on getting ready"?

  • I finished writing this less than a year ago, and it seems to be meaningfully impacting a number of people’s thinking, hopefully for the better. I personally feel strongly like I’m making progress on a worthwhile project and would like lots more time to carry it through, and if it doesn’t work out I have others in the pipeline. I continue to have ideas at a regular clip that I think are both important and obvious-in-hindsight, and to notice new
... (read more)
Belief propagation is the kind of thing that most people wouldn't work on in an age before computers. It would be difficult to evaluate/test, but more importantly wouldn't have much hope for application. Seems to me it arrived at a pretty normal time in our world. What do you think of diffusion planning?
Thanks! I agree with you about all sorts of AI alignment essays being interesting and seemingly useful. My question was more about how to measure the net rate of AI safety research progress. But I agree with you that an/your expert inside view of how insights are accumulating is a reasonable metric. I also agree with you that the acceptance of TAI x-risk in the ML community as a real thing is useful and that - while I am slightly worried about the risk of overshooting, like Scott Alexander describes - this situation seems to be generally improving.

Regarding (2), my question is why algorithmic growth leading to serious growth of AI capabilities would be so discontinuous. I agree that RL is much better in humans than in machines, but I doubt that replicating this in machines would require just one or a few algorithmic advances. Instead, my guess, based on previous technology growth stories I've read about, is that AI algorithmic progress is likely to occur due to the accumulation of many small improvements over time.