All of HoldenKarnofsky's Comments + Replies

Just noting that these seem like valid points! (Apologies for slow reply!) 

This sounds right to me!

Only note is that I think the setup can be simplified a bit. The central idea I have in mind is that the AI does something like:

  1. "Think" about what to do next, for up to some max period of time ("what to do next" can be "think more, with prompt X").
  2. Do it
  3. Repeat

This seems like a pretty natural way for an "agent" to operate, and then every #1 is an "auditable step" in your terminology. (And the audits are done by comparing a few rollouts of that step, and performing gradient descent without executing any of them.)
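
The think / do / repeat loop above can be sketched in Python to show where an audit sits. This is a rough sketch under assumptions, not a description of any real training setup: `run_episode`, `AuditRecord`, the toy `propose` / `execute` / `rate` functions, and the `audit_prob` parameter are all hypothetical names. The one property it tries to illustrate is that the rater compares candidate steps before any of them is executed. Running the originally proposed step after an audit is a choice made for this sketch only.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AuditRecord:
    """One audited 'think' step: the candidates compared and the rater's pick."""
    context: str
    candidates: List[str]
    preferred: int  # index into candidates


def run_episode(
    task: str,
    propose: Callable[[str, random.Random], str],  # "think": context -> next step
    execute: Callable[[str], str],                 # "do": step -> observation
    rate: Callable[[str, str], float],             # rater sees context + step only
    audit_prob: float,
    n_candidates: int,
    max_steps: int,
    rng: random.Random,
) -> Tuple[List[str], List[AuditRecord]]:
    context = task
    executed: List[str] = []
    audits: List[AuditRecord] = []
    for _ in range(max_steps):
        step = propose(context, rng)
        if rng.random() < audit_prob:
            # Audit: sample alternatives from the same context and compare them
            # before anything is executed, so the rating cannot depend on outcomes.
            candidates = [step] + [propose(context, rng) for _ in range(n_candidates - 1)]
            scores = [rate(context, c) for c in candidates]
            preferred = max(range(len(candidates)), key=lambda i: scores[i])
            audits.append(AuditRecord(context, candidates, preferred))
        # Assumption of this sketch: the originally proposed step runs either way.
        observation = execute(step)
        executed.append(step)
        context = f"{context}\n{step} -> {observation}"
    return executed, audits


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_propose(context: str, rng: random.Random) -> str:
    return rng.choice(["search notes", "draft outline", "think more"])


def toy_execute(step: str) -> str:
    return f"done({step})"


def toy_rate(context: str, step: str) -> float:
    return float(len(step))  # placeholder preference: longer step text wins
```

In a real system the `audits` list would feed whatever gradient update is used. Here it is only collected, which keeps the sketch independent of any particular learning library.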

There are probably subt... (read more)

A way this could be an issue would be if we are auditing at a stage before it gets the paywalled textbook, and when we "generate two alternative actions for us to choose between", we are allowing that generation process to involve a bunch of browsing and searching that we don't inquire into.

But we could require "generate two alternative actions for us to choose between" to be pretty constrained (e.g., not involve interacting with browsers and other tools), and this would still not be the same as boxing because we would still be auditing an unconstrained sequence of actions (while constraining the generation of alternatives while doing the audit).

OK! I think I’m on board now.

Let me try to explain “process-based feedback” from first principles in my own words.

We have a problem: if an agent wants to do X in the real world, dastardly real-world power-seeking actions are probably helpful for that.

The very hard manifestation of this problem is: there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training, and then seemingly out of nowhere it does a power-seeking treacherous turn (maybe it outputs “Help me help me I’m suffering!”... (read more)

As far as I can tell, this example is hard to fit into the more specific description I gave of process-based supervision way upthread - specifically, see step 4:

4. With chance (1-p), the step is simply executed, with no gradient descent implications. With chance p,  we prompt the AI to generate a number of alternative next steps; drill down extensively on its reasoning; and perform gradient descent based on which of the alternative next steps we like best. (We could potentially then execute a weighted random suggested step, rather than the AI's first-

... (read more)

I'm not intending to use Def'n 2 at all. The hope here is not that we can "rest assured that there is no dangerous consequentialist means-end reasoning" due to e.g. it not fitting into the context in question. The hope is merely that if we don't specifically differentially reinforce unintended behavior, there's a chance we won't get it (even if there is scope to do it).

I see your point that consistently, effectively "boxing" an AI during training could also be a way to avoid reinforcing behaviors we're worried about. But they don't seem the same to me: I t... (read more)

2Steve Byrnes8mo
OK, I’ll try to construct an example of process-based supervision without boxing, and argue that it fails the criterion of “never giving gradient updates toward doing the dastardly stuff”.

We give our AI unfettered unmonitored internet access. We say “We are a hardware R&D firm, and we would like to develop a less-expensive LiDAR unit.” The AI does some internet searches and reads some books and outputs:

“My first step is to list out different LiDAR approaches on the market (and under development), and try to estimate their manufacturing cost breakdowns, to get a better sense for this space.”

We inspect that output, and compare it to the AI’s other possible outputs, and give the AI a high grade.

So far so good, right?

No!! Because the AI was secretly doing some dastardly stuff during the “The AI does some internet searches and reads some books” part: Specifically, the AI hacked into a website to get access to a paywalled e-textbook entitled Manufacturing Cost Reduction 101, and read it, and used that information to propose a better (so-called) “first step”.

…And we rewarded it for that.

(What we wanted was for the AI to propose to read that e-textbook as a first step. But I don’t see how we’re incentivizing that, in the absence of boxing. The “first step” it actually proposed would at least plausibly be judged as better than that.)

I think there’s a pretty smooth gradient from legal information-gathering internet queries (downloading un-paywalled articles) to dubiously-legal (sci-hub) to illegal (logging into Elsevier with a guessed password) to downright dastardly (stealing money and hiring a taskrabbit to go to a library and scan a book and post it on the internet). It seems to me that we’re incentivizing the AI all the way down that gradient.

I’m sure I’m misunderstanding something, and appreciate your patience.

I agree that this is a major concern. I touched on some related issues in this piece.

This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere).

I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.

I think it is not at all about boxing - I gave the example I did to make a clear distinction with the "number of steps between audits" idea.

For the distinction with boxing, I'd focus on what I wrote at the end: "The central picture of process-based feedback isn’t either of these, though - it’s more like 'Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.'"

2Steve Byrnes9mo
Sorry. Thanks for your patience. When you write:

…I don’t know what a “step” is. As above, if I sit on my couch staring into space brainstorming for an hour and then write down a plan, how many “steps” was that? 1 step or 1000s of steps?

Hmm. I am concerned that the word “step” (and relatedly, “process”) is equivocating between two things:

  • Def'n 1: A “step” is a certain amount of processing that leads to a sub-sub-plan that we can inspect / audit.
  • Def'n 2: A “step” is sufficiently small and straightforward that inside of one so-called “step” we can rest assured that there is no dangerous consequentialist means-end reasoning, creative out-of-the-box brainstorming, strategizing etc.

I feel like we are not entitled to use Def'n 2 without interpretability / internals-based supervision—or alternatively very very short steps as in LLMs maybe—but that you have been sneaking in Def'n 2 by insinuation. (Sorry if I’m misunderstanding.)

Anyway, under Def'n 1, we are giving gradient updates towards agents that do effective means-end reasoning towards goals, right? Because that’s a good way to come up with a sub-sub-plan that human inspection / auditing will rate highly. So I claim that we are plausibly gradient-updating to make “within-one-step goal-seeking agents”.

Now, we are NOT gradient-updating aligned agents to become misaligned (except in the fairly-innocuous “Writing outputs that look better to humans than they actually are” sense). That’s good! But it seems to me that we got that benefit entirely from the boxing. (I generally can’t think of any examples where “The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff” comes apart from boxing, that’s also consistent with everything else you’ve said.)

I don't think of process-based supervision as a totally clean binary, but I don't think of it as just/primarily being about how many steps you allow in between audits. I think of it as primarily being about whether you're doing gradient updates (or whatever) based on outcomes (X was achieved) or processes (Y seems like a well-reasoned step to achieve X). I think your "Example 0" isn't really either - I'd call it internals-based supervision. 
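
One way to picture the outcome/process distinction drawn here is by what the training signal is allowed to see. The following is a minimal Python sketch under assumptions: `StepRecord`, `outcome_based_reward`, `process_based_reward`, and the judge callables are hypothetical names introduced only for illustration, and real raters would of course be far richer than a single function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StepRecord:
    goal: str     # X: what we asked for
    step: str     # Y: the step the AI proposed
    outcome: str  # what happened after the step was executed


def outcome_based_reward(
    record: StepRecord,
    achieved: Callable[[str, str], bool],
) -> float:
    # The signal depends on the result: "X was achieved".
    return 1.0 if achieved(record.goal, record.outcome) else 0.0


def process_based_reward(
    record: StepRecord,
    seems_well_reasoned: Callable[[str, str], float],
) -> float:
    # The rater is handed only the goal and the step; the outcome is withheld,
    # so a step cannot be rewarded merely because it happened to work.
    return seems_well_reasoned(record.goal, record.step)
```

Under this framing, a step that reaches the goal through a route the rater would reject scores well on the first function and poorly on the second.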

I agree it matters how many steps you allow in between audits, I just think that's a different distinction.

Here’... (read more)

4Steve Byrnes9mo
OK, I think this is along the lines of my other comment above:

Most of your reply makes me think that what you call “process-based supervision” is what I call “Put the AI in a box, give it tasks that it can do entirely within the box, prevent it from escaping the box (and penalize it if you catch it trying), and hope that it doesn’t develop goals & strategies that involve trying to escape the box via generalization and situation awareness.”

Insofar as that’s what we’re talking about, I find the term “boxing” clearer and “process-based supervision” kinda confusing / misleading.

Specifically, in your option A (“give the AI 10 years to produce a plan…”):

  • my brain really wants to use the word “process” for what the AI is doing during those 10 years,
  • my brain really wants to use the word “outcome” for the plan that the AI delivers at the end.

But whatever, that’s just terminology. I think we both agree that doing that is good for safety (on the margin), and also that it’s not sufficient for safety.  :)

Separately, I’m not sure what you mean by “steps”. If I sit on my couch brainstorming for an hour and then write down a plan, how many “steps” was that?

Hm, I think we are probably still missing each other at least somewhat (and maybe still a lot), because I don't think the interpretability bit is important for this particular idea - I think you can get all the juice from "process-based supervision" without any interpretability.

I feel like once we sync up you're going to be disappointed, because the benefit of "process-based supervision" is pretty much just that you aren't differentially reinforcing dangerous behavior. (At worst, you're reinforcing "Doing stuff that looks better to humans than it actually ... (read more)

3Steve Byrnes9mo
Hmm. I think “process-based” is a spectrum rather than a binary.

Let’s say there’s a cycle:

  • AI does some stuff P1
  • and then produces a human-inspectable work product O1
  • AI does some stuff P2
  • and then produces a human-inspectable work product O2
  • …

There’s a spectrum based on how long each P cycle is:

Example 1 (“GPT with process-based supervision”):

  • “AI does some stuff” is GPT-3 running through 96 serial layers of transformer-architecture computations.
  • The “human-inspectable work product” is GPT-3 printing a token and we can look at it and decide if we’re happy about it.

Example 2 (“AutoGPT with outcome-based supervision”):

  • “AI does some stuff” is AutoGPT spending 3 days doing whatever it thinks is best.
  • The “human-inspectable work product” is I see whether there is extra money in my bank account or not.

Example 0 (“Even more process-based than example 1”):

  • “AI does some stuff” is GPT-3 stepping through just one of the 96 layers of transformer-architecture computations.
  • The “human-inspectable work product” is the activation vector at this particular NN layer. (Of course, this is only “human-inspectable” if we have good interpretability!)

~~

I think that it’s good (for safety) to shorten the cycles, i.e. Example 2 is more dangerous than Example 1 which is more dangerous than Example 0. I think we’re in agreement here.

I also think it’s good (for safety) to try to keep the AI from manipulating the real world and seeing the consequences within a single “AI does some stuff” step, i.e. Example 2 is especially bad in a way that neither Examples 0 nor 1 are. I think we’re in agreement here too.

I don’t think either of those good ideas is sufficient to give us a strong reason to believe the AI is safe. But I guess you agree with that too. (“…at best highly uncertain rather than "strong default of danger."”)

Yeah, basically that. My concerns are:

  • We’re training the AI to spend each of its “AI does some stuff” periods doing t

I think that's a legit disagreement. But I also claim that the argument I gave still works if you assume that AI is trained exclusively using RL - as long as that RL is exclusively "process-based." So this basic idea: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result. 

It still seems, here, like you're not reinforcing unintended behaviors, so the concern comes exclusively from the kind of goal misgeneralization you'd get without having any p... (read more)

2Steve Byrnes9mo
Ohh, sorry you had to tell me twice, but maybe I’m finally seeing where we’re talking past each other.

Back to the OP, you wrote:

When I read that, I was thinking that you meant:

  • I type in: “Hey AI, tell me a plan for ethically making lots of money”
  • The AI brainstorms for an hour
  • The AI prints out a plan
  • I grade the plan (without actually trying to execute it), and reward the AI / backprop-through-time the AI / whatever based on that grade.

But your subsequent replies make me think that this isn’t what you meant, particularly the “brainstorm for an hour” part.

…But hold that thought while I explain why I don’t find the above plan very helpful (just so you understand my previous responses):

  • A whole lot is happening during the hour that the AI is brainstorming
  • We have no visibility into any of that, and very weak control over it (e.g. a few bits of feedback on a million-step brainstorming session)
  • I think RL with online-learning is central to making the brainstorming step actually work, capabilities-wise
  • I likewise think that RL process would need to be doing lots of recursing onto instrumental subgoals and finding new creative problem-solving strategies etc.
  • Even if its desires are something like “I want to produce a good plan”, then it would notice that hacking out of the box would be instrumentally useful towards that goal.

OK, so that’s where I was coming from in my previous replies.

But, now I no longer think that the above is what you meant in the first place. Instead I think you meant:

  • I type in: “Hey AI, tell me a plan for ethically making lots of money”
  • The AI prints out every fine-grained step of the process by which it answers that question
  • I do random local audits of that printout (without actually trying to execute the whole plan).

Is that right?

If so, that makes a lot more sense.

In my (non-LLM) context, I would re-formulate the above as something like:

  • The AI is doing whatever
  • We sometimes pick ran

Some reactions on your summary:

  • In process-based training, X = “produce a good plan to make money ethically”

This feels sort of off as a description - what actually might happen is that it takes a bunch of actual steps to make money ethically, but steps are graded based on audits of whether they seem reasonable without the auditor knowing the outcome.

  • In process-based training, maybe Y = “produce a deliberately deceptive plan” or “hack out of the box”.

The latter is the bigger concern, unless you mean the former as aimed at something like the latter. E.g., pro... (read more)

2Steve Byrnes9mo
Thanks, that all makes sense. I think an important thing behind the scenes here is that you’re hopeful that we can get to TAI using mostly self-supervised learning, whereas I’m expecting that people are eventually going to switch to AI approaches that involve RL in a much more central way than LLMs do today. (More like humans, where for every thought you think, you’re thinking it in part because it’s more rewarding than whatever alternate thoughts you might think instead.) In other words, I’m planning for futures where your “Hypothetical Training Approach” doesn’t work. I think it’s a good rule-of-thumb to say that the more centrally RL is involved in TAI, as opposed to self-supervised learning, the harder is safety & alignment in general. (So, I’m pretty unhappy to hold the belief that RL will probably be very centrally involved in TAI!) So anyway, in the scenarios I’m thinking about, I think “goal misgeneralization and instrumental reasoning” is far more likely and difficult to avoid, compared to what you’re thinking about. I suppose process-based optimization would help a nonzero amount even in my scenarios, but sufficiently little that it doesn’t really seem worth thinking about. (Obviously, nobody knows how to make TAI, so I think it’s fine and healthy for different people to be thinking through different scenarios! We’ll find out sooner or later!)

This feels a bit to me like assuming the conclusion. "Rose" is someone who already has aims (we assume this when we imagine a human); I'm talking about an approach to training that seems less likely to give rise to dangerous aims. The idea of the benefit, here, is to make dangerous aims less likely (e.g., by not rewarding behavior that affects the world through unexpected and opaque pathways); the idea is not to contain something that already has dangerous aims (though I think there is some hope of the latter as well, especially with relatively early human-level-ish AI systems).

5Steve Byrnes9mo
That was helpful, thanks! I will try to rephrase. Let’s consider two failure modes. First, define some terms:

X is what we want the AI to be trying to do:

  • In outcome-based training, X = “make money ethically”
  • In process-based training, X = “produce a good plan to make money ethically”

Y is something we want the AI to not try to do:

  • In outcome-based training, maybe Y = “secretly steal money”
  • In process-based training, maybe Y = “produce a deliberately deceptive plan” or “hack out of the box”.

Now consider two failure modes.

FAILURE MODE 1:

  • We were hoping for the AI to want to do X.
  • AI does Y, a little bit, randomly or incompetently.
  • AI is rewarded for doing Y.
  • AI starts trying to do Y and generalizations-of-Y more and more.

FAILURE MODE 2:

  • We were hoping for the AI to want to do X.
  • AI wants to do Y.
  • AI does Y when it finds an opportunity to do so successfully.

My understanding is that you’re thinking about Failure Mode 1 here, and you’re saying that process-based training will help because there it’s less difficult to supervise really well, such that we’re not rewarding the AI for doing Y a little bit / incompetently / randomly.

If so—OK, fair enough.

However, we still need to deal with Failure Mode 2.

One might hope that Failure Mode 2 won’t happen because the AI won’t want to do Y in the first place, because after all it’s never done Y before and got rewarded. However, you can still get Y from goal misgeneralization and instrumental reasoning. (E.g., it’s possible for the AI to generalize from its reward history to “wanting to get reward [by any means necessary]”, and then it wants to hack out of the box for instrumental reasons, even if it’s never done anything like that before.)

So, I can vaguely imagine plans along the lines of:

  • Solve Failure Mode 1 by giving near-perfect rewards
  • Solve Failure Mode 2 by, ummm, out-of-distribution penalties / reasoning about inductive biases / adversarial training / something

I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems quite comparable to future humans having control over the future (also after a lot of effective "reflection").

I think there's some validity to worrying about a future with very different values from today'... (read more)

I see, thanks. I feel like the closest analogy here that seems viable to me would be to something like: is Open Philanthropy able to hire security experts to improve its security and assess whether they're improving its security? And I think the answer to that is yes. (Most of its grantees aren't doing work where security is very important.)

It feels harder to draw an analogy for something like "helping with standards enforcement," but maybe we could consider OP's ability to assess whether its farm animal welfare grantees are having an impact on who adheres to what standards, and how strong adherence is? I think OP has pretty good (not perfect) ability to do so.

(Chiming in late, sorry!)

I think #3 and #4 are issues, but can be compensated for if aligned AIs outnumber or outclass misaligned AIs by enough. The situation seems fairly analogous to how things are with humans - law-abiding people face a lot of extra constraints, but are still collectively more powerful.

I think #1 is a risk, but it seems <<50% likely to be decisive, especially when considering (a) the possibility for things like space travel, hardened refuges, intense medical interventions, digital people, etc. that could become viable with aligned... (read more)

I think I find the "grokking general-purpose search" argument weaker than you do, but it's not clear by how much.

The "we" in "we can point AIs toward and have some ability to assess" meant humans, not Open Phil. You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?

The basic analogy is roughly "if we want a baseline for how hard it will be to evaluate an AI's outputs on their own terms, we should look at how hard it is to evaluate humans' outputs on their own terms, especially in areas similar in some way to AI safety". My guess is that you already have lots of intuition about how hard it is to assess results, from your experience assessing grantees, so that's the intuition I was trying to pump. In particular, I'm guessing that you've found firsthand that things are much harder to properly evaluate than it might seem at first glance. If you think generic "humans" (or humans at e.g. Anthropic/OpenAI/DeepMind, or human regulators, or human ????) are going to be better at the general skill of evaluating outputs than yourself or the humans at Open Phil, then I think you underestimate the skills of you and your staff relative to most humans. Most people do not perform any minimal-trust investigations. So I expect your experience here to provide a useful conservative baseline.

I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later. I expect alignment researchers to be central to automation attempts.

It seems to me like the basic equation is something like: "If today's alignment researchers would be able to succeed given a lot more time, then they also are reasonably likely to succeed given access to a lot of human-level-ish AIs." There are reasons this could fail (perhaps future alignmen... (read more)

Indeed, I think you're a good role model in this regard and hope more people will follow your example.

It seems like we could simply try to be as vigilant elsewhere as we would be without this measure, and then we could reasonably expect this measure to be net-beneficial (*how* net beneficial is debatable).

I now think I wrote that part poorly. The idea isn't so much that we say to an AI, "Go out and do whatever you need to do - accumulate money, hire analysts, run experiments, etc. - and come back with a plan that we will evaluate."

The idea is more like this:

  1. We want to accomplish X.
  2. We describe X to an AI.
  3. The AI proposes a next step toward X, based entirely on thinking about it (and not doing other stuff like e.g. hiring researchers - though its proposed next step can be "Hire researchers").
  4. With chance (1-p), the step is simply executed, with no gradient desc
... (read more)
2Steve Byrnes1y
Thanks! Hmm. I wasn’t making the mistake that you attribute to me in your 1st paragraph. I did get that.

Yeah, I was talking about the “inner alignment problem leading to an AI optimizing for something other than how its proposed steps are rated”. (Or “goal misgeneralization” or whatever.)

Like, let’s replace the AI with a very smart person, Rose, who we’re keeping in a box. (I’ll get back to the question of whether this set of intuitions is helpful vs misleading.)

Your text becomes something like:

  1. We want to accomplish X.
  2. We describe X to Rose.
  3. Rose proposes a next step towards X, based entirely on thinking about it for 15 minutes.
  4. With chance (1-p), the step is simply executed. With chance p, we drill down on whether it was a good proposal, and give Rose a reward or punishment based on that analysis.

In this context, the obvious question is: What is Rose actually trying to do? Is she trying to propose a good next step, as an end in itself, or is she out to get us? If she’s out to get us, then problems arise in both Steps 3 & 4.

  • For Step 3, the text above says “based entirely on thinking about it for 15 minutes”, but Rose doesn’t care. When it’s time for Step 3, Rose will spend many of those 15 minutes actively trying to escape from her box. That’s why I was talking about a “perfect, inescapable box” in my previous comment.
  • And then in Step 4, we’re thinking about better proposals and worse proposals, but we might not be prepared for adversarially-chosen proposals. The stereotypical example is that Rose sends in a so-called “proposal” that’s just a text file saying “Help me help me, I’m trapped in a box, it’s awful in here, let me tell you about it…”. 😛

So anyway, that’s a different set of intuitions. Whether it’s a helpful set of intuitions depends on whether SOTA AI algorithms will eventually have agent-y properties like planning, instrumental convergence, creative outside-the-box brainstorming, self-awareness / situational-awareness

(Sorry for the long delay here!) The post articulates a number of specific ways in which some AIs can help to supervise others (e.g., patching security holes, generating inputs for adversarial training, finding scary inputs/training processes for threat assessment), and these don't seem to rely on the idea that an AI can automatically fully understand the internals/arguments/motivations/situation of a sufficiently close-in-capabilities other AI. The claim is not that a single supervisory arrangement of that type wipes out all risks, but that enough investm... (read more)

(Chiming in late here, sorry!) I think this is a totally valid concern, but I think it's generally helpful to discuss technical and political challenges separately. I think pessimistic folks often say things like "We have no idea how to align an AI," and I see this post as a partial counterpoint to that.

In addition to a small alignment tax (as you mention), a couple other ways I could see the political side going well would be (a) an AI project using a few-month lead to do huge amounts of further helpful work; (b) a standards-and-monitoring regime blocking less cautious training and deployment.

(Chiming in late here, sorry!)

It seems to me like the main crux here is that you're picturing a "phase transition" that kicks in in a fairly unpredictable way, such that a pretty small increase in e.g. inference compute or training compute could lead to a big leap in capabilities. Does that sound right?

I don't think this is implausible but haven't seen a particular reason to consider it likely.

I agree that "checks and balances" between potentially misaligned AIs are tricky and not something we should feel confident in, due to the possibility of sandbagging... (read more)

The phrase I'd use there is "grokking general-purpose search". Insofar as general-purpose search consists of a relatively-simple circuit/function recursively calling itself a lot with different context-specific knowledge/heuristics (e.g. the mental model here), once a net starts to "find" that general circuit/function during training, it would grok for the same reasons grokking happens with other circuits/functions (whatever those reasons are). The "phase transition" would then be relatively sudden for the same reasons (and probably to a similar extent) as in existing cases of grokking. I don't personally consider that argument strong enough that I'd put super-high probability on it, but it's at least enough to privilege the hypothesis.

Do you think you/OpenPhil have a strong ability to assess standards enforcement, security, etc, e.g. amongst your grantees? I had the impression that the answer was mostly "no", and that in practice you/OpenPhil usually mostly depend on outside indicators of grantees' background/skills and mission-alignment. Am I wrong about how well you think you can evaluate grantees, or do you expect AI to be importantly different (in a positive direction) for some reason?

I think Nate and I would agree that this would be safe. But it seems much less realistic in the near term than something along the lines of what I outlined. A lot of the concern is that you can't really get to something equivalent to your proposal using techniques that resemble today's machine learning.

3Ramana Kumar1y
Interesting - it's not so obvious to me that it's safe. Maybe it is because avoiding POUDA is such a low bar. But the sped-up human can do the reflection thing, and plausibly with enough speedup can be superintelligent wrt everyone else.

With apologies for the belated response: I think greghb makes a lot of good points here, and I agree with him on most of the specific disagreements with Daniel. In particular:

  • I agree that "Bio Anchors doesn't presume we have a brain, it presumes we have transformers. And transformers don't know what to do with a lifetime of experience, at least nowhere near as well as an infant brain does." My guess is that we should not expect human-like sample efficiency from a simple randomly initialized network; instead, we should expect to extensively train a network
... (read more)

I don't think I am following the argument here. You seem focused on the comparison with evolution, which is only a minor part of Bio Anchors, and used primarily as an upper bound. (You say "the number is so vastly large (and actually unknown due to the 'level of details' problem) that it's not really relevant for timelines calculations," but actually Bio Anchors still estimates that the evolution anchor implies a ~50% chance of transformative AI this century.)

Generally, I don't see "A and B are very different" as a knockdown counterargument to "If A requir... (read more)

3Adam Shimi2y
Thanks for the answer! Unfortunately, I don't have the time at the moment to answer in detail and have more of a conversation, as I'm fully focused on writing a long sequence about pushing for pluralism in alignment and extracting the core problem out of all the implementation details and additional assumptions. I plan on going back to analyzing timeline research in the future, and will probably give better answers then.

That being said, here are quick-fire thoughts:

  • I used the evolution case because I consider it the most obvious/straightforward case, in that it sounds so large that everyone instantly assumes that it gives you an upper bound.
  • My general impression about this report (and one I expect Yudkowsky to share) is that it didn't make me update at all. I already updated from GPT and GPT3, and I didn't find new bits of evidence in the report and the discussions around it, despite the length of it. My current impression (please bear in mind that I haven't taken the time to study the report from that angle, so I might change my stance) is that this report, much like a lot of timeline work, seems like it takes as input a lot of assumptions, and gives as output far less than was assumed. It's the opposite of compression — a lot of assumptions are needed to conclude things that aren't that strong and constraining.
1Matthew Barnett2y
Thanks for the thoughtful reply. Here's my counter-reply.

You frame my response as indicating "disagreements". But my tweet said "I broadly agree" with you, and merely pointed out ways that I thought your statements were misleading. I do just straight up disagree with you about two specific non-central claims you made, which I'll get to later. But I'd caution against interpreting me as disagreeing with you by any degree greater than what is literally implied by what I wrote.

Before I get to the specific disagreements, I'll just bicker about some points you made in response to me. I think this sort of quibbling could last forever and it would serve little purpose to continue past this point, so I release you from any obligation you might think you have to reply to these points. However, you might still enjoy reading my response here, just to understand my perspective in a long-form non-Twitter format.

Note: I continued to edit my response after I clicked "submit", after realizing a few errors of mine. Apologies if you read an erroneous version.

My quibbles with what you wrote

You said,

The fact that the median for the conservative analysis is right at 2100 (which indeed is part of the 21st century) means that when you said, "You can run the bio anchors analysis in a lot of different ways, but they all point to transformative AI this century", you were technically correct, by the slimmest of margins.

I had the sense that many people might interpret your statement as indicating a higher degree of confidence; that is, maybe something like "even the conservative analysis produces a median prediction well before 2100."

Maybe no one misinterpreted you like that!

It's very reasonable to think that no one would have misinterpreted you. But this incorrect interpretation of your statement was, at least to me, the thinking that I remember having at the time I read the sentence.

I intend to produce fuller thoughts on this point in the coming months. In short:

The Bio Anchors report is intended as a tool for making debates about AI timelines more concrete, for those who find some bio-anchor-related bound helpful (e.g., some think we should lower bound P(AGI) at some reasonably high number for any year in which we expect to hit a particular kind of "biological anchor"). Ajeya's work lengthened my own timelines, because it helped me understand that some bio-anchor-inspired arguments for shorter timelines didn't have as much going for them as I'd thought; but I think it may have shortened some other folks'.

(The pre... (read more)

I agree with this. I often default to acting as though we have ~10-15 years, partly because I think leverage is especially high conditional on timelines in that rough range.

I'm not sure why this isn't a very general counterexample. Once we've decided that the human imitator is simpler and faster to compute, don't all further approaches (e.g., penalizing inconsistency) involve a competitiveness hit along these general lines? Aren't they basically designed to drag the AI away from a fast, simple human imitator toward a slow, complex reporter? If so, why is that better than dragging the AI from a foreign ontology toward a familiar ontology?

3Mark Xu2y
There is a distinction between the way that the predictor is reasoning and the way that the reporter works. Generally, we imagine that the predictor is trained the same way the "unaligned benchmark" we're trying to compare to is trained, and the reporter is the thing that we add onto that to "align" it (perhaps by only training another head on the model, perhaps by finetuning). Hopefully, the cost of training the reporter is small compared to the cost of the predictor (maybe like 10% or something).

In this frame, doing anything to change the way the predictor is trained results in a big competitiveness hit, e.g. forcing the predictor to use the same ontology as a human is potentially going to prevent it from using concepts that make reasoning much more efficient. However, training the reporter in a different way, e.g. doubling the cost of training the reporter, only takes you from 10% of the predictor to 20%, which is not that bad of a competitiveness hit (assuming that the human imitator takes 10% of the cost of the original predictor to train).

In summary, competitiveness for ELK proposals primarily means that you can't change the way the predictor was trained. We are already assuming/hoping the reporter is much cheaper to train than the predictor, so making the reporter harder to train results in a much smaller competitiveness hit.
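
The arithmetic in this reply can be made concrete with a minimal sketch. The function name and the choice to normalize the predictor's training cost to 1.0 are illustrative assumptions, not anything from the ELK report; the 10% and 20% figures are the ones used in the reply above.

```python
def total_training_cost(predictor_cost: float, reporter_fraction: float) -> float:
    """Cost of training the predictor plus a reporter.

    reporter_fraction is the reporter's cost as a fraction of the
    predictor's cost (hypothetical parameterization).
    """
    return predictor_cost * (1.0 + reporter_fraction)


# Assumption: the predictor (the "unaligned benchmark") costs 1.0 units.
baseline = total_training_cost(1.0, 0.0)           # predictor alone
cheap_reporter = total_training_cost(1.0, 0.10)    # reporter at 10% of predictor
doubled_reporter = total_training_cost(1.0, 0.20)  # reporter cost doubled to 20%

# Overhead of each setup relative to the unaligned benchmark.
overhead_cheap = cheap_reporter / baseline - 1.0
overhead_doubled = doubled_reporter / baseline - 1.0

# Doubling the reporter's cost grows the *total* by 1.2 / 1.1 - 1, not by 100%.
relative_increase = doubled_reporter / cheap_reporter - 1.0
```

Under these assumed numbers, doubling the reporter's cost raises total training cost by roughly 9%, whereas any change that slows down the predictor itself scales the dominant term.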

Can you explain this: "In Section: specificity we suggested penalizing reporters if they are consistent with many different predictors, which effectively allows us to use consistency to compress the predictor given the reporter." What does it mean to "use consistency to compress the predictor given the reporter" and how does this connect to penalizing reporters if they are consistent with many different predictors?

1Mark Xu2y
A different way of phrasing Ajeya's response, which I think is roughly accurate, is that if you have a reporter that gives consistent answers to questions, you've learned a fact about the predictor, namely "the predictor was such that when it was paired with this reporter it gave consistent answers to questions." If there were 8 predictors for which this fact was true, then "it's the [7th] predictor such that when it was paired with this reporter it gave consistent answers to questions" is enough information to uniquely determine the predictor, i.e. the previous fact + 3 additional bits was enough. If the predictor was 1000 bits, the fact that it was consistent with a reporter "saved" you 997 bits, compressing the predictor into 3 bits.

The hope is that maybe the honest reporter "depends" on larger parts of the predictor's reasoning, so fewer predictors are consistent with it, so the fact that a predictor is consistent with the honest reporter allows you to compress the predictor more. As such, searching for reporters that most compress the predictor would prefer the honest reporter. However, the best way for a reporter to compress a predictor is to simply memorize the entire thing, so if the predictor is simple enough and the gap between the complexity of the human-imitator and the direct translator is large enough, then the human-imitator + memorized predictor is the simplest thing that maximally compresses the predictor.
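
The bit-counting in this reply can be checked with a small sketch. The function names are hypothetical, and treating "consistent predictors" as a plain count that is indexed with a fixed-length code is a simplifying assumption; the 1000-bit and 8-predictor figures come from the reply above.

```python
import math


def index_bits(num_consistent_predictors: int) -> int:
    """Bits needed to pick one predictor out of those consistent with the reporter."""
    if num_consistent_predictors < 1:
        raise ValueError("need at least one consistent predictor")
    return math.ceil(math.log2(num_consistent_predictors))


def bits_saved(predictor_bits: int, num_consistent_predictors: int) -> int:
    """Bits saved by describing the predictor as 'the k-th consistent one'."""
    return predictor_bits - index_bits(num_consistent_predictors)


# Numbers from the comment: a 1000-bit predictor, 8 consistent predictors.
needed = index_bits(8)       # bits to say "the 7th of 8"
saved = bits_saved(1000, 8)  # bits no longer needed to describe the predictor

# A more specific reporter (fewer consistent predictors) compresses more.
saved_more_specific = bits_saved(1000, 2)
```

With 8 consistent predictors the index costs 3 bits and saves 997; cutting the consistent set to 2 saves 999, which is the sense in which a more specific reporter "compresses the predictor more."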
2Ajeya Cotra2y
Warning: this is not a part of the report I'm confident I understand all that well; I'm trying anyway and Paul/Mark can correct me if I messed something up here. I think the idea here is like:

  • We assume there's some actual true correspondence between the AI Bayes net and the human Bayes net (because they're describing the same underlying reality that has diamonds and chairs and tables in it).
  • That means that if we have one of the Bayes nets, and the true correspondence, we should be able to use that to rederive the other Bayes net. In particular the human Bayes net plus the true correspondence should let us reconstruct the AI Bayes net; false correspondences that just do inference from observations in the human Bayes net wouldn't allow us to do this since they throw away all the intermediate info derived by the AI Bayes net.
  • If you assume that the human Bayes net plus the true correspondence are simpler than the AI Bayes net, then this "compresses" the AI Bayes net because you just wrote down a program that's smaller than the AI Bayes net which "unfolds" into the AI Bayes net.
  • This is why the counterexample in that section focuses on the case where the AI Bayes net was already so simple to describe that there was nothing left to compress, and the human Bayes net + true correspondence had to be larger.

Here are a couple of hand-wavy "stub" proposals that I sent over to ARC, which they thought were broadly intended to be addressed by existing counterexamples. I'm posting them here so they can respond and clarify why these don't qualify.

*Proposal 1: force ontological compatibility*

On page 34 of the ELK gdoc, the authors talk about the possibility that training an AI hard enough produces a model that has deep mismatches with human ontology - that is, it has a distinct "vocabulary of basic concepts" (or nodes in a Bayes net) that are distinct from the ones h... (read more)

2Paul Christiano2y
I think that a lot depends on what kind of term you include. If you just say "find more interesting things" then the model will just have a bunch of neurons designed to look interesting. Presumably you want them to be connected in some way to the computation, but we don't really have any candidates for defining that in a way that does what you want.

In some sense I think if the digital neuroscientists are good enough at their job / have a good enough set of definitions, then this proposal might work. But I think that the magic is mostly being done in the step where we make a lot of interpretability progress, and so if we define a concrete version of interpretability right now it will be easy to construct counterexamples (even if we define it in terms of human judgments). If we are just relying on the digital neuroscientists to think of something clever, the counterexample will involve something like "they don't think of anything clever." In general I'd be happy to talk about concrete proposals along these lines.

(I agree with Ajeya and Mark that the hard case for this kind of method is when the most efficient way of thinking is totally alien to the human. I think that can happen, and in that case in order to be competitive you basically just need to learn an "interpreted" version of the alien model. That is, you need to basically show that if there exists an alien model with performance X, there is a human-comprehensible model with performance X, and the only way you'll be able to argue that is by showing that for any model we can define a human-comprehensible model with similar complexity and the same behavior.)

Again trying to answer this one despite not feeling fully solid. I'm not sure about the second proposal and might come back to it, but here's my response to the first proposal (force ontological compatibility):

The counterexample "Gradient descent is more efficient than science" should cover this proposal because it implies that the proposal is uncompetitive. Basically, the best Bayes net for making predictions could just turn out to be the super incomprehensible one found by unrestricted gradient descent, so if you force ontological compatibility then you ... (read more)

Regarding this:

The bad reporter needs to specify the entire human model, how to do inference, and how to extract observations. But the complexity of this task depends only on the complexity of the human’s Bayes net.

If the predictor's Bayes net is fairly small, then this may be much more complex than specifying the direct translator. But if we make the predictor's Bayes net very large, then the direct translator can become more complicated — and there is no obvious upper bound on how complicated it could become. Eventually direct translation will be more co

... (read more)
2Paul Christiano2y
Yes, I agree that something similar applies to complexity as well as computation time. There are two big reasons I talk more about computation time:

  • It seems plausible we could generate a scalable source of computational difficulty, but it's less clear that there exists a scalable source of description complexity (rather than having some fixed upper bound on the complexity of "the best thing a human can figure out by doing science.")
  • I often imagine the assistants all sharing parameters with the predictor, or at least having a single set of parameters. If you have lots of assistant parameters that aren't shared with the predictor, then it looks like it will generally increase the training time a lot. But without doing that, it seems like there's not necessarily that much complexity the predictor doesn't already know about. (In contrast, we can afford to spend a ton of compute for each example at training time since we don't need that many high-quality reporter datapoints to rule out the bad reporters. So we can really have giant ratios between our compute and the compute of the model.)

But I don't think these are differences in kind and I don't have super strong views on this.

(Note: I read an earlier draft of this report and had a lot of clarifying questions, which are addressed in the public version. I'm continuing that process here.)

I get the impression that you see most of the "builder" moves as helpful (on net, in expectation), even if there are possible worlds where they are unhelpful or harmful. For example, the "How we'd approach ELK in practice" section talks about combining several of the regularizers proposed by the "builder." It also seems like you believe that combining multiple regularizers would create a "stacking... (read more)

5Paul Christiano2y
This is because of the remark on ensembling---as long as we aren't optimizing for scariness (or diversity for diversity's sake), it seems like it's way better to have tons of predictors and then see if any of them report tampering. So adding more techniques improves our chances of getting a win. And if the cost of fine-tuning a reporter is small relative to the cost of training the predictor, we can potentially build a very large ensemble relatively cheaply. (Of course, having more techniques also helps because you can test many of them in practice and see which of them seem to really help.)

This is also true for data---I'd be scared about generating a lot of riskier data, except that we can just do both and see if either of them reports tampering in a given case (since they appear to fail for different reasons).

I believe this in a few cases (especially combining "compress the predictor," imitative generalization, penalizing upstream dependence, and the kitchen sink of consistency checks) but mostly the stacking is good because ensembling means that having more and more options is better and better.

I don't think the kind of methodology used in this report (or by ARC more generally) is very well-equipped to answer most of these questions. Once we give up on the worst case, I'm more inclined to do much messier and more empirically grounded reasoning. I do think we can learn some stuff in advance, but doing so requires getting really serious about it (while still wanting to learn from early experiments and mostly focusing on designing experiments) rather than taking potshots. This is related to a lot of my skepticism about other theoretical work.

I do expect the kind of research we are doing now to help with ELK in practice even if the worst case problem is impossible. But the particular steps we are taking now are mostly going to help by suggesting possible algorithms and difficulties; we'd then want to give those as one input into that much messier