# All of David Scott Krueger (formerly: capybaralet)'s Comments + Replies

I'm not necessarily saying people are subconsciously trying to create a moat.

I'm saying they are acting in a way that creates a moat, which enables them to avoid competition; more competition would give them more motivation to write things up for academic audiences (or even just to write more clearly for non-academic audiences).

2Daniel Kokotajlo1d
It sure sounds like you are saying that though! Before you put in the EtA, it sure sounded like you were saying that people were subconsciously motivated to avoid academic publishing because it helped them build and preserve a moat. Now, after the EtA, it still sounds like that but is a bit more unclear since 'indirect' is a bit more ambiguous than 'subconscious.'

Q: "Why is that not enough?"
A: Because they are not being funded to produce the right kinds of outputs.

My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere where your work could fit.

Speaking for myself…

I think I do a lot of “engaging with neuroscientists” despite not publishing peer-reviewed neuroscience papers:

• I write lots of blog posts intended to be read by neuroscientists, i.e. I will attempt to engage with background assumptions that neuroscientists are likely to have, not assume non-neuroscience background knowledge or jargon, etc.
• [To be clear, I also write even more blog posts that are not in that category.]
• When one of my blog posts specifically discusses some neuroscientist’s work, I’ll sometimes cold-email them and ask for pr
...

My point (see footnote) is that motivations are complex.  I do not believe "the real motivations" is a very useful concept here.

The question becomes: why don't they judge those costs to be worth it?  Is there motivated reasoning involved?  Almost certainly yes; there always is.

2Daniel Kokotajlo2d
Here are two hypotheses for why they don't judge those costs to be worth it, each one of which is much more plausible to me than the one you proposed: (1) The costs aren't in fact worth it & they've reacted appropriately to the evidence. (2) The costs are worth it, but thanks to motivated reasoning, they exaggerate the costs, because writing things up in academic style and then dealing with the publication process is boring and frustrating. Seriously, isn't (2) a much better hypothesis than the one you put forth about moats?
1. A lot of work just isn't made publicly available
2. When it is, it's often in the form of ~100 page google docs
3. Academics have a number of good reasons to ignore things that don't meet academic standards of rigor and presentation
2David Scott Krueger3d
In my experience people also often know their blog posts aren't very good.

1Andrea_Miotti12d
Which one? All of them seem to be working for me.

Yeah this was super unclear to me; I think it's worth updating the OP.

FYI: my understanding is that "data poisoning" refers to deliberately corrupting the training data of somebody else's model, which I understand is not what you are describing.

Oh I see.  I was getting at the "it's not aligned" bit.

Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:

• I'm in control
• The machine part is in control
• Something in the middle

Only the first one seems likely to be sufficiently aligned.

I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood effects.  A colleague of mine had this to say:

...

Indeed.  I think having a clean, well-understood interface for human/AI interaction seems useful here.  I recognize this is a big ask in the current norms and rules around AI development and deployment.

I don't understand what you're getting at RE "personal level".

0rpglover641mo
Like, I may not want to become a cyborg if I stop being me, but that's a separate concern from whether it's bad for alignment (if the resulting cyborg is still aligned).

I think the most fundamental objection to becoming cyborgs is that we don't know how to say whether a person retains control over the cyborg they become a part of.

6Nicholas Kees Dupuis1mo
I agree that this is important. Are you more concerned about cyborgs than other human-in-the-loop systems? To me the whole point is figuring out how to make systems where the human remains fully in control (unlike, e.g. delegating to agents), and so answering this "how to say whether a person retains control" question seems critical to doing that successfully.
3rpglover641mo
I think that's an important objection, but I see it applying almost entirely on a personal level. On the strategic level, I actually buy that this kind of augmentation (i.e. with in some sense passive AI) is not an alignment risk (any more than any technology is). My worry is the "dual use technology" section.

FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.

Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die.  Even existential risk has this potential, actually, but I think it's a safer bet.

I don't think we should try and come up with a special term for (1).
The best term might be "AI engineering".  The only thing it needs to be distinguished from is "AI science".

I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.

I say it is a rebrand of the "AI (x-)safety" community.
When AI alignment came along we were calling it AI safety, even though it was really basically AI existential safety all along that everyone in the community meant.  "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, which then led to dilution and confusion, necessitating a new term.

I don't think the history is that important; what's important is having good terminology going forward.
This is also why I stress that I work on AI existential safety.

So I think people shou...

Hmm... this is a good point.

I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way.  One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.

I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”.

1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur.  Intentions are perhaps best thought of as incomplete specifications.

2) From our current position, I think “never ever create...

2Steve Byrnes1mo
1) Oh, sorry, what I meant was, the generals in Country A want their AGI to help them “win the war”, even if it involves killing people in Country B + innocent bystanders. And vice-versa for Country B. And then, between the efforts of both AGIs, the humans are all dead. But nothing here was either an “AGI accident unintended-by-the-designers behavior” nor “AGI misuse” by my definitions. But anyway, yes I can imagine situations where it’s unclear whether “the AGI does things specifically intended by its designers”. That’s why I said “pretty solid” and not “rock solid” :) I think we probably disagree about whether these situations are the main thing we should be talking about, versus edge-cases we can put aside most of the time. From my perspective, they’re edge-cases. For example, the scenarios where a power-seeking AGI kills everyone are clearly on the “unintended” side of the (imperfect) dichotomy. But I guess it’s fine that other people are focused on different problems from me, and that “intent-alignment is poorly defined” may be a more central consideration for them. ¯\_(ツ)_/¯ 3) I like your “note on terminology post”. But I also think of myself as subscribing to “the conventional framing of AI alignment”. I’m kinda confused that you see the former as counter to the latter. If you’re working on that, then I wish you luck! It does seem maybe feasible to buy some time. It doesn’t seem feasible to put off AGI forever. (But I’m not an expert.) It seems you agree. * Obviously the manual will not be written by one person, and obviously some parts of the manual will not be written until the endgame, where we know more about AGI than we do today [https://www.lesswrong.com/posts/MCWGCyz2mjtRoWiyP/endgame-safety-for-agi]. But we can still try to make as much progress on the manual as we can, right? * The post you linked [https://coordination.substack.com/p/alignment-is-not-enough] says “alignment is not enough”, which I see as obviously true, but t

While defining accident as “incident that was not specifically intended & desired by the people who pressed ‘run’ on the AGI code” is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks.

There are other reasons I would not be happy with that browser extension.  There is not one specific conversation I can point to; it comes up regularly.  I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accide...

4Steve Byrnes2mo
Thanks for your reply! It continues to feel very bizarre to me to interpret the word “accident” as strongly implying “nobody was being negligent, nobody is to blame, nobody could have possibly seen it coming, etc.”. But I don’t want to deny your lived experience. I guess you interpret the word “accident” as having those connotations, and I figure that if you do, there are probably other people who do too. Maybe it’s a regional dialect thing, or different fields use the term in different ways, who knows. So anyway, going forward, I will endeavor to keep that possibility in mind and maybe put in some extra words of clarification where possible to head off misunderstandings. :) I agree with this point. I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”. I want to use the word “accident” universally for all bad outcomes downstream of (B), regardless of how grossly negligent and reckless people were etc., whereas you don’t want to use the word “accident”. OK, let’s put that aside. I think (?) that we both agree that bad outcomes downstream of (A) are not necessarily related to “misuse” / “bad actors”. E.g., if there’s a war with AGIs on both sides, and humans are wiped out in the crossfire, I don’t necessarily want to say that either side necessarily was “bad actors”, or that either side’s behavior constitutes “misuse”. So yeah, I agree that “accident vs misuse” is not a good dichotomy for AGI x-risk. Thanks, that’s interesting. I didn’t intend my chart to imply that “everyone follows the manual” doesn’t also require avoiding coordination problems and avoiding bad decisions etc. Obviously it does—or at least, that was obvious to me. Anyway, your feedback is noted. :) I agree that “never ever create AGI” is an option in principle. (It doesn’t strike me as a feasible option in practice; does it to you? I know this is off-top

I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper).  It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper.  So then you would never have $C(\pi) \gg C(U)$.  What am I missing/misunderstanding?

2Vanessa Kosoy2mo
For the contrived reward function you suggested, we would never have C(π)≫C(U). But for other reward functions, it is possible that C(π)≫C(U). Which is exactly why this framework rejects the contrived reward function in favor of those other reward functions. And also why this framework considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.

By "intend" do you mean that they sought that outcome / selected for it?
Or merely that it was a known or predictable outcome of their behavior?

I think "unintentional" would already probably be a better term in most cases.

Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...

We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do.  I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
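As a minimal sketch of that wrapper construction (names and interfaces are my own, not taken from the paper): the induced reward pays off only while the agent's action history exactly matches what the fixed policy would have done.

```python
# Illustrative sketch of the policy->reward "wrapper" construction:
# given a deterministic policy, build a reward function that pays 1
# iff every action taken so far is exactly the policy's action.
# (Hypothetical names; not from Leike et al. 2018.)

def make_wrapper_reward(policy):
    """policy: maps a history (list of (observation, action) pairs) to an action.
    Returns a reward function over histories."""
    def reward(history):
        for t in range(len(history)):
            prefix = history[:t]          # history before step t
            _, action = history[t]        # action actually taken at step t
            if action != policy(prefix):  # any deviation forfeits reward
                return 0.0
        return 1.0
    return reward

# Example: a trivial policy that always outputs "a"
always_a = lambda hist: "a"
r = make_wrapper_reward(always_a)
print(r([("o1", "a"), ("o2", "a")]))  # 1.0 (matched the policy throughout)
print(r([("o1", "a"), ("o2", "b")]))  # 0.0 (deviated at step 2)
```

The point of the sketch: the reward function contains the policy verbatim plus a constant-size matching loop, which is why its description length tracks that of the policy.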

It seems like this means that, for any policy, we can represent it as optimizing re...

2Vanessa Kosoy2mo
My framework discards such contrived reward functions because it penalizes for the complexity of the reward function. In the construction you describe, we have C(U)≈C(π). This corresponds to g≈0 (no/low intelligence). On the other hand, policies with g≫0 (high intelligence) have the property that C(π)≫C(U) for the U which "justifies" this g. In other words, your "minimal" overhead is very large from my point of view: to be acceptable, the "overhead" should be substantially negative.

"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context.  I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk.  I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents a...

I agree somewhat, however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter.  The former easily becomes too political, making coordination harder.

Yes it may be useful in some very limited contexts.  I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing.

AI is highly non-analogous with guns.

I think "inadequate equilibrium" is too specific and insider-jargon-y.

I really don't think the distinction is meaningful or useful in almost any situation.  I think if people want to make something like this distinction they should just be more clear about exactly what they are talking about.

How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?

I’m guessing that you’re going to say “That’s not a useful distinction because (B) is stupid. Obviously nobody is talking about (B)”. In which case, my response is “The things that are obvious to you and me are not necessarily obvious to people who are new to thinking carefully about AGI x-risk.”

…And in particular, normal people s...

This is a great post.  Thanks for writing it!  I think Figure 1 is quite compelling and thought provoking.
I began writing a response, and then realized a lot of what I wanted to say has already been said by others, so I just noted where that was the case.  I'll focus on points of disagreement.

Summary: I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

A high-level counter-argument I didn't see others making:

• I wasn't entirely sure what was your argument that long-term planning ability s
...

This is a great post.  Thanks for writing it!

I agree with a lot of the counter-arguments others have mentioned.

Summary:

• I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

• High-level counter-arguments already argued by Vanessa:
• This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
• Humans have not reached the limits of predictive ability

• Predi
...

This post tacitly endorses the "accident vs. misuse" dichotomy.
Every time this appears, I feel compelled to mention that I think it is a terrible framing.
I believe the large majority of AI x-risk is best understood as "structural" in nature: https://forum.effectivealtruism.org/posts/oqveRcMwRMDk6SYXM/clarifications-about-structural-risk-from-ai

2Steve Byrnes2mo
If you think some more specific aspect of this post is importantly wrong for reasons that are downstream of that, I’d be curious to hear more details. In this post, I’m discussing a scenario where one AGI gets out of control and kills everyone. If the people who programmed and turned on that AGI were not omnicidal maniacs who wanted to wipe out humanity, then I call that an “accident”. If they were omnicidal maniacs then I call that a “bad actor” problem. I think that omnicidal maniacs are very rare in the human world, and therefore this scenario that I’m talking about is an “accident” scenario. From reading the post you linked, my best guess is that (1) You’re not thinking about this scenario in the first place, (2) If you are, then you would say something like “When we use the word ‘accident’, it suggests ‘avoidable accident caused by stupid or reckless people’, but maybe the real problem was a race-to-the-bottom on safety, which need not be avoidable and need not involve any stupid or reckless people.” If it’s (1), this is a whole long post is about why that scenario seems very difficult to avoid, from my current perspective. (If your argument is “that scenario won’t happen because something else will kill us first”, then I happen to disagree, but that’s off-topic for this post.) If it’s (2), well I don’t see why the word “accident” has to have that connotation. It doesn’t have that connotation to me. I think it’s entirely possible for people who are neither stupid nor reckless to cause x-risk by accident. A lot of this post is about structural factors, seems to me, and Section 3.4 in particular seems to be an argument which is structural in nature, where I note that by default, more and more people are going to train more and more powerful AGIs, and thus somebody is going to make one that is motivated to cause mass destruction sooner or later—even if there isn’t a race-to-the-bottom on safety / alignment, and certainly if there is. That’s a structural argu

I understand your point of view and think it is reasonable.

However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other.  I see the argument you are making, but I think success on these asks is likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.

I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL...

(A very quick response):

Agree with (1) and (2).
I am ambivalent RE (3) and the replaceability arguments.
RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".

it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe

I would say "it may be better, and people should seriously consider this" not "it is better".

I don't really buy this argument.

• I think the following is a vague and slippery concept: "a kind of generalized reverse-engineering/interpretability skill".  But I agree that you would want to do testing, etc. of any system before you deploy it.
• It seems like the ambitious goal of mechanistic interpretability, which would get you the kind of safety properties we are after, would indeed require explaining how the entire process works. But when we are talking about such a complex system, it seems the main obstacle to understanding for either approa
...

Hmm I feel a bit damned by faint praise here... it seems like more than type-checking, you are agreeing substantively with my points (or at least, I fail to find any substantive disagreement with/in your response).

Perhaps the main disagreement is about the definition of interpretability, where it seems like the goalposts are moving... you say (paraphrasing) "interpretability is a necessary step to robustly/correctly grounding symbols".  I can interpret that in a few ways:

1. "interpretability := mechanistic interpretability (as it is currently practiced)"
...

It could work if you can use interpretability to effectively prohibit this from happening before it is too late.  Otherwise, it doesn't seem like it would work.

Actually I really don't think it does... the argument there is that:

• interpretability is about understanding how concepts are grounded.
• symbolic methods don't tell us anything about how their concepts are grounded.

This is only tangentially related to the point I'm making in my post, because:

• A lot of interpretability is about discovering how concepts are used in a higher-level algorithm, and the argument doesn't apply there.
• I am comparing mechanistic interpretability of neural nets with neuro-symbolic methods.
• One point of using such methods is to enforce or encourage certain high-level algorithmic properties, e.g. modularity.
3johnswentworth5mo
That counterargument does at least typecheck, so we're not talking past each other. Yay! In the context of neurosymbolic methods, I'd phrase my argument like this: in order for the symbols in the symbolic-reasoning parts to robustly mean what we intended them to mean (e.g. standard semantics in the case of natural language), we need to pick the right neural structures to "hook them up to". We can't just train a net to spit out certain symbols given certain inputs and then use those symbols as though they actually correspond to the intended meaning, because <all the usual reasons why maximizing a training objective does not do the thing we intended>. Now, I'm totally on board with the general idea of using neural nets for symbol grounding and then building interpretable logic-style stuff on top of that. (Retargeting the Search [https://www.lesswrong.com/posts/w4aeAFzSAguvqA5qu/how-to-go-from-interpretability-to-alignment-just-retarget] is an instance of that general strategy, especially if we use a human-coded search algorithm.) But interpretability is a necessary step to do that, if we want the symbols to be robustly correctly grounded. On to the specifics: I partially buy that. It does seem to me that a lot of people doing interpretability don't really seem have a particular goal in mind, and are just generally trying to understand what's going on. Which is not necessarily bad; understanding basically anything in neural nets (including higher-level algorithms) will probably help us narrow in on the answers to the key questions. But it means that a lot of work is not narrowly focused on the key hard parts (i.e. how to assign external meaning to internal structures). Insofar as the things passing between modules are symbols whose meaning we don't robustly know, the same problem comes up. The usefulness of structural/algorithmic properties is pretty limited, if we don't have a way to robustly assign meaning to the things passing between the parts.

I think that it implicitly uses the words 'mechanistic interpretability' differently than people typically do.

I disagree.  I think in practice people say mechanistic interpretability all the time and almost never say these other more specific things.  This feels a bit like moving the goalposts to me.  And I already said in the caveats that it could be useful even if the most ambitious version doesn't pan out.

For instance, we don't need to understand how models predict whether the next token is ' is' or ' was' in order to be able to ga

...

we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine.

If this is true, then it makes (mechanistic) interpretability much harder as well, as we'll need our interpretability tools to somehow teach us these underlying principles, as you go on to say.  I don't think this is the primary stated motivation for mechanistic interpretability.  The main stated motivations seem to be roughly "We can figure out if the model is doing bad (e.g. deceptive) stuff and then do one or more of: 1) not deploy it, 2) not build systems that way, 3) train against our operationalization of deception"

I agree it's a spectrum.  I would put it this way:

• For any point on the spectrum there is some difficulty in achieving it.
• We can approach that point from either direction, 1) starting with a "big blob of compute" and encountering the difficulty in extracting these pieces from the blob, or 2) starting with assembling the pieces, and encountering the difficulty in figuring out how to assemble them.
• It's not at all clear that (1) would be easier than (2).
• Probably it's best to do some of both.

Regarding difficulty of (1) vs. (2), OTMH, there may ...

3Scott Emmons5mo
I'd be interested to hear in more detail why you're unconvinced.

Responding in order:
1) Yeah, I wasn't saying it's what her post is about.  But I think you can get to more interesting, cruxy stuff by interpreting it that way.
2) yep it's just a caveat I mentioned for completeness.
3) Your spontaneous reasoning doesn't say that we/it get(/s) good enough at getting it to output things humans approve of before it kills us.  Also, I think we're already at "we can't tell if the model is aligned or not", but this won't stop deployment.  I think the default situation isn't that we can tell if things are going wrong, but people won't be careful enough even given that, so maybe it's just a difference of perspective or something... hmm.......

TBC, "more concerned" doesn't mean I'm not concerned about the other ones... and I just noticed that I make this mistake all the time when reading people say they are more concerned about present-day issues than x-risk....... hmmm........

This is a great response to a great post!
I mostly agree with the points made here, so I'll just highlight differences in my views.

Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable".  I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").

A few other notable differences:

• We don't necessarily ne
...
2Noosphere895mo
This is a crux for me, as it is why I don't think slow takeoff is good by default. I think deceptive alignment is the default state barring interpretability efforts that are strong enough to actually detect mesa-optimizers or myopia. Yes, Foom is probably not going to happen, but in my view that doesn't change much regarding risk in total.
2Erik Jenner5mo
Thanks for the interesting comments! Yeah, I didn't understand Katja's post as arguing (1), otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my crux is mainly how much slow takeoff really helps (most of the reasons I expect iterative design to fail for AI still apply, e.g. deception or "getting what we measure"). I didn't really see arguments in Katja's post for why slow takeoff means we're fine. Agreed, and I think this is a weakness of our post. I have a sense that most of the arguments you could make using the "existentially secure state" framing could also be made more generally, but I haven't figured out a framing I really like yet unfortunately. Would be interested in discussing this more at some point. Given your comment, I'd now guess I dismissed this too quickly and there are things I haven't thought of. My spontaneous reasoning for being less concerned about this is something like "the better our models become (e.g. larger and larger pretrained models), the easier it should be to make them output things humans approve of". An important aspect is also that this is the type of problem where it's more obvious if things are going wrong (i.e. iterative design should work here---as long as we can tell the model isn't aligned yet, it seems more plausible we can avoid deploying it).

"Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.

(sort of nitpicking):
I think it makes more sense to look for the highest density in pixel space; this requires integrating over all settings of the latents (unless your generator is invertible, in which case you can just use change o...
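To illustrate why the latent-space mode and the pixel-space mode can differ (a toy example of my own, not from the thread): take the 1-D "generator" x = exp(z) with z ~ N(0,1). The latent mode z = 0 maps to x = 1, but by the change-of-variables formula the density in x-space is lognormal, which peaks at x = exp(-1):

```python
# Toy demonstration: for an invertible generator x = exp(z), z ~ N(0,1),
# the image of the latent mode (x = 1) is NOT the mode of the induced
# pixel-space density, because of the change-of-variables Jacobian.
import numpy as np

def log_p_x(x):
    # p_x(x) = p_z(log x) * |dz/dx| = N(log x; 0, 1) / x   (lognormal)
    z = np.log(x)
    return -0.5 * z**2 - np.log(x) - 0.5 * np.log(2 * np.pi)

xs = np.linspace(0.01, 3.0, 10000)
x_mode = xs[np.argmax(log_p_x(xs))]
print(x_mode)  # close to exp(-1) = 0.368..., not 1.0
```

The same logic is why "set the latents to zero" finds the latent mode but not (in general) the highest-density point in pixel space, which requires accounting for (or, in the non-invertible case, integrating over) the mapping.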

Author here -- yes, we got this comment from reviewers in the most recent round as well.  ADS is a bit more general than performative prediction, since it applies outside of prediction contexts.  Still very closely related.

On the other hand, the point of our work is something that people in the performative prediction community seem to be only slowly approaching: the incentive for ADS.  Work on CIDs is much more related in that sense.

As a historical note: we started working on this in March or April 2018; performative prediction was on arXiv in Feb 2020, while ours was at a safety workshop in mid-2019, but not on arXiv until Sept 2020.