All of johnswentworth's Comments + Replies

In summary, saying "accident" makes it sound like an unpredictable effect, instead of a painfully obvious risk that was not taken seriously enough.

Personally, I usually associate "accident" with "painfully obvious risk that was not actually mitigated" (note the difference in wording from "not taken seriously enough"). IIUC, that's usually how engineering/industrial "accidents" work, and that is the sort of association I'd expect someone used to thinking about industrial "accidents" to have.

Yes! There are two ways that can be relevant. First, a ton of bits presumably come from unsupervised learning of the general structure of the world. That part also carries over to natural abstractions/minimal latents: the big pile of random variables from which we're extracting a minimal latent is meant to represent things like all those images the toddler sees over the course of their early life.

Second, sparsity: most of the images/subimages which hit my eyes do not contain apples. Indeed, most images/subimages which hit my eyes do not contain instances of ... (read more)

Point is that the "Structural(Inner) prediction method" doesn't seem particularly likely to generalize across things-which-look-like-big-neural-nets. It more plausibly generalizes across things-which-learn-to-perform-diverse-tasks-in-diverse-environments, but I don't think the neural net aspect is carrying very much weight there.

4 Daniel Kokotajlo 1mo
OK, on reflection I think I tentatively agree with that.

This is some evidence that it'll work for AGIs too; after all, both humans and AGIs are massive neural nets that learn to perform diverse tasks in diverse environments.

Highly debatable whether "massive neural nets that learn to perform diverse tasks in diverse environments" is a natural category. "Massive neural net" is not a natural category - e.g. transformers vs convnets vs boltzmann machines are radically different things, to the point where understanding one tells us very little about the others. The embedding of interpretable features of one does not... (read more)

2 Daniel Kokotajlo 1mo
Naturalness of categories is relative. Of course there are important differences between different kinds of massive neural nets that learn to perform diverse tasks in diverse environments. I still think it's fair to draw a circle around all of them to distinguish them from e.g. software like Microsoft Word, or AIXI, or magic quantum suicide outcome pumps, or bacteria.

Great post! I think the things said in the post are generally correct - in particular, I agree with the overall point that objective-centric arguments (e.g. power-seeking) are plausible, and therefore support a high enough probability of doom to justify alignment work, but aren't sufficiently probable to justify a very high probability of doom.

That said, I do think a very high probability of doom can be justified. The arguments have to route primarily through failure of the iterative design loop for AI alignment in particular, rather than primarily through... (read more)

3 Rohin Shah 1mo
Yeah, I didn't talk about that argument, or the argument that multiagent effects lead to effectively-randomly-chosen world states (see ARCHES), because those arguments don't depend on how future AI systems will generalize and so were outside the scope of this post. A full analysis of p(doom) would need to engage with such arguments.

Priors against Scenario 2. Another possibility is that given only the information in Scenario 1, people had strong priors against the story in Scenario 2, such that they could say “99% likely that it is outer misalignment” for Scenario 1, which gets rounded to “outer misalignment”, while still saying “inner misalignment” for Scenario 2.

I would guess this is not what’s going on. Given the information in Scenario 1, I’d expect most people would find Scenario 2 reasonably likely (i.e. they don’t have priors against it).

FWIW, this was basically my thinking on ... (read more)

1 Thomas Kwa 1mo
A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?
5 Rohin Shah 1mo
Yeah, this makes sense given that you think of outer misalignment as failures of [reward function + training distribution], while inner misalignment is failures of optimization. I'd be pretty surprised though if more than one person in my survey had that view.

Roughly, yeah. I currently view the types of  and  as the "low-level" type signature of abstraction, in some sense to be determined. I expect there are higher-level organizing principles to be found, and those will involve refinement of the types and/or different representations.

The main problem I see with hodge-podge-style strategies is that most alignment ideas fail in roughly-the-same cases, for roughly-the-same reasons. It's the same hard cases/hard subproblems which kill most plans. In particular, section B.2 (and to a lesser extent B.1 - B.3) of List of Lethalities covers "core problems" which strategies usually fail to handle.

In terms of methodology, epistemology, etc, what did you do right/wrong? What advice would you today give to someone who produced something like your old goal-deconfusion work, or what did your previous self really need to hear?

I want to see Adam do a retrospective on his old goal-deconfusion stuff.

2 Adam Shimi 2mo
What are you particularly interested in? I expect I could probably write it with a bit of rereading.

I only skimmed the post, so apologies if you addressed this problem and I missed it.

Problem: even if the AI's utility function is time-bounded, there may still be other agents in the environment whose utility functions are not time-bounded, and those agents will be willing to trade short-term resources/assistance for long-term resources/assistance. So, for instance, the 10-minute laundry-folding robot might still be incentivized to create a child AI which persists for a long time and seizes lots of resources, in order to trade those future resources to some other agent who can help fold the laundry in the next 10 minutes.

That’s true! Thanks for pointing this out; I added a subsection about it to the post. There are probably also a bunch of other cases I haven’t thought of that provide stories for how the environment directly rewards actions that go against the spirit of the shutdown criterion (besides imitation and this one, which I might call “trade”). This construction does nothing to counteract such incentives. Rather, it just avoids the way that being an infinite-horizon RL agent systematically creates new ones.

Sounds closer. Maybe "there's always surprises"? Or "your pre-existing models/tools/frames are always missing something"? Or "there are organizing principles, but you're not going to guess all of them ahead of time"?

In my opinion, whenever you're faced with a question like this, it's always more messy than you think.

I think this is exactly wrong. I think that mainly because I personally went into biology research, twelve years ago, expecting systems to be fundamentally messy and uninterpretable, and it turned out that biological systems are far less messy than I expected.

We've also seen the same, in recent years, with neural nets. Early on, lots of people expected that the sort of interpretable structure found by Chris Olah & co wouldn't exist. And yet, whene... (read more)

And yet, whenever we actually delve into these systems, it turns out that there's a ton of ultimately-relatively-simple internal structure.

I'm not sure exactly what you mean by "ton of ultimately-relatively-simple internal structure".

I'll suppose you mean "a high percentage of what models use parameters for is ultimately simple to humans" (where by simple to humans we mean something like, description length in the prior of human knowledge, e.g., natural language).

If so, this hasn't been my experience doing interp work or from the interp work I've seen (... (read more)

4 Evan Hubinger 2mo
That's fair—perhaps “messy” is the wrong word there. Maybe “it's always weirder than you think”? (Edited the post to “weirder.”)

(Thinking out loud here...) In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment - this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim. So let's say I were somehow approximately-100% convinced that it's basically possible for iterative design to produce an AI. Then I'd expect AI is probably not an X-risk, but I still want to reduce the small remaining chance o... (read more)

Yeah, that's fair. The reason I talked about it that way is that I was trying to give what I consider the strongest/most general argument, i.e. the argument with the fewest assumptions.

What I actually think is that:

  • nearly all the probability mass is on worlds where the iterative design loop fails to align AGI, but...
  • conditional on that being wrong, nearly all the probability mass is on the number of bits of optimization from iterative design resulting from ordinary economic/engineering activity being sufficient to align AGI, i.e. it is very unlikely that adding
... (read more)
2 Richard Ngo 2mo
In general I think it's better to reason in terms of continuous variables like "how helpful is the iterative design loop?" rather than "does it work or does it fail?" My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that conditional on the first being wrong, then the second is not very action-guiding. E.g. conditional on the first, then the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if it's 5% of worlds rather than 50% of worlds.

I think it would be much more interesting and helpful to exhibit a case of software with a vulnerability where it's really hard for someone to verify the claim that the vulnerability exists.

Conditional on such counterexamples existing, I would usually expect to not notice them. Even if someone displayed such a counterexample, it would presumably be quite difficult to verify that it is a counterexample. Therefore a lack of observation of such counterexamples is, at most, very weak evidence against their existence; we are forced to fall back on priors.

I get ... (read more)
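
The "very weak evidence" step above can be made concrete with Bayes' rule. A minimal sketch, assuming purely hypothetical numbers: the prior and both likelihoods below are illustrative choices, not figures from the comment. It shows how little the posterior moves when the observation ("nobody has exhibited such a counterexample") is nearly as likely whether or not counterexamples exist.

```python
def posterior_probability(prior, p_obs_given_h, p_obs_given_not_h):
    """Bayes' rule: P(H | observation) for a binary hypothesis H."""
    numerator = prior * p_obs_given_h
    return numerator / (numerator + (1.0 - prior) * p_obs_given_not_h)


# Hypothetical numbers, chosen only for illustration.
prior = 0.5                # P(hard-to-verify counterexamples exist)
p_silence_if_exist = 0.95  # P(we observe none | they exist): usually unnoticed
p_silence_if_absent = 1.0  # P(we observe none | they do not exist)

posterior = posterior_probability(prior, p_silence_if_exist, p_silence_if_absent)
likelihood_ratio = p_silence_if_exist / p_silence_if_absent

print(f"likelihood ratio: {likelihood_ratio:.3f}")
print(f"prior: {prior:.3f}, posterior: {posterior:.3f}")
```

With a likelihood ratio this close to 1, the posterior stays close to the prior, which is the sense in which a lack of observed counterexamples is weak evidence.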

7 Paul Christiano 2mo
  • You can check whether there are examples where it takes an hour to notice a problem, or 10 hours, or 100 hours... You can check whether there are examples that require lots of expertise to evaluate. And so on. The question isn't whether there is some kind of magical example that is literally impossible to notice, it's whether there are cases where verification is hard relative to generation!
  • You can check whether you can generate examples, or whether other people believe that they can generate examples. The question is about whether a slightly superhuman AI can find examples, not whether they exist (and indeed whether they exist is more unfalsifiable, not because of the difficulty of recognizing them but because of the difficulty of finding them).
  • You can look for examples in domains where the ground truth is available. E.g. we can debate about the existence of bugs or vulnerabilities in software, and then ultimately settle the question by running the code and having someone demonstrate a vulnerability. If Alice claims something is a vulnerability but I can't verify her reasoning, then she can still demonstrate that it was correct by going and attacking the system.
  • I've looked at e.g. some results from the underhanded C competition and they are relatively easy for laypeople to recognize in a short amount of time when the attack is pointed out. I have not seen examples of attacks that are hard to recognize as plausible attacks without significant expertise or time, and I am legitimately interested in them.

I'm bowing out here, you are welcome to the last word.

I've been using "natural abstraction" here as if it just means an abstraction that would be useful for a wide variety of agents to have in their toolbox. But we might also use "natural abstractions" to denote the vital abstractions, those that aren't merely nice to have, but that you literally can't complete certain tasks without using.

The thing I usually have in mind these days is stronger than the first but weaker than the second. Roughly speaking: natural abstractions should be convergent for distributed systems produced by local selection pressures. Tha... (read more)

I don't think the generalization of the OP is quite "sometimes it's easier to create an object with property P than to decide whether a borderline instance satisfies property P". Rather, the halting example suggests that verification is likely to be harder than generation specifically when there is some (possibly implicit) adversary. What makes verification potentially hard is the part where we have to quantify over all possible inputs - the verifier must work for any input.

Borderline cases are an issue for that quantifier, but more generally any sort of a... (read more)

  • If including an error in a paper resulted in a death sentence, no one would be competent to write papers either.
  • For fraud, I agree that "tractable fraud has a meaningful probability of being caught," and not "tractable fraud has a very high probability of being caught." But "meaningful probability of being caught" is just what we need for AI delegation.
  • Verifying that arbitrary software is secure (even if it's actually secure) is much harder than writing secure software. But verifiable and delegatable work is still extremely useful for the process of writin
... (read more)

Yeah, ok, so I am making a substantive claim that the distribution is bimodal. (Or, more accurately, the distribution is wide and work on RLHF only counterfactually matters if we happen to land in a very specific tiny slice somewhere in the middle.) Those "middle worlds" are rare enough to be negligible; it would take a really weird accident for the world to end up such that the iteration cycles provided by ordinary economic/engineering activity would not produce aligned AI, but the extra iteration cycles provided by research into RLHF would produce aligned AI.

4 Richard Ngo 2mo
Upon further thought, I have another hypothesis about why there seems like a gap here. You claim here that the distribution is bimodal, but your previous claim ("I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1") suggests you don't actually think there's significant probability on the lower mode, you essentially think it's unimodal on the "iterative design fails" worlds. I personally disagree with both the "significant probability on both modes, but not in between" hypothesis, and the "unimodal on iterative design fails" hypothesis, but I think that it's important to be clear about which you're defending - e.g. because if you were defending the former, then I'd want to dig into what you thought the first mode would actually look like and whether we could extend it to harder cases, whereas I wouldn't if you were defending the latter.

In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them?

Wrong question. The point is not that RLHF can't be part of a solution, in such worlds. The point is that working on RLHF does not provide any counterfactual improvement to chances of survival, in such worlds.

Iterative design is something which happens automagically, for free, without any alignment researcher having to work on it. Customers see problems in their AI products, and companies are incentivized to fix them; that's iterative... (read more)

3 Richard Ngo 2mo
I think you're just doing the bimodal thing again. Sure, if you condition on worlds in which alignment happens automagically, then it's not valuable to advance the techniques involved. But there's a spectrum of possible difficulty, and in the middle parts there are worlds where RLHF works, but only because we've done a lot of research into it in advance (e.g. exploring things like debate); or where RLHF doesn't work, but finding specific failure cases earlier allowed us to develop better techniques.

The argument is not structurally invalid, because in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF. Working on RLHF does not particularly increase our chances of survival, in the worlds where RLHF doesn't make things worse.

That said, I admit that argument is not very cruxy for me. The cruxy part is that I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1. And I think the various examples/analogies in the post convey my main intuition-sources behind that claim. In particular, the excerpts/claims from Get What You Measure are pretty cruxy.

2 Richard Ngo 2mo
In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them? It seems pretty odd to explain this by quoting someone who thinks that this effect is dramatically less important than you do (i.e. nowhere near causing a ~100% probability of iterative design failing). Not gonna debate this on the object level, just flagging that this is very far from the type of thinking that can justifiably get you anywhere near those levels of confidence.

Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?

When talking about whether some physical system "is a utility maximizer", the key questions are "utility over what variables?", "in what model do those variables live?", and "with respect to what measuring stick?". My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I'm still highly uncertain what that type-signature will look like, but there's a lot of degrees of... (read more)

Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?

Basically no.

I'd like to make a case that Do What I Mean will potentially turn out to be the better target than corrigibility/value learning. ...

I basically buy your argument, though there's still the question of how safe a target DWIM is.

Still on the "figure out agency and train up an aligned AGI unilaterally" path?

"Train up an AGI unilaterally" doesn't quite carve my plans at the joints.

One of the most common ways I see people fail to have any effect at all is to think in terms of "we". They come up with plans which "we" could follow, for some "we" which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in "we" implementing the plan. (And also, usually, the "we" in question is to... (read more)

Any changes to your median timeline until AGI, i.e., do we actually have these 9-14 years?

Here's a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)

My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI things foom relatively quickly.) That comes from an intuitive extrapolation: if something were about as much better as the models of the last 2-3 years, as the models of the last 2-3 yea... (read more)

My own responses to OpenAI's plan:

These are obviously not intended to be a comprehensive catalogue of the problems with OpenAI's plan, but I think they cover the most egregious issues.

3 Arun Jose 2mo
I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here. I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc. I don't think of these as solutions to alignment as much as reducing the space of problems to worry about. I disagree with OpenAI's approach because it views these as solutions in themselves, instead of as simplified problems.

I'd add that everything in this post is still relevant even if the AGI in question isn't explicitly modelling itself as in a simulation, attempting to deceive human operators, etc. The more-general takeaway of the argument is that certain kinds of distribution shift will occur between training and deployment - e.g. a shift to a "large reality", a universe which embeds the AI and has simple physics, etc. Those distribution shifts potentially make training behavior a bad proxy for deployment behavior, even in the absence of explicit malign intent of the AI toward its operators.

One subtlety which I'd expect is relevant here: when two singular vectors have approximately the same singular value, the two vectors are very numerically unstable (within their span).

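Suppose that two singular vectors have the same singular value. Then in the SVD, we have two terms of the form

$$u_1 \sigma v_1^T + u_2 \sigma v_2^T$$

(where the $u$'s and $v$'s are column vectors). That middle part is just the shared singular value $\sigma$ times a 2x2 identity matrix:

$$\begin{bmatrix} u_1 & u_2 \end{bmatrix} (\sigma I_2) \begin{bmatrix} v_1 & v_2 \end{bmatrix}^T$$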

But the 2x2 identity matrix can be rewritten as a 2x2 rotation  ... (read more)
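
The non-uniqueness described above can be checked numerically. A minimal NumPy sketch, assuming a hypothetical 5x5 matrix constructed to have two equal singular values (the matrix, the angle `theta`, and all variable names are illustrative, not from the original comment): rotating both singular-vector pairs by the same 2x2 rotation changes the vectors but leaves the matrix unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build orthogonal U and V from QR factorizations of random matrices.
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))

# Singular values with a repeated entry at indices 1 and 2.
s = np.array([3.0, 2.0, 2.0, 1.0, 0.5])
A = U @ np.diag(s) @ V.T

# An arbitrary 2x2 rotation applied within the span of the tied pair.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

U2, V2 = U.copy(), V.copy()
U2[:, 1:3] = U[:, 1:3] @ R
V2[:, 1:3] = V[:, 1:3] @ R

A2 = U2 @ np.diag(s) @ V2.T

same_matrix = np.allclose(A, A2)                       # same product
vectors_changed = not np.allclose(U[:, 1:3], U2[:, 1:3])  # different vectors
still_orthonormal = np.allclose(U2.T @ U2, np.eye(5))

print(same_matrix, vectors_changed, still_orthonormal)
```

Both factorizations are equally valid SVDs of the same matrix, so when two singular values are only approximately equal, small perturbations can swing the reported vectors anywhere within their shared span.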

One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don't imply that the plan fails...

I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefe... (read more)

3 Victoria Krakovna 2mo
I would consider goal generalization as a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it would want to get feedback about how to generalize the goals correctly when it encounters ontological shifts.
3 Ramana Kumar 2mo
I agree with you - and yes we ignore this problem by assuming goal-alignment. I think there's a lot riding on the pre-SLT model having "beneficial" goals.

I maybe want to stop saying "explicitly thinking about it" (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that "conscious thoughts" have deception in them) and instead say that "the AI system at some point computes some form of 'reason' that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action".

I don't quite agree with that as literally stated; a huge part of intelligence is finding... (read more)

4 Rohin Shah 3mo
Yeah, I don't think it's central (and I agree that heuristics that rule out parts of the search space are very useful and we should expect them to arise).

Two probable cruxes here...

First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a -2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it. (Note: I'm using "intelligence" here to point to something including ability to "actually try" as oppos... (read more)

4 Rohin Shah 3mo
This sounds roughly right to me, but I don't see why this matters to our disagreement? This also sounds plausible to me (though it isn't clear to me how exactly doom happens). For me the relevant question is "could we reasonably hope to notice the bad things by analyzing the AI and extracting its knowledge", and I think the answer is still yes. I maybe want to stop saying "explicitly thinking about it" (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that "conscious thoughts" have deception in them) and instead say that "the AI system at some point computes some form of 'reason' that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action".

For that part, the weaker assumption I usually use is that AI will end up making lots of big and fast (relative to our ability to meaningfully react) changes to the world, running lots of large real-world systems, etc, simply because it's economically profitable to build AI which does those things. (That's kinda the point of AI, after all.)

In a world where most stuff is run by AI (because it's economically profitable to do so), and there's RLHF-style direct incentives for those AIs to deceive humans... well, that's the starting point to the Getting What Yo... (read more)

3 Tom Everitt 3mo
This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause can cause x-risk under the right (i.e. wrong) societal conditions.

I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.

To me, Christiano's Get What You Measure scenario looks much more plausible a priori to be "what happens by default". For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions? Why additionally assume that we need consequentiali... (read more)

6 Rohin Shah 3mo
(Speaking just for myself in this comment, not the other authors) I still feel like the comments on your post are pretty relevant, but to summarize my current position:
  1. AIs that actively think about deceiving us (e.g. to escape human oversight of the compute cluster they are running on) come well before (in capability ordering, not necessarily calendar time) AIs that are free enough from human-imposed constraints and powerful enough in their effects on the world that they can wipe out humanity + achieve their goals without thinking about how to deal with humans.
  2. In situations where there is some meaningful human-imposed constraint (e.g. the AI starts out running on a data center that humans can turn off), if you don't think about deceiving humans at all, you choose plans that ask humans to help you with your undesirable goals, causing them to stop you. So, in these situations, x-risk stories require deception.
  3. It seems kinda unlikely that even the AI free from human-imposed constraints like off switches doesn't think about humans at all. For example, it probably needs to think about other AI systems that might oppose it, including the possibility that humans build such other AI systems (which is best intervened on by ensuring the humans don't build those AI systems).
Responding to this in particular: The least conjunctive story for doom is "doom happens". Obviously this is not very useful. We need more details in order to find solutions. When adding an additional concrete detail, you generally want that detail to (a) capture lots of probability mass and (b) provide some angle of attack for solutions. For (a): based on the points above I'd guess maybe 20:1 odds on "x-risk via misalignment with explicit deception" : "x-risk via misalignment without explicit deception" in our actual world. (Obvio
3 Tom Everitt 3mo
Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won't be systematically deceiving humans to pursue some particular agenda of the agent.  As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.
1 Koen Holtman 3mo
Me too, but note how the analysis leading to the conclusion above is very open about excluding a huge number of failure modes leading to x-risk from consideration first: In this context, I of course have to observe that any human decision, any decision to deploy an AGI agent that uses purely consequentialist planning towards maximising a simple metric, would be a very poor human decision to make indeed. But there are plenty of other poor decisions too that we need to worry about.

That counterargument does at least typecheck, so we're not talking past each other. Yay!

In the context of neurosymbolic methods, I'd phrase my argument like this: in order for the symbols in the symbolic-reasoning parts to robustly mean what we intended them to mean (e.g. standard semantics in the case of natural language), we need to pick the right neural structures to "hook them up to". We can't just train a net to spit out certain symbols given certain inputs and then use those symbols as though they actually correspond to the intended meaning, because ... (read more)

5 David Scott Krueger 3mo
Hmm I feel a bit damned by faint praise here... it seems like more than type-checking, you are agreeing substantively with my points (or at least, I fail to find any substantive disagreement with/in your response). Perhaps the main disagreement is about the definition of interpretability, where it seems like the goalposts are moving... you say (paraphrasing) "interpretability is a necessary step to robustly/correctly grounding symbols". I can interpret that in a few ways:
  1. "interpretability := mechanistic interpretability (as it is currently practiced)": seems false.
  2. "interpretability := understanding symbol grounding well enough to have justified confidence that it is working as expected": also seems false; we could get good grounding without justified confidence, although it is certainly much better to have the justified confidence.
  3. "interpretability := having good symbol grounding": a mere tautology.
A potential substantive disagreement: I think we could get high levels of justified confidence via means that look very different from (what I'd consider any sensible notion of) "interpretability", e.g. via:
  • A principled understanding of how to train or otherwise develop systems that ground symbols in the way we want/expect/etc.
  • Empirical work
  • A combination of either/both of the above with mechanistic interpretability
It's not clear that any of these or their combination will give us as high of levels of justified confidence as we would like, but that's just the nature of the beast (and a good argument for pursuing governance solutions).
A few more points regarding symbol grounding:
  • I think it's not a great framing... I'm struggling to articulate why, but it's maybe something like "There is no clear boundary between symbols and non-symbols"
  • I think the argument I'm making in the original post applies equally well to grounding... There is some difficult work to be done and it is not clear that reverse engi

I think the argument in Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc applies here.

Carrying it over to the car/elephant analogy: we do not have a broken car. Instead, we have two toddlers wearing a car costume and making "vroom" noises. [Edit-To-Add: Actually, a better analogy would be a Flintstones car. It only looks like a car if we hide the humans' legs running underneath.] We have not ever actually built a car or anything even remotely similar to a car; we do not understand the principles of mechanics, thermodynamics or ch... (read more)

3David Scott Krueger3mo
Actually I really don't think it does... the argument there is that: * interpretability is about understanding how concepts are grounded. * symbolic methods don't tell us anything about how their concepts are grounded. This is only tangentially related to the point I'm making in my post, because: * A lot of interpretability is about discovering how concepts are used in a higher-level algorithm, and the argument doesn't apply there. * I am comparing mechanistic interpretability of neural nets with neuro-symbolic methods. * One point of using such methods is to enforce or encourage certain high-level algorithmic properties, e.g. modularity. 
3David Scott Krueger3mo
If this is true, then it makes (mechanistic) interpretability much harder as well, as we'll need our interpretability tools to somehow teach us these underlying principles, as you go on to say.  I don't think this is the primary stated motivation for mechanistic interpretability.  The main stated motivations seem to be roughly "We can figure out if the model is doing bad (e.g. deceptive) stuff and then do one or more of: 1) not deploy it, 2) not build systems that way, 3) train against our operationalization of deception"

Upvoted for content format - I would like to see more people do walkthroughs with their takes on a paper (especially their own), talking about what's under-appreciated, a waste of time, replication expectations, etc.

3Neel Nanda3mo
Thanks! I've been pretty satisfied by just how easy this was - one-shot recording, no prep, something I can do in the evenings when I'm otherwise pretty low energy. Yet making a product that seems good enough to be useful to people (even if it could be much better with more effort). I'm currently doing ones for the toy model paper and induction heads paper, and experimenting with recording myself while I do research. I'd love to see other people doing this kind of thing!

I do want to evoke BFS/DFS/MCTS/A*/etc here, because I want to make the point that those search algorithms themselves do not look like (what I believe to be most people's conception of) babble and prune, and I expect the human search algorithm to differ from babble and prune in many similar ways to those algorithms. (Which makes sense - the way people come up with things like A*, after all, is to think about how a human would solve the problem better and then write an algorithm which does something more like a human.)
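A minimal sketch of that contrast, under assumptions that are not in the original comment: a toy 5x5 grid with no obstacles, where `babble_and_prune` generates whole random candidate paths and filters them afterwards, while `a_star` grows partial paths in order of cost-so-far plus a heuristic estimate. The grid, the function names, and the parameters are all hypothetical and chosen only for illustration.

```python
import heapq
import random

GRID_SIZE = 5                      # hypothetical 5x5 grid, no obstacles
START, GOAL = (0, 0), (4, 4)
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def neighbors(cell):
    """Yield the in-bounds cells adjacent to `cell`."""
    x, y = cell
    for dx, dy in MOVES:
        nx, ny = x + dx, y + dy
        if 0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE:
            yield (nx, ny)


def babble_and_prune(num_candidates, max_len, rng):
    """Generate whole random walks ("babble"), keep those reaching GOAL ("prune").

    Returns the shortest surviving walk, or None if no candidate reached GOAL.
    """
    survivors = []
    for _ in range(num_candidates):
        path = [START]
        for _ in range(max_len):
            path.append(rng.choice(list(neighbors(path[-1]))))
            if path[-1] == GOAL:
                survivors.append(path)
                break
    return min(survivors, key=len) if survivors else None


def manhattan(cell):
    """Admissible heuristic: grid distance from `cell` to GOAL."""
    return abs(cell[0] - GOAL[0]) + abs(cell[1] - GOAL[1])


def a_star():
    """Expand partial paths in order of (cost so far + heuristic estimate)."""
    frontier = [(manhattan(START), 0, [START])]
    best_cost = {START: 0}
    while frontier:
        _, cost, path = heapq.heappop(frontier)
        cell = path[-1]
        if cell == GOAL:
            return path
        for nxt in neighbors(cell):
            new_cost = cost + 1
            # Only keep a partial path if it is the cheapest known way to reach nxt.
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier, (new_cost + manhattan(nxt), new_cost, path + [nxt])
                )
    return None
```

Calling `a_star()` and `babble_and_prune(200, 30, random.Random(0))` side by side shows the structural difference: the first never generates a complete candidate that it later discards, whereas the second does nothing but that.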

2Alex Turner3mo
OK, then I once again feel confused about what this post is arguing as I remember it. (Don't feel the need to explain it as a reply to this comment, I guess I'll just reread if it becomes relevant later.)

To be clear, I don't think the exponential asymptotics specifically are obvious (sorry for implying that), but I also don't think they're all that load-bearing here. I intended more to gesture at the general cluster of reasons to expect "reward for proxy, get an agent which cares about the proxy"; there's lots of different sets of conditions any of which would be sufficient for that result. Maybe we just train the agent for a long time with a wide variety of data. Maybe it turns out that SGD is surprisingly efficient, and usually finds a global optimum, so... (read more)

4Alex Turner3mo
The extremely basic intuition is that all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.  My values are also risk-averse (I'd much rather take a 100% chance of 10% of the lightcone than a 20% chance of 100% of the lightcone), and my best guess is that internal values handshakes are ~linear in "shard strength" after some cutoff where the shards are at all reflectively endorsed (my avoid-spiders shard might not appreciably shape my final reflectively stable values). So more subshards seems like great news to me, all else equal, with more shard variety increasing the probability that part of the system is motivated the way I want it to be.  (This isn't fully expressing my intuition, here, but I figured I'd say at least a little something to your comment right now)  I'm not going to go into most of the rest now, but: 1. I think that that does have to do with shards. Liking to drink coffee is the result of a shard, of a contextual influence on decision-making (the influence to drink coffee), and in particular activates in certain situations to pull me into a future in which I drank coffee.  2. I'm also fine considering "A person who is OK with other people drinking coffee" and anti-C: "a person with otherwise the same values but who isn't OK with other people drinking coffee." I think that the latter would inconvenience the former (to the extent that coffee was important to the former), but that they wouldn't become bitter enemies, that anti-C wouldn't kill the pro-coffee person because the value function was imperfectly aligned, that the pro-coffee person would still derive substantial value from that universe.  3. Possibly the anti-coffee value would even be squashed by the rest of anti-C's values, because the anti-coffee value wasn't reflectively endorsed by the rest of anti-C's values. That's another way in which I think anti-C can

Kudos for writing all that out. Part of the reason I left that comment in the first place was because I thought "it's Turner, if he's actually motivatedly cognitating here he'll notice once it's pointed out". (And, corollary: since you have the skill to notice when you are motivatedly cognitating, I believe you if you say you aren't. For most people, I do not consider their claims about motivatedness of their own cognition to be much evidence one way or the other.) I do have a fairly high opinion of your skills in that department.

For the record: I welcome we... (read more)

I think I have a complaint like "You seem to be comparing to a 'perfect' reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn't make sense.

I think this is close to our most core crux.

It seems to me that there are a bunch of standard arguments which you are ignoring because they're formulated in an old frame that you're trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you've instead thrown th... (read more)

5Alex Turner2mo
I agree that we may need to be quite skillful in providing "good"/carefully considered reward signals on the data distribution actually fed to the AI. (I also think it's possible we have substantial degrees of freedom there.) In this sense, we might need to give "robustly" good feedback. However, one intuition which I hadn't properly communicated was: to make OP's story go well, we don't need e.g. an outer objective which robustly grades every plan or sequence of events the AI could imagine, such that optimizing that objective globally produces good results. This isn't just good reward signals on data distribution (e.g. real vs fake diamonds), this is non-upwards-error reward signals in all AI-imaginable situations, which seems thoroughly doomed to me. And this story avoids at least that problem, which I am relieved by. (And my current guess is that this "robust grading" problem doesn't just reappear elsewhere, although I think there are still a range of other difficult problems remaining. See also my post Alignment allows "nonrobust" decision-influences and doesn't require robust grading.) And so I might have been saying "Hey isn't this cool we can avoid the worst parts of Goodhart by exiting outer/inner as a frame" while thinking of the above intuition (but not communicating it explicitly, because I didn't have that sufficient clarity as yet). But maybe you reacted "??? how does this avoid the need to reliably grade on-distribution situations, it's totally nontrivial to do that and it seems quite probable that we have to." Both seem true to me! (I'm not saying this was the whole of our disagreement, but it seems like a relevant guess.)
6Alex Turner4mo
When I first read this comment, I incorrectly understood it to say something like "If you were actually trying, you'd have generated the exponential error model on your own; the fact that you didn't shows that you aren't properly thinking about old arguments." I now don't think that's what you meant. I think I finally[1] understand what you did mean, and I think you misunderstood what my original comment was trying to say because I wrote poorly and stream-of-consciousness. Most importantly, I wasn’t saying something like “‘errors’ can’t exist because outer/inner alignment isn’t my frame, ignore.” I meant to communicate the following points: 1. I don’t know what a “perfect” reward function is in the absence of outer alignment, else I would know how to solve diamond alignment. But I’m happy to just discuss deviations from a proposed labelling scheme. (This is probably what we were already discussing, so this wasn't meant to be a devastating rejoinder or anything.) 2. I’m not sure what you mean by the “exponential” model you mentioned elsewhere, or why it would be a fatal flaw if true. Please say more? (Hopefully in a way which makes it clear why your argument behaves differently in the presence of errors, because that would be one way to make your arguments especially legible to how I'm currently thinking about the situation.) 3. Given my best guess at your model (the exponential error model), I think your original comment seems too optimistic about my specific story (sure seems like exponential weighting would probably just break it, label errors or no) but too pessimistic about the story template (why is it a fatal flaw that can’t be fixed with a bit of additional t
3Alex Turner4mo
EDIT 2: The original comment was too harsh. I've struck the original below. Here is what I think I should have said: I think you raise a valuable object-level point here, which I haven't yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I'd appreciate if you wouldn't speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts. Warning: This comment, and your previous comment, violate my comment section guidelines: "Reign of terror // Be charitable." You have made and publicly stated a range of unnecessary, unkind, and untrue inferences about my thinking process. You have also made non-obvious-to-me claims of questionable-to-me truth value, which you also treat as exceedingly obvious. Please edit these two comments to conform to my civility guidelines. (EDIT: Thanks. I look forward to resuming object-level discussion!)

Let me exaggerate the kind of "error rates" I think you're anticipating:

  • Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems. 
    • What's supposed to go wrong? The agent somewhat more strongly steers towards cut gems? 
  • Suppose I'm grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What's supposed to happen next? 

(If these errors aren't representative, can you please provide a concrete and plausible scenario?)

Both of these examples are focused on one error type:... (read more)

6Alex Turner4mo
I want to talk about several points related to this topic. I don't mean to claim that you were making points directly related to all of the below bullet points. This just seems like a good time to look back and assess and see what's going on for me internally, here. This seems like the obvious spot to leave the analysis. * At the time of writing, I wasn't particularly worried about the errors you brought up.  * I am a little more worried now in expectation, both under the currently low-credence worlds where I end up agreeing with your exponential argument, and in the ~linear hypothesis worlds, since I think I can still search harder for worrying examples which IMO neither of us have yet proposed. Therefore I'll just get a little more pessimistic immediately, in the latter case. * If I had been way more worried about "reward behavior we should have penalized", I would have indeed just been less likely to raise the more worrying failure points, but not super less likely. I do assess myself as flawed, here, but not as that flawed.  * I think the typical outcome would be something like "TurnTrout starts typing a list full of weak flaws, notices a twinge of motivated reasoning, has half a minute of internal struggle and then types out the more worrisome errors, and, after a little more internal conflict, says that John has a good point and that he wants to think about it more."  * I could definitely buy that I wouldn't be that virtuous, though, and that I would need a bit of external nudging to consider the errors, or else a few more days on my own for the issue to get raised to cognitive-housekeeping. After that happened a few times, I'd notice the overall problem and come up with a plan to fix it. * Obviously, I have at this point noticed (at least) my counterfactual mistake in the nearby world where I already agreed with you, and therefore have a plan to fix and remov
4Alex Turner4mo
This doesn't seem dangerous to me. So the agent values both, and there was an event which differentially strengthened the looks-like-diamond shard (assuming the agent could tell the difference at a visual remove, during training), but there are lots of other reward events, many of which won't really involve that shard (like video games where the agent collects diamonds, or text rpgs where the agent quests for lots of diamonds). (I'm not adding these now, I was imagining this kind of curriculum before, to be clear—see the "game" shard.)  So maybe there's a shard with predicates like "would be sensory-perceived by naive people to be a diamond" that gets hit by all of these, but I expect that shard to be relatively gradient starved and relatively complex in the requisite way -> not a very substantial update. Not sure why that's a big problem.  But I'll think more and see if I can't salvage your argument in some form. I found this annoying. 

Why does the ensembling matter?

I could imagine a story where it matters - e.g. if every shard has a veto over plans, and the shards are individually quite intelligent subagents, then the shards bargain and the shard-which-does-what-we-intended has to at least gain over the current world-state (otherwise it would veto). But that's a pretty specific story with a lot of load-bearing assumptions, and in particular requires very intelligent shards. I could maybe make an argument that such bargaining would be selected for even at low capability levels (probably ... (read more)

2Alex Turner2mo
I read this as "the activations and bidding behaviors of the shards will itself be imperfect, so you get the usual 'Goodhart' problem where highly rated plans are systematically bad and not what you wanted." I disagree with the conclusion, at least for many kinds of "imperfections." Below is one shot at instantiating the failure mode you're describing. I wrote this story so as to (hopefully) contain the relevant elements. This isn't meant as a "slam dunk case closed", but hopefully something which helps you understand how I'm thinking about the issue and why I don't anticipate "and then the shards get Goodharted." Then this shard can be "goodharted" by actions which involve the creation of these bacteria diamonds at that time. There's a question, though, of whether the AI will actually consider these plans (so that it then actually bids on this plan, which is rated spuriously highly from our perspective). The AI knows, abstractly, that considering this plan would lead it to bid for that plan. But it seems to me like, since generating that plan is reflectively predicted to not lead to diamonds (nor does it activate the specific bidding-behavior edge case the agent abstractly knows about), the agent doesn't pursue that plan. This was one of the main ideas I discussed in Alignment allows "nonrobust" decision-influences and doesn't require robust grading: This suggests "and so what is an 'adversarial input' to the values, then? What intensional rule governs the kinds of high-scoring plans which internal reasoning will decide to not evaluate in full?". I haven't answered that question yet on an intensional basis, but it seems tractable.
4Alex Turner4mo
I think there's something like "why are human values so 'reasonable', such that [TurnTrout inference alert!] someone can like coffee and another person won't and that doesn't mean they would extrapolate into bitter enemies until the end of Time?", and the answer seems like it's gonna be because they don't have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you're near a diamond, ...), such that removing a single subshard does not catastrophically exit the regime of Perfect Value. I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they're from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don't think I've properly communicated my feelings in this comment, but hopefully it's better than nothing)
2Alex Turner4mo
I think this won't happen FWIW. Can you provide a concrete instantiation of this argument? (ETA: struck this part, want to hear your response first to make sure it's engaging with what you had in mind) 1. What about your argument behaves differently in the presence of humans and AI? This is clearly not how shard dynamics work in people, as I understand your argument.  2. We aren't in the prediction regime, insofar as that is supposed to be relevant for your argument. Let's talk about the batch update, and not make analogies to predictions. (Although perhaps I was the one who originally brought it up in OP, I should rewrite that.) 3. Can you give me a concrete example of an "exploiting shard" in this situation which is learnable early on, relative to the actual diamond-shards? The point I am arguing (ETA and I expect Quintin is as well, but maybe not) is that this will be one of the primary shards produced, not that there's a chance it exists at low weight or something. 

The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our "labelling" before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)

I think there's a... (read more)

2Alex Turner4mo
I think I have a complaint like "You seem to be comparing to a 'perfect' reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn't make sense. A good reward schedule will put diamond-aligned cognition in the agent. It seems like, for you to be saying there's a 'fatal' flaw here due to 'errors', you need to make an argument about the cognition which trains into the agent, and how the AI's cognition-formation behaves differently in the presence of 'errors' compared to in the absence of 'errors.' And I don't presently see that story in your comments thus far. I don't understand what 'perfect labeling' is the thing to talk about, here, or why it would ensure your shard-formation counterarguments don't hold." (Will come by for lunch and so we can probably have a higher-context discussion about this! :) )
2Alex Turner4mo
This is already my model and was intended as part of my communicated reasoning. Why do you think it's an error in my reasoning? You'll notice I argued "If diamond", and about hooking that diamond predicate into its approach-subroutines (learned via IL). (ETA: I don't think you need a self-model to approach a diamond, or to "value" that in the appropriate sense. To value diamonds being near you, you can have representations of the space nearby, so you need a nearby representation, perhaps.) I think this is not the right term to use, and I think it might be skewing your analysis. This is not a supervised learning regime with exact gradients towards a fixed label. The question is what gets upweighted by the batch PG gradients, batching over the reward events. Let me exaggerate the kind of "error rates" I think you're anticipating: * Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.  * What's supposed to go wrong? The agent somewhat more strongly steers towards cut gems?  * Suppose I'm grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What's supposed to happen next?  (If these errors aren't representative, can you please provide a concrete and plausible scenario?)
3Lawrence Chan4mo
My impression is that 2 and 4 are relatively cruxy for some people? Especially 2.  IE I've heard from some academics that the "natural" thing to do is to join with the AI ethics crowd/Social Justice crowd and try to get draconian anti tech/anti AI regulations passed. My guess is their inside view beliefs are some combination of: A. Current tech companies are uniquely good at AI research relative to their replacements. IE, even if the US government destroys $10b of current industry RnD spending, and then spends $15b on AI research, this is way less effective at pushing AGI capabilities.  B. Investment in AI research happens in large part due to expectation of outsized profits. Destroy expectation of outsized profits via draconian anti innovation/anti market regulation or just by tacking on massive regulatory burdens (which the US/UK/EU governments are very capable of doing) is enough to curb research interest in this area significantly.  C. There's no real pressure from Chinese AI efforts. IE, delaying current AGI progress in the US/UK by 3 years just actually delays AGI by 3 years. More generally, there aren't other relevant players besides big, well known US/UK labs. (I don't find 2 super plausible myself, so I don't have a great inside view of this. I am trying to understand this view better by talking to said academics. In particular, even if C is true (IE China not an AI threat), the US federal government certainly doesn't believe this and is very hawkish vs China + very invested in throwing money at, or at least not hindering, tech research it believes is necessary for competition.) As for 4, this is a view I hear a lot from EA policy people? e.g. we used to make stupid mistakes, we're definitely not making them now; we used to just all be junior, now we have X and Y high ranking positions; and we did a bunch of experimentation and we figured out what messaging works relatively better. I think 4 would be a crux for me, personally - if our current efforts

Yup, that's a valid argument. Though I'd expect that gradient hacking to the point of controlling the reinforcement on one's own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).

3Alex Turner4mo
(agreed, for the record. I do think the agent can gradient starve the label-shard in story 2, though, without fancy reflective capability.)

I partly buy that, but we can easily adjust the argument about incorrect labels to circumvent that counterargument. It may be that the full label generation process is too "distant"/complex for the AI to learn in early training, but insofar as there are simple patterns to the humans' labelling errors (which of course there usually are, in practice) the AI will still pick up those simple patterns, and shards which exploit those simple patterns will be more reinforced than the intended shard. It's like that example from the RLHF paper where the AI learns to hold a grabber in front of a ball to make it look like it's grabbing the ball.
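A toy illustration of the claim about simple error patterns, with every detail assumed for the sake of the example: a hypothetical labeller who rewards whatever looks like a diamond, and a made-up set of 100 episodes in which appearance and ground truth disagree 10% of the time. The sketch only measures how often each candidate predicate agrees with the reward signal; it says nothing about what SGD would actually learn, which is the point under dispute in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    is_diamond: bool          # ground truth the designers care about
    looks_like_diamond: bool  # what the human labeller can actually see


def human_label(ep):
    """Hypothetical labeller: rewards anything that looks like a diamond."""
    return ep.looks_like_diamond


def intended_predicate(ep):
    """The predicate the designers hope gets reinforced."""
    return ep.is_diamond


def exploit_predicate(ep):
    """A predicate that tracks the labeller's simple error pattern instead."""
    return ep.looks_like_diamond


def agreement(predicate, episodes):
    """Fraction of episodes where `predicate` matches the human's reward."""
    matches = sum(predicate(ep) == human_label(ep) for ep in episodes)
    return matches / len(episodes)


# Made-up data: 90 unambiguous cases, 5 convincing fakes, 5 dull uncut diamonds.
EPISODES = (
    [Episode(True, True)] * 45
    + [Episode(False, False)] * 45
    + [Episode(False, True)] * 5
    + [Episode(True, False)] * 5
)
```

On this data, `agreement(exploit_predicate, EPISODES)` is higher than `agreement(intended_predicate, EPISODES)` purely because the exploit predicate matches the labeller on the 10 mislabelled episodes; the size of that gap is set by the assumed error rate and nothing else.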

3Quintin Pope4mo
I think something like what you're describing does occur, but my view of SGD is that it's more "ensembly" than that. Rather than "the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard", I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is). Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut and shiny diamonds over unpolished diamonds or giant slabs of pure diamond. This is because I expect human labelers to be strongly biased towards the human conception of diamonds as pieces of art, over any configuration of matter with the chemical composition of a diamond.

First, I think most of the individual pieces of this story are basically right, so good job overall. I do think there's at least one fatal flaw and a few probably-smaller issues, though.

The main fatal flaw is this assumption:

Since “IF diamond”-style predicates do in fact “perfectly classify” the positive/negative approach/don’t-approach decision contexts...

This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.

I'm mostly used to thinking about this in the context of alignment with huma... (read more)

6Alex Turner4mo
Not crucial on my model. I'm imagining us watching the agent and seeing whether it approaches an object or not. Those are the "labels." I'm imagining this taking place between 50-1000 times. Before seeing this comment, I edited the post to add: So, probably I shouldn't have written "perfectly", since that isn't actually load-bearing on my model. I think that there's a rather smooth relationship between "how good you are at labelling" and "the strength of desired value you get out" (with a few discontinuities at the low end, where perhaps a sufficiently weak shard ends up non-reflective, or not plugging into the planning API, or nonexistent at all). On that model, I don't really understand the following: The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our "labelling" before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.) I agree that "diamond synthesis" is not directly rewarded, and if we wanted to ensure that happens, we could add that to the curriculum, as you note. But I think it would probably happen anyways, due to the expected-by-me "grabby" nature of the acquire-subshard. (Consider that I think it'd be cool to make dyson swarms, but I've never been rewarded for making dyson swarms.) So maybe the crux here is that I don't yet share your doubt of the acquisition-shard. I think that "are we directly rewarding the behavior which we want the desired shards to exemplify?" is a reasonable heuristic. I think that "What happens if the agent
4Charles Foster4mo
Not the OP but this jumped out at me: This failure mode seems plausible to me, but I can think of a few different plausible sequences of events that might occur, which would lead to different outcomes, at least in the shard lens. Sequence 1: * The agent develops diamond-shard * The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned * The agent exploits the gaps between the diamond-concept and the label-process-concept, which reinforces the label-process-shard within it * The label-process-shard drives the agent to continue exploiting the above gap, eventually (and maybe rapidly) overtaking the diamond-shard * So the agent's values drift away from what we intended. Sequence 2: * The agent develops diamond-shard * The diamond-shard becomes part of the agent's endorsed preferences (the goal-content it foresightedly plans to preserve) * The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned * The agent understands that if it exploited the gaps between the diamond-concept and the label-process-concept, it would be reinforced into developing a label-process-shard that would go against its endorsed preference for diamonds (i.e. its diamond-shard), so it chooses not to exploit that gap, in order to avoid value drift. * So the agent continues to value diamonds in spite of the imperfect labeling process These different sequences of events would seem to lead to different conclusions about whether imperfections in the labeling process are fatal.
6Quintin Pope4mo
I don't think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like "infer the existence / true nature of distant latent generators that explain your observations" are actually incredibly difficult for neural learning processes (human or AI). Empirically, SGD is perfectly willing to memorize deviations from a simple predictor, rather than generalize to a more complex predictor. Current ML would look very different if inferences like that were easy to make (and science would be much easier for humans). Even when a distant latent generator is inferred, it is usually not the correct generator, and usually just memorizes observations in a slightly more efficient way by reusing current abstractions. E.g., religions which suppose that natural disasters are the result of a displeased, agentic force.

Indeed. Unfortunately, I didn't catch that when skimming.

One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents?), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out.
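The veto argument can be sketched as a simple filter. This is a minimal illustration under assumed numbers, not a model of how shards actually bargain: each shard assigns a utility to every outcome, and a plan survives only if no shard is made strictly worse off than the status quo. The shard names, plan names, and utilities are all hypothetical.

```python
from typing import Dict, List

Utilities = Dict[str, float]  # shard name -> utility that shard assigns to an outcome


def vetoed_by(plan: Utilities, status_quo: Utilities) -> List[str]:
    """Return the shards that would veto `plan` (those made strictly worse off)."""
    return [shard for shard, baseline in status_quo.items() if plan[shard] < baseline]


def surviving_plans(plans: Dict[str, Utilities], status_quo: Utilities) -> List[str]:
    """Names of the plans that no shard vetoes."""
    return [name for name, plan in plans.items() if not vetoed_by(plan, status_quo)]


# Hypothetical numbers: one shard that tracks what we want, one that tracks a proxy.
STATUS_QUO = {"aligned": 1.0, "proxy": 1.0}
PLANS = {
    "mild_improvement": {"aligned": 1.2, "proxy": 1.1},
    "proxy_takeover": {"aligned": 0.1, "proxy": 9.0},
    "aligned_only": {"aligned": 5.0, "proxy": 0.5},
}
```

Under these numbers `surviving_plans(PLANS, STATUS_QUO)` keeps only the plan that both shards weakly prefer to the status quo. The floor this provides is a floor on the `"aligned"` shard's own utility numbers, so it is only as meaningful as those numbers are.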

Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards w... (read more)

4Alex Turner4mo
Even on the view you advocate here (where some kind of perfection is required), "perfectly align part of the motivations" seems substantially easier than "perfectly align all of the AI's optimization so it isn't optimizing for anything you don't want." I feel significantly less confident about this, and am still working out the degree to which Goodhart seems hard, and in what contours, on my current view.