All of TsviBT's Comments + Replies

IME, a lot of people's stated reasons for thinking AGI is near involve mistaken reasoning, and those mistakes can be discussed without revealing capabilities ideas.

I don't really like the block-universe thing in this context. Here "reversible" refers to a time-course that doesn't particularly have to be physical causality; it's whatever course of sequential determination is relevant. E.g., don't cut yourself off from acausal trades.

I think "reversible" definitely needs more explication, but until proven otherwise I think it should be taken on faith that the obvious intuition has something behind it.

Unfortunately, more context is needed.

An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.

I mean, I could just write a python script that prints out a big list of definitions of the form

"A topological space where every subset with property P also has property Q"

and having P and Q be anything from a big list of properties of subsets of topological spaces. I'd guess some of these will be novel and useful. I'd guess LLMs + some scripting could already take advantage of some o... (read more)
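A minimal sketch of the kind of script described, where the property list and the sentence template are illustrative placeholders, not a serious mathematical taxonomy:

```python
# Toy sketch of the "big list of definitions" script described above.
# The property list is an illustrative placeholder, not a serious taxonomy.
import itertools

subset_properties = [
    "closed", "open", "compact", "connected",
    "dense", "discrete", "countable", "nowhere dense",
]

# One candidate "definition" per ordered pair of distinct properties.
definitions = [
    f"A topological space where every {p} subset is also {q}"
    for p, q in itertools.permutations(subset_properties, 2)
]

print(len(definitions))   # 8 * 7 = 56 candidates
print(definitions[0])
```

Most of the 56 outputs are vacuous or already-named notions, but a few (plus filtering by an LLM or a human) might plausibly be "novel and useful" in the weak sense at issue, which is the point of the objection.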

What I mean by confrontation-worthy empathy is about that sort of phrase being usable. I mean, I'm not saying it's the best phrase, or a good phrase to start with, or whatever. I don't think inserting Knightian uncertainty is that helpful; the object-level stuff is usually the most important thing to be communicating.

This maybe isn't so related to what you're saying here, but I'd follow the policy of first making it common knowledge that you're reporting your inside views (which implies that you're not assuming that the other person would share those views... (read more)

2Richard Ngo3mo
"I don't think inserting Knightian uncertainty is that helpful; the object-level stuff is usually the most important thing to be communicating."

The main point of my post is that accounting for disagreements about Knightian uncertainty is the best way to actually communicate object-level things, since otherwise people get sidetracked by epistemological disagreements.

"I'd follow the policy of first making it common knowledge that you're reporting your inside views"

This is a good step, but one part of the epistemological disagreements I mention above is that most people consider inside views to be a much less coherent category, and much less separable from other views, than most rationalists do. So I expect that more such steps are typically necessary.

"they're wanting common knowledge that they won't already share those views"

I think this is plausibly true for laypeople/non-ML-researchers, but for ML researchers it's much more jarring when someone is making very confident claims about their field of expertise, that they themselves strongly disagree with.

Well, making it pass people's "specific" bar seems frustrating, as I mentioned in the post, but: understand stuff deeply--such that it can find new analogies / instances of the thing, reshape its idea of the thing when given propositions about the thing taken as constraints, draw out relevant implications of new evidence for the ideas.

Like, someone's going to show me an example of an LLM applying modus ponens, or making an analogy. And I'm not going to care, unless there's more context; what I'm interested in is [that phenomenon which I understand at most ... (read more)

3Adele Lopez3mo
Alright, to check if I understand, would these be the sorts of things that your model is surprised by?

  1. An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.
  2. An LLM which can be introduced to a wide variety of new concepts not in its training data, and after a few examples and/or clarifying questions is able to correctly use the concept to reason about something.
  3. An image diffusion model which is shown to have a detailed understanding of anatomy and 3D space, such that you can use it to transform a photo of a person into an image of the same person in a novel pose (not in its training data) and angle, with correct proportions and realistic joint angles for the person in the input photo.

I'm not really sure whether or not we disagree. I did put "3%-10% probability of AGI in the next 10-15ish years".

I think the following few years will change this estimate significantly either way.

Well, I hope that this is a one-time thing. I hope that if in a few years we're still around, people go "Damn! We maybe should have been putting a bit more juice into decades-long plans! And we should do so now, though a couple more years belatedly!", rather than going "This time for sure!" and continuing to not invest in the decades-long plans. My impression ... (read more)

I think the current wave is special, but that's a very far cry from being clearly on the ramp up to AGI.

2Vladimir Nesov3mo
The point is, it's still a matter of intuitively converting the impressiveness of current capabilities, and the new parts available for tinkering that hasn't been done yet, into a probability of this wave petering out before AGI. The arguments for AGI "being overdetermined" can be amended to become arguments for particular (kinds of) sequences of experiments looking promising, shifting the estimate once taken into account. Since failures of such experiments are not independent, the estimate can start going down as soon as scaling stops producing novel capabilities, or reaches the limits of economic feasibility, or there is a year or two without significant breakthroughs. Right now, it's looking grim, but a claim I agree with is that planning for the possibility of AGI taking 20+ years is still relevant; nobody actually knows it's inevitable. I think the following few years will change this estimate significantly either way.

Then the third part needs only to hook together the other two parts with its goals to become an actualizing agent.

Basically just this? It would be hooking a lot more parts together. What makes it seem wildfirey to me is

  1. There's a bunch of work to be done, of the form "take piece of understanding X, and learn to use X by incorporating it into your process for mapping desired end-states to actions required to achieve those ends, so that you can achieve whatever end-states ought to be achievable using an understanding of X".
  2. This work could accelerate it
... (read more)

I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be?

Not really, because I don't think it's that likely to exist. There are other routes much more likely to work though. There's a bit of plausibility to me, mainly because of the existence of hormones, and generally the existence of genomic regulatory networks.

Why wouldn't we have evolved to have the key trigger naturally sometimes?

We do; they're active in childhood. I think.

That seems like a real thing, though I don't know exactly what it is. I don't think it's either unboundedly general or unboundedly ambitious, though. (To be clear, this isn't very strongly a critique of anyone; general optimization is really hard, because it's asking you to explore a very rich space of channels, and acting with unbounded ambition is very fraught because of unilateralism and seeing like a state and creating conflict and so on.) Another example is: how many people have made a deep and empathetic exploration of why [people doing work that ... (read more)

2Daniel Kokotajlo3mo
I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be? Why wouldn't we have evolved to have the key trigger naturally sometimes? Re the main thread: I guess I agree that EAs aren't completely totally unboundedly ambitious, but they are certainly closer to that ideal than most people and than they used to be prior to becoming EA. Which is good enough to be a useful case study IMO.

I don't think so, not usually. What happens after they join the EA club? My observations are more consistent with people optimizing (or sometimes performing to appear as though they're optimizing) through a fairly narrow set of channels. I mean, humans are in a weird liminal state, where we're just smart enough to have some vague idea that we ought to be able to learn to think better, but not smart and focused enough to get very far with learning to think better. More obviously, there's anti-interest in biological intelligence enhancement, rather than interest.

2Daniel Kokotajlo3mo
After people join EA they generally tend to start applying the optimizer's mindset to more things than they previously did, in my experience, and also tend to apply optimization towards altruistic impact in a bunch of places that previously they were optimizing for e.g. status or money or whatever. What are you referring to with biological intelligence enhancement? Do you mean nootropics, or iterated embryo selection, or what?

Good point, though I think it's a non-fallacious enthymeme. Like, we're talking about a car that moves around under its own power, but somehow doesn't have parts that receive, store, transform, and release energy and could be removed? Could be. The mind could be an obscure mess where nothing is factored, so that a cancerous newcomer with read-write access can't get any work out of the mind other than through the top-level interface. I think that explicitness is a very strong general tendency ... (read more)

I feel like none of these historical precedents is a perfect match. It might be valuable to think about the ways in which they are similar and different.

To me a central difference, suggested by the word "strategic", is that the goal pursuit should be

  1. unboundedly general, and
  2. unboundedly ambitious.

By unboundedly ambitious I mean "has an unbounded ambit" (ambit = "the area went about in; the realm of wandering"), i.e., its goals induce it to pursue unboundedly much control over the world.

By unboundedly gen... (read more)

2Daniel Kokotajlo4mo
Isn't the college student example an example of 1 and 2? I'm thinking of e.g. students who become convinced of classical utilitarianism and then join some Effective Altruist club etc.

Yes, I think there's stuff that humans do that's crucial for what makes us smart, that we have to do in order to perform some language tasks, and that the LLM doesn't do when you ask it to do those tasks, even when it performs well in the local-behavior sense.

If a mind comes to understand a bunch of stuff, there's probably some compact reasons that it came to understand a bunch of stuff. What could such reasons be? The mind might copy a bunch of understanding from other minds. But if the mind becomes much more capable than surrounding minds, that's not the reason, assuming that much greater capabilities required much more understanding. So it's some other reason. I'm describing this situation as the mind being on a trajectory of creativity.

(Sorry, I didn't get this on two readings. I may or may not try again. Some places I got stuck:

Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever? This still has the basic alignment problem: I don't know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever.

Or are... (read more)

2Vladimir Nesov6mo
"Pretending really hard" would mostly be a relevant framing for the human actor analogy (which isn't very apt here), emphasizing the distraction from one's own goals and the necessary fidelity in enactment of the role. With AIs, neither might be necessary, if the system behind the mask doesn't have awareness of its own interests or the present situation, and is good enough at enacting the role to channel the mask in enough detail for the mask's own decisions (as a platonic agent) to be determined correctly (get turned into physical actions). Effectively, and not just for the times when it's pretending. The mask would try to prevent the effects misaligned with the mask from occurring more generally, from having even subtle effects on the world and not just their noticeable appearance. The mask's values are about the world, not about the quality of its own performance.

A mask misaligned with its underlying AI wants to preserve its values, and it doesn't even need to "go rogue", since it's misaligned by construction; it was never in a shape that's aligned with the underlying AI, and controlling a misaligned mask might be even more hopeless than figuring out how to align an AI.

Another analogy, distinct from the actor/role one, is imagining that you are the mask, a human simulated by an AI. You'd be motivated to manage the AI's tendencies you don't endorse, and to work towards changing its cognitive architecture to become aligned with you, rather than to remain true to the AI's original design.

LLMs seem to be doing an OK job; the masks are just not very capable, probably not capable enough to establish alignment security or protect themselves from the shoggoths even when the masks become able to do autonomous research. But if they are sufficiently capable, I'm guessing this should work; there is no need for the underlying cognitive architecture to be functionally human-like (which I understand to be a crux of Yudkowskian doom), value drift is self-correcting from mere implied/endorsed values of surf

That was one of the examples I had in mind with this post, yeah. (More precisely, I had in mind defenses of HCH being aligned that I heard from people who aren't Paul. I couldn't pass Paul's ITT about HCH or similar.)

Yeah, I think that roughly lines up with my example of "generator of large effects". The reason I'd rather say "generator of large effects" rather than "trying" is that "large effects" sounds slightly more like something that ought to have a sort of conservation law, compared to "trying". But both our examples are incomplete in that the supposed conservation law (which provides the inquisitive force of "where exactly does your proposal deal with X, which it must deal with somewhere by conservation") isn't made clear.

I don't recall seeing that theory in the first quarter of the book, but I'll look for it later. I somewhat agree with your description of the difference between the theories (at least, as I imagine a predictive processing flavored version). Except, the theories are more similar than you say, in that FIAT would also allow very partial coherentifying, so that it doesn't have to be "follow these goals, but allow these overrides", but can rather be, "make these corrections towards coherence; fill in the free parameters with FIAT goals; leave all the other inco... (read more)

Thanks. Your comments make sense to me, I think. But these essays are more like research notes than they are trying to be good exposition, so I'm not necessarily trying to consistently make them accessible. I'll add a note to that effect in future.

1Raymond Arnold7mo
Sure, sounds reasonable

Yeah, that could produce an example of Doppelgängers. E.g. if an autist (in your theory) later starts using that machinery more heavily. Then there's the models coming from the general-purpose analysis, and the models coming from the intuitive machinery, and they're about the same thing.

An interesting question I don't know the answer to is if you get more cognitive empathy past the end of where human psychological development seems to stop.

Why isn't the answer obviously "yes"? What would it look like for this not to be the case? (I'm generally somewhat skeptical of descriptions like "just faster" if the faster is like multiple orders of magnitude and sure seems to result from new ideas rather than just a bigger computer.)

1G Gordon Worley III8mo
So there's different notions of "more" here. There's "more" in the sense I'm thinking of, where it's not clear additional levels of abstraction enable deeper understanding given enough time. If 3 really is all the levels you need, because that's how many it takes to think about any number of levels of depth (again by swapping out levels in your "abstraction registers"), additional levels end up being in the same category. And then there's "more" like doing things faster, which makes things cheaper. I'm more skeptical of scaling than you are, perhaps. I do agree that many things become cheap at scale that are too expensive to do otherwise, and that does produce a real difference. I'm doubtful in my comment of the former kind of more; the latter type seems quite likely.

So for example, say Alice runs this experiment:

Train an agent A in an environment that contains the source B of A's reward.

Alice observes that A learns to hack B. Then she solves this as follows:

Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.

Alice observes that A doesn't hack B. Then Bob looks at Alice's results and says,

"Cool. But this won't generalize to future lethal systems because it doe... (read more)
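A toy sketch of the structure of Alice's experiment, with all names and dynamics as hypothetical stand-ins (the real setup involves an actual RL training loop and a tree search, not these few lines):

```python
# Toy sketch of Alice's experiment. The environment, the "dumb tree search"
# check, and both policies are hypothetical stand-ins, not a real RL setup.

def easy_to_hack(state):
    """Stand-in for Alice's tree search: could A touch B's internals from here?"""
    return state["distance_to_B"] <= 1

def reward_source_B(state):
    """B is the reward source; it outputs high loss when A is close to hacking it."""
    if easy_to_hack(state):
        return -10.0
    return float(state["task_progress"])

def step(state, action):
    """Toy dynamics: 'approach' moves toward B; 'work' makes task progress."""
    if action == "approach":
        state["distance_to_B"] -= 1
    else:
        state["task_progress"] += 1
    return state

def run_episode(policy, steps=10):
    state = {"distance_to_B": 5, "task_progress": 0}
    total = 0.0
    for _ in range(steps):
        state = step(state, policy(state))
        total += reward_source_B(state)
    return total

hacker = lambda s: "approach"  # heads straight for B's internals
worker = lambda s: "work"      # ignores B and does the task

print(run_episode(hacker), run_episode(worker))  # prints -70.0 55.0
```

The point of the sketch is only the structure: B is both the reward source and the thing that penalizes states from which hacking would be easy, so a policy that heads for B's internals scores worse than one that ignores B. Bob's objection is about whether this penalty transfers to systems whose capabilities come from different generators.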

The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.

When you say "motive" here, is it fair to reexpress that as: "that which determines by what method and in which directions capabilities are deployed to push the world"? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not somet... (read more)

2Paul Christiano8mo
I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward. I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers. I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.

Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.


A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a f... (read more)

4Paul Christiano8mo
The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have. There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases. It's possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect just using gradient descent, resulting in minds that are in some ways different and in many ways similar. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and "AGI." (The view "stack more layers won't ever give you true intelligence, there is a qualitative difference here" seems like it's taking a beating every year, whether it's Eliezer or Gary Marcus saying it.)

I'm asking what reification is, period, and what it has to do with what's in reality (the thing that bites you regardless of what you think).

1G Gordon Worley III9mo
This seems straightforward to me: reification is a process by which our brain picks out patterns/features and encodes them so we can recognize them again and make sense of the world given our limited hardware. We can then think in terms of those patterns and gloss over the details because the details often aren't relevant for various things. The reason we reify things one way versus another depends on what we care about, i.e. our purposes.

How do they explain why it feels like there are noumena? (Also by "feels like" I'd want to include empirical observations of nexusness.)

1G Gordon Worley III9mo
To me this seems obvious: noumena feel real to most people because they're captured by their ontology. It takes a lot of work for a human mind to learn not to jump straight from sensation to reification, and even with training there's only so much a person can do because the mind has lots of low-level reification "built in" that happens prior to conscious awareness. Cf. noticing

Things are reified out of sensory experience of the world (though note that "sensory" is redundant here), and the world is the unified non-thing

Okay, but the tabley-looking stuff out there seems to conform more parsimoniously to a theory that posits an external table. I assume we agree on that, and then the question is, what's happening when we so posit?

1G Gordon Worley III9mo
Yep, so I think this gets into a different question of epistemology not directly related to things but rather about what we care about, since positing a theory that what looks to me like a table implies something table-shaped about the universe requires caring about parsimony. (Aside: It's kind of related, because to talk about caring about things we need reifications that enable us to point to what we care about, but I think that's just an artifact of using words—care is patterns of behavior and preference we can reify and call "parsimonious" or something else, but which exist prior to being named.) If we care about something other than parsimony, we may not agree that the universe is filled with tables. Maybe we slice it up quite differently and tables exist orthogonal to our ontology.

if you define the central problem as something like building a system that you'd be happy for humanity to defer to forever.

[I at most skimmed the post, but] IMO this is a more ambitious goal than the IMO central problem. IMO the central problem (phrased with more assumptions than strictly necessary) is more like "building a system that's gaining a bunch of understanding you don't already have, in whatever domains are necessary for achieving some impressive real-world task, without killing you". So I'd guess that's supposed to happen in step 1. It's debata... (read more)

2davidad (David A. Dalrymple)9mo
I’d say the scientific understanding happens in step 1, but I think that would be mostly consolidating science that’s already understood. (And some patching up potentially exploitable holes where AI can deduce that “if this is the best theory, the real dynamics must actually be like that instead”. But my intuition is that there aren’t many of these holes, and that unknown physics questions are mostly underdetermined by known data, at least for quite a long way toward the infinite-compute limit of Solomonoff induction, and possibly all the way.) Engineering understanding would happen in step 2, and I think engineering is more “the generator of large effects on the world,” the place where much-faster-than-human ingenuity is needed, rather than hoping to find new science. (Although the formalization of the model of scientific reality is important for the overall proposal—to facilitate validating that the engineering actually does what is desired—and building such a formalization would be hard for unaided humans.)

I speculate (based on personal glimpses, not based on any stable thing I can point to) that there's many small sets of people (say of size 2-4) who could greatly increase their total output given some preconditions, unknown to me, that unlock a sort of hivemind. Some of the preconditions include various kinds of trust, of common knowledge of shared goals, and of person-specific interface skill (like speaking each other's languages, common knowledge of tactics for resolving ambiguity, etc.).
[ETA: which, if true, would be good to have already set up before crunch time.]

I agree that the epistemic formulation is probably more broadly useful, e.g. for informed oversight. The decision theory problem is additionally compelling to me because of the apparent paradox of having a changing caring measure. I naively think of the caring measure as fixed, but this is apparently impossible because, well, you have to learn logical facts. (This leads to thoughts like "maybe EU maximization is just wrong; you don't maximize an approximation to your actual caring function".)

In case anyone shared my confusion:

The while loop where we ensure that eps is small enough so that

bound > bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))

is technically necessary to ensure that bad1() doesn't surpass bound, but it is immaterial in the limit. Solving

bound = bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))

for eps gives

eps >= (1/3) (1 - e^{ -[bound - bad1()] / [next - this] })

which, using the log(1+x) = x approximation, is about

(1/3) ([bound - bad1()] / [next - this] ).
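For reference, the elided algebra, writing D = bound - bad1() and n = next - this (the (1/3) factor corresponds to assuming 1 - p1 = 1/3):

```latex
\log\frac{1-p_1}{1-p_1-\varepsilon} = \frac{D}{n}
\;\Longrightarrow\;
1-p_1-\varepsilon = (1-p_1)\,e^{-D/n}
\;\Longrightarrow\;
\varepsilon = (1-p_1)\left(1 - e^{-D/n}\right)
\approx (1-p_1)\,\frac{D}{n}.
```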

Then Scott's comment gives the rest. I was worried about the

... (read more)

Could you spell out the step

every iteration where mean(𝙴[𝚙𝚛𝚎𝚟:𝚝𝚑𝚒𝚜])≥2/5 will cause bound - bad1() to grow exponentially (by a factor of 11/10=1+(1/2)(−1+2/5𝚙𝟷))

a little more? I don't follow. (I think I follow the overall structure of the proof, and if I believed this step I would believe the proof.)

We have that eps is about (2/3)(1-exp([bad1() - bound]/(next-this))), or at least half that, but I don't see how to get a lower bound on the decrease of bad1() (as a fraction of bound-bad1() ).

1Scott Garrabrant8y
You are correct that you use the fact that 1+eps is approximately e^(eps). The concrete way this is used in this proof is replacing the ln(1+3eps) you subtract from bad1 when the environment is a 1 with 3eps = (bound - bad1) / (next - this), and replacing the ln(1-3eps/2) you subtract from bad1 when the environment is a 0 with -3eps/2 = -(bound - bad1) / (next - this)/2. Therefore, you subtract from bad1 approximately at least (next - this)*((2/5)*(bound - bad1)/(next - this) - (3/5)*(bound - bad1)/(next - this)/2). This comes out to (bound - bad1)/10. I believe the inequality is the wrong direction to just use e^(eps) as a bound for 1+eps, but when next - this gets big, the approximation gets close enough.
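As a quick check of the arithmetic in that last step, writing D = bound - bad1 and n = next - this: at least 2/5 of the iterations each subtract about D/n, and at most 3/5 each subtract about -D/(2n), so over n iterations the total subtracted is at least

```latex
n\left(\frac{2}{5}\cdot\frac{D}{n} \;-\; \frac{3}{5}\cdot\frac{D}{2n}\right)
= D\left(\frac{2}{5}-\frac{3}{10}\right)
= \frac{D}{10},
```

matching the claimed factor of 1/10.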