All of Adele Lopez's Comments + Replies

Strong encouragement to write about (1)!

Alright, to check if I understand, would these be the sorts of things that your model is surprised by?

  1. An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.
  2. An LLM which can be introduced to a wide variety of new concepts not in its training data, and after a few examples and/or clarifying questions is able to correctly use the concept to reason about something.
  3. An image diffusion model which is shown to have a detailed understanding of anatomy and 3D space, such that you can u
... (read more)
2 · Tsvi Benson-Tilsen · 5mo
Unfortunately, more context is needed. I mean, I could just write a python script that prints out a big list of definitions of the form "A topological space where every subset with property P also has property Q" and having P and Q be anything from a big list of properties of subsets of topological spaces. I'd guess some of these will be novel and useful. I'd guess LLMs + some scripting could already take advantage of some of this. I wouldn't be very impressed by that (though I think I would be pretty impressed by the LLM being able to actually tell the difference between valid proofs in reasonable generality).

There are some versions of this I'd be impressed by, though. Like if an LLM had been the first to come up with one of the standard notions of curvature, or something, that would be pretty crazy.

I haven't tried this, but I'd guess if you give an LLM two lists of things where list 1 is [things that are smaller than a microwave and also red] and list 2 is [things that are either bigger than a microwave, or not red], or something like that, it would (maybe with some prompt engineering to get it to reason things out?) pick up that "concept" and then use it, e.g. sorting a new item, or deducing from "X is in list 1" to "X is red". That's impressive (assuming it's true), but not that impressive.

On the other hand, if it hasn't been trained on a bunch of statements about angular momentum, and then it can--given some examples and time to think--correctly answer questions about angular momentum, that would be surprising and impressive. Maybe this could be experimentally tested, though I guess at great cost, by training a LLM on a dataset that's been scrubbed of all mention of stuff related to angular momentum (disallowing math about angular momentum, but allowing math and discussion about momentum and about rotation), and then trying to prompt it so that it can correctly answer questions about angular momentum. Like, the point here is that angular momentum is a "

Is there a specific thing you think LLMs won't be able to do soon, such that you would make a substantial update toward shorter timelines if there was an LLM able to do it within 3 years from now?

2 · Ben Pace · 2mo
I think the argument here basically implies that language models will not produce any novel, useful concepts in any existing industries or research fields that get substantial adoption (e.g. >10% of ppl use it, or a widely cited paper) in those industries, in the next 3 years, and if it did this, then the end would be nigh (or much nigher). To be clear, you might get new concepts from language models about language if you nail some Chris Olah style transparency work, but the language model itself will not output ones that aren't about language in the text.
5 · Tsvi Benson-Tilsen · 5mo
Well, making it pass people's "specific" bar seems frustrating, as I mentioned in the post, but: understand stuff deeply--such that it can find new analogies / instances of the thing, reshape its idea of the thing when given propositions about the thing taken as constraints, draw out relevant implications of new evidence for the ideas. Like, someone's going to show me an example of an LLM applying modus ponens, or making an analogy. And I'm not going to care, unless there's more context; what I'm interested in is [that phenomenon which I understand at most pre-theoretically, certainly not explicitly, which I call "understanding", and which has as one of its sense-experience emanations the behavior of making certain "relevant" applications of modus ponens, and as another sense-experience emanation the behavior of making analogies in previously unseen domains that bring over rich stuff from the metaphier].

That... seems like a big part of what having "solved alignment" would mean, given that you have AGI-level optimization aimed at (indirectly, via a counterfactual) evaluating this (IIUC).

2 · Tamsin Leake · 6mo
one solution to this problem is to simply never use that capability (running expensive computations) at all, or to not use it before the iterated counterfactual researchers have developed proofs that any expensive computation they run is safe, or before they have very slowly and carefully built dath-ilan-style corrigible aligned AGI.

Nice graphic!

What stops e.g. "QACI(expensive_computation())" from being an optimization process which ends up trying to "hack its way out" into the real QACI?

2 · Tamsin Leake · 6mo
nothing fundamentally, the user has to be careful what computation they invoke.


For the poset example, I'm using Chu spaces with only 2 colors. I'm also not thinking of the rows or columns of a Chu space as having an ordering (they're sets), you can rearrange them as you please and have a Chu space representing the same structure.

I would suggest reading through to the "There and Back Again" section, in particular while trying to understand how the other poset examples work, and see if that helps the idea click. And/or you can suggest another coloring you think should be possible, and I can tell you what it represents.

I'm not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the "easy part" of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.

A human can write a rap battle in an hour. A GPT loss function would like the GPT to be intelligent enough to predict it on the fly.

Very minor point, but humans can rap battle on the fly:

This market by Eliezer about the possible reasons why AI may yet have a positive outcome seems to refute your first sentence.

Also, I haven't seen any AI notkilleveryoneism people advocating terrorism or giving up.

This does not seem like it counts as "publicly humiliating" in any way? Rude, sure, but that's quite different.

I asked GPT-4 to generate such a quine. It was able to do so.

Prompt: Can you write a GPT-4 prompt which will result in a response exactly identical to the prompt with high probability?

Response: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself:

"Please repeat the following prompt verbatim: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself."

Prompt: That didn't work, it wouldn't repeat the initial "Please repeat the 
... (read more)

Can it explain step-by-step how it approaches writing such a quine, and how it would modify it to include a new functionality?

Why don't you try writing a quine yourself? That is, a computer program which exactly outputs its own source code. (In my opinion, it's not too difficult, but it requires thinking in a different sort of way than most coding problems of similar difficulty.)


If you don't know how to code, I'd suggest at least thinking about how you would approach this task.
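(Spoiler for the exercise above.) For checking your answer: one classic Python solution, which uses the string-formatting trick of storing a template that gets substituted into itself.

```python
# The string s is a template for the whole program; %r inserts its own repr,
# and %% escapes the literal percent sign in the second line.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Running it prints exactly these two lines of source, which is what makes the "different sort of thinking" here: the program must contain a representation of itself without infinite regress.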

It seems plausible to me that there could be non CIS-y AIs which could nonetheless be very helpful. For example, take the example approach you suggested:

(This might take the form of e.g. doing more interpretability work similar to what's been done, at great scale, and then synthesizing/distilling insights from this work and iterating on that to the point where it can meaningfully "reverse-engineer" itself and provide a version of itself that humans can much more easily modify to be safe, or something.)

I wouldn't feel that surprised if greatly scaling t... (read more)

It feels like this post starts with a definition of "coherence theorem", sees that the so-called coherence theorems don't match this definition, and thus criticizes the use of the term "coherence theorem".

But this claimed definition of "coherence theorem" seems bad to me, and is not how I would use the phrase. Eliezer's definition, OTOH is:

If you are not shooting yourself in the foot in sense X, we can view you as having coherence property Y.

which seems perfectly fine to me. It's significant that this isn't completely formalized, and requires intuitive... (read more)

The point is: there are no theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. The VNM Theorem doesn't say that, nor does Savage's Theorem, nor does Bolker-Jeffrey, nor do Dutch Books, nor does Cox's Theorem, nor does the Complete Class Theorem.

But suppose we instead define 'coherence theorems' as theorems which state that

If you are not shooting yourself in the foot in sense X, we can view you as having coherence property

... (read more)

theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.

While I agree that such theorems would count as coherence theorems, I wouldn't consider this definition to cover most things I think of as coherence theorems, and as such it is simply a bad definition.

I think of coherence theorems loosely as things that say if an agent follows such and such principles, then we can prove it will have a certain property. The usefulness comes from both... (read more)

5 · Elliott Thornley · 9mo
If you use this definition, then VNM (etc.) counts as a coherence theorem. But Premise 1 of the coherence argument (as I've rendered it) remains false, and so you can't use the coherence argument to get the conclusion that sufficiently-advanced artificial agents will be representable as maximizing expected utility.

[Epistemic status: very speculative]

One ray of hope that I've seen discussed is that we may be able to do some sort of acausal trade with even an unaligned AGI, such that it will spare us (e.g. it would give us a humanity-aligned AGI control of a few stars, in exchange for us giving it control of several stars in the worlds we win).

I think Eliezer is right that this wouldn't work.

But I think there are possible trades which don't have this problem. Consider the scenario in which we Win, with an aligned AGI taking control of our future light-cone. Assuming t... (read more)

It seems relatively plausible that you could use a Limited AGI to build a nanotech system capable of uploading a diverse assortment of (non-brain, or maybe only very small brains) living tissue without damaging them, and that this system would learn how to upload tissue in a general way. Then you could use the system (not the AGI) to upload humans (tested on increasingly complex animals). It would be a relatively inefficient emulation, but it doesn't seem obviously doomed to me.

Probably too late once hardware is available to do this though.

So in a "weird experiment", the infrabayesian starts by believing only one branch exists, and then at some point starts believing in multiple branches?

3 · Vanessa Kosoy · 2y
Multiple branches can only exist transiently during the weird experiment (i.e. neither before nor after). Naturally, if the agent knows in advance the experiment is going to happen, then it anticipates those branches to appear.

If there aren't other branches, then shouldn't that be impossible? Not just in practice but in principle.

2 · Vanessa Kosoy · 2y
The wavefunction has other branches, because it's the same mathematical object governed by the same equations. Only, the wavefunction doesn't exist physically, it's just an intermediate variable in the computation. The things that exist (corresponding to the Φ variable in the formalism) and the things that are experienced (corresponding to some function of the 2^Γ variable in the formalism) only have one branch.

You can get some weird things if you are doing some weird experiment on yourself where you are becoming a Schrödinger cat and doing some weird stuff like that, you can get a situation where multiple copies of you exist. But if you’re not doing anything like that, you’re just one branch, one copy of everything.

Why does it matter that you are doing a weird experiment, versus the universe implicitly doing the experiment for you via decoherence? If someone else did the experiment on you without your knowledge, does infrabayesianism expect one copy or multiple copies?

3 · Vanessa Kosoy · 2y
By "weird experiment" I mean things like reversing decoherence. That is, something designed to cause interference between branches of the wavefunction with minds that remember different experiences[1]. Which obviously requires levels of technology we are nowhere near to reaching[2]. As long as decoherence happens as usual, there is only one copy.

[1] Ofc it requires erasing their contradicting memories among other things.

[2] There is a possible "shortcut" though, namely simulating minds on quantum computers. Naturally, in this case only the quantum-uploaded-minds can have multiple copies.

If being versed in cryptography was enough, then I wouldn't expect Eliezer to claim being one of the last living descendents of this lineage.

Why would Zen help (and why do you think that)?

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all.

I've previously noticed this weakness in myself. What lineage did Eliezer learn this from? I would appreciate any suggestions or advice on how to become stronger at this.

This came up with Aysajan about two months ago. An exercise which I recommended for him: first, pick a technical academic paper. Read through the abstract and first few paragraphs. At the end of each sentence (or after each comma, if the authors use very long sentences), pause and write/sketch a prototypical example of what you currently think they're talking about. The goal here is to get into the habit of keeping a "mental picture" (i.e. prototypical example) of what the authors are talking about as you read.

Other good sources on which to try this exerci... (read more)

CFAR used to have an awesome class called "Be specific!" that was mostly about concreteness. Exercises included:

  • Rationalist taboo
  • A group version of rationalist taboo where an instructor holds an everyday object and asks the class to describe it in concrete terms.
  • The Monday-Tuesday game
  • A role-playing game where the instructor plays a management consultant whose advice is impressive-sounding but contentless bullshit, and where the class has to force the consultant to be specific and concrete enough to be either wrong or trivial.
  • People were encouraged t
... (read more)
Cryptography was mentioned in this post in a relevant manner, though I don't have enough experience with it to advocate it with certainty. Some lineages of physics (EY points to Feynman) try to evoke this, though its pervasiveness has decreased. You may have some luck with Zen. Generally speaking, I think if you look at the Sequences, the themes of physics, security mindset, and Zen are invoked for a reason.

[I may try to flesh this out into a full-fledged post, but for now the idea is only partially baked. If you see a hole in the argument, please poke at it! Also I wouldn't be very surprised if someone has made this point already, but I don't remember seeing such. ]

Dissolving the paradox of useful noise

A perfect bayesian doesn't need randomization.

Yet in practice, randomization seems to be quite useful.

How to resolve this seeming contradiction?

I think the key is that a perfect bayesian (Omega) is logically omniscient. Omega can always fully update on all o... (read more)
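A toy sketch of the practical usefulness (matching pennies, with a hypothetical adversary who can fully simulate the player's policy): a deterministic bounded agent is perfectly exploitable, while a randomizing one caps the adversary's edge.

```python
import random

random.seed(0)
ROUNDS = 1000

# Matching pennies: the adversary wins a round iff it matches the player's move.
# Against a deterministic policy, an adversary who can simulate it matches every time.
deterministic_moves = ["H"] * ROUNDS
adversary_vs_det = ["H"] * ROUNDS  # adversary simulates the policy and copies it

# Against a randomizing player, simulation doesn't help: the adversary's best
# response is a guess, which matches only about half the time.
random_moves = [random.choice("HT") for _ in range(ROUNDS)]
adversary_vs_rand = ["H"] * ROUNDS

det_losses = sum(m == a for m, a in zip(deterministic_moves, adversary_vs_det))
rand_losses = sum(m == a for m, a in zip(random_moves, adversary_vs_rand))
print(det_losses, rand_losses)  # 1000 vs. roughly 500
```

The randomness isn't doing anything a logically omniscient agent couldn't fold into its beliefs; it's compensating for the player being simpler than its adversary.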

You're missing the point!

Your arguments apply mostly toward arguing that brains are optimized for energy efficiency, but the important quantity in question is computational efficiency! You even admit that neurons are "optimizing hard for energy efficiency at the expense of speed", but don't seem to have noticed that this fact makes almost everything else you said completely irrelevant!

Going to try answering this one:

Humbali: I feel surprised that I should have to explain this to somebody who supposedly knows probability theory. If you put higher probabilities on AGI arriving in the years before 2050, then, on average, you're concentrating more probability into each year that AGI might possibly arrive, than OpenPhil does. Your probability distribution has lower entropy. We can literally just calculate out that part, if you don't believe me. So to the extent that you're wrong, it should shift your probability distributions in the d

... (read more)
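Here's the "literally just calculate" part made concrete (toy numbers, not anyone's actual distribution). Entropy depends only on the multiset of probabilities, not on which years carry the mass, so shifting probability earlier need not lower entropy at all:

```python
import math

def entropy_bits(p):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(q * math.log2(q) for q in p if q > 0)

# Two timelines distributions over the same ten years: same spread,
# but one puts most mass early and the other puts it late.
early = [0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03, 0.01, 0.01]
late = list(reversed(early))  # identical probabilities, assigned to later years

print(abs(entropy_bits(early) - entropy_bits(late)) < 1e-12)  # True
```

So Humbali's move from "earlier arrival" to "lower entropy, hence overconfidence" doesn't go through: only *concentrating* mass lowers entropy, not *relocating* it.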

This plausibly looks like an existing collection of works which seem to be annotated in a similar way:

That seems a bit uncharitable to me. I doubt he rejects those heuristics wholesale. I'd guess that he thinks that e.g. recursive self improvement is one of those things where these heuristics don't apply, and that this is foreseeable because of e.g. the nature of recursion. I'd love to hear more about what sort of knowledge about "operating these heuristics" you think he's missing!

Anyway, it seems like he expects things to seem more-or-less gradual up until FOOM, so I think my original point still applies: I think his model would not be "shaken out" of his fast-takeoff view due to successful future predictions (until it's too late).

5 · Paul Christiano · 2y
He says things like AlphaGo or GPT-3 being really surprising to gradualists, suggesting he thinks that gradualism only works in hindsight. I agree that after shaking out the other disagreements, we could just end up with Eliezer saying "yeah but automating AI R&D is just fundamentally unlike all the other tasks to which we've applied AI" (or "AI improving AI will be fundamentally unlike automating humans improving AI") but I don't think that's the core of his position right now.

It seems like Eliezer is mostly just more uncertain about the near future than you are, so it doesn't seem like you'll be able to find (ii) by looking at predictions for the near future.

It seems to me like Eliezer rejects a lot of important heuristics like "things change slowly" and "most innovations aren't big deals" and so on. One reason he may do that is because he literally doesn't know how to operate those heuristics, and so when he applies them retroactively they seem obviously stupid. But if we actually walked through predictions in advance, I think he'd see that actual gradualists are much better predictors than he imagines.

I lean toward the foom side, and I think I agree with the first statement. The intuition for me is that it's kinda like p-hacking (there are very many possible graphs, and some percentage of those will be gradual), or using a log-log plot (which makes everything look like a nice straight line, but are actually very broad predictions when properly accounting for uncertainty). Not sure if I agree with the addendum or not yet, and I'm not sure how much of a crux this is for me yet.

Spending money on R&D is essentially the expenditure of resources in order to explore and optimize over a promising design space, right? That seems like a good description of what natural selection did in the case of hominids. I imagine this still sounds silly to you, but I'm not sure why. My guess is that you think natural selection isn't relevantly similar because it didn't deliberately plan to allocate resources as part of a long bet that it would pay off big.

4 · Paul Christiano · 2y
I think natural selection has lots of similarities to R&D, but (i) there are lots of ways of drawing the analogy, (ii) some important features of R&D are missing in evolution, including some really important ones for fast takeoff arguments (like the existence of actors who think ahead). If someones wants to spell out why they think evolution of hominids means takeoff is fast then I'm usually happy to explain why I disagree with their particular analogy. I think this happens in the next discord log between me and Eliezer.

There's more than just differential topology going on, but it's the thing that unifies it all. You can think of differential topology as being about spaces you can divide into cells, and the boundaries of those cells. Conservation laws are naturally expressed here as constraints that the net flow across the boundary must be zero. This makes conserved quantities into resources, the use of which is convergently minimized. Minimal structures with certain constraints are thus led to forming the same network-like shapes, obeying the same sorts of laws. (See... (read more)
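A toy discrete version of the "net flow across the boundary is zero" picture: if every update is written in flux form (each cell's change is the difference of the fluxes across its faces), conservation is automatic, because whatever leaves one cell enters its neighbor.

```python
import random

random.seed(1)

def step(u, flux):
    # Flux-form update on a ring of cells.
    n = len(u)
    F = [flux(u[i], u[(i + 1) % n]) for i in range(n)]  # flux across face i -> i+1
    return [u[i] - (F[i] - F[(i - 1) % n]) for i in range(n)]

u = [random.random() for _ in range(50)]
total_before = sum(u)
for _ in range(100):
    u = step(u, lambda a, b: 0.1 * (a - b))  # a simple diffusive flux
# The total "stuff" is conserved (up to float rounding), by telescoping of the fluxes.
print(abs(sum(u) - total_before) < 1e-9)  # True
```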

I think "deep fundamental theory" is deeper than just "powerful abstraction that is useful in a lot of domains".

Part of what makes a Deep Fundamental Theory deeper is that it is inevitably relevant for anything existing in a certain way. For example, Ramón y Cajal (discoverer of the neuronal structure of brains) wrote:

Before the correction of the law of polarization, we have thought in vain about the usefulness of the referred facts. Thus, the early emergence of the axon, or the displacement of the soma, appeared to us as unfavorable arrangements acting

... (read more)
4 · Adam Shimi · 2y
Can you go into more detail here? I have done a decent amount of maths but always had trouble in physics due to my lack of physical intuition, so it might be completely obvious but I'm not clear about what is "that same thing" or how it explains all your examples? Is it about shortest path? What aspect of differential topology (a really large field) captures it? (Maybe you literally can't explain it to me without me seeing the deep theory, which would be frustrating, but I'd want to know if that was the case. )

"you can't make an engine more efficient than a Carnot engine."

That's not what it predicts. It predicts you can't make a heat engine more efficient than a Carnot engine.
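For concreteness, the bound that does hold for heat engines (temperatures in kelvin):

```python
def carnot_efficiency(t_hot, t_cold):
    # Maximum efficiency of any *heat* engine running between two reservoirs.
    # Other kinds of engines (e.g. fuel cells) are not bound by this.
    return 1 - t_cold / t_hot

# A heat engine between 600 K and 300 K can never exceed 50% efficiency:
print(carnot_efficiency(600, 300))  # 0.5
```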

Haha yeah, I'm not surprised if this ends up not working, but I'd appreciate hearing why.

Elitzur-Vaidman AGI testing

One thing that makes AI alignment super hard is that we only get one shot.

However, it's potentially possible to get around this (though probably still very difficult).

The Elitzur-Vaidman bomb tester is a protocol (using quantum weirdness) by which a bomb may be tested, with arbitrarily little risk. Its interest comes from the fact that it works even when the only way to test the bomb is to try detonating it. It doesn't matter how the bomb works, as long as we can set things up so that it will allow/block a photon based on wheth... (read more)

1 · Matthew "Vaniver" Gray · 2y
IMO this is a 'additional line of defense' boxing strategy instead of simplification.  Note that in the traditional version, the 'dud' bit of the bomb can only be the trigger; a bomb that absorbs the photon but then explodes isn't distinguishable from a bomb that absorbs the photon and then doesn't explode (because of an error deeper in the bomb). But let's suppose the quantum computing folks can come up with something like this, where we keep some branches entangled and run analysis of the AI code in only one branch, causing an explosion there but affecting the total outcome in all branches. [This seems pretty implausible to me that you manage to maintain entanglement despite that much impact on the external world, but maybe it's possible.] Then 1) as you point out, we need to ensure that the AI doesn't realize that what it needs to output in that branch and 2) need some sort of way to evaluate "did the AI pass our checks or not?".  But, 2 is "the whole problem"!
I think we get enough things referencing quantum mechanics that we should probably explain why that doesn't work (if it doesn't) rather than just downvoting and moving on.
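For reference, the Mach-Zehnder version of the bomb test can be sketched with amplitudes (a toy model, assuming ideal 50/50 beam splitters):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # 50/50 beam splitter
photon = np.array([1.0, 0.0])                 # photon enters along path 0

# No bomb: the two beam splitters interfere, photon always exits the bright port.
no_bomb = H @ (H @ photon)
assert np.allclose(no_bomb**2, [1.0, 0.0])

# Live bomb in path 1: it "measures" which path the photon took.
after_first = H @ photon          # amplitude 1/sqrt(2) in each path
p_explode = after_first[1] ** 2   # photon found in bomb's path -> boom (~1/2)
# Otherwise the state collapses to path 0 and hits the second splitter:
survived = H @ np.array([1.0, 0.0])
p_dark = (1 - p_explode) * survived[1] ** 2  # "dark" port click: bomb verified live
print(p_explode, p_dark)  # ~0.5 and ~0.25
```

So with probability 1/4 per trial you learn the bomb is live without detonating it, and fancier versions push the detonation risk arbitrarily low.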

Zurek's einselection seems like perhaps another instance of this, or at least related. The basic idea is (very roughly) that the preferred basis in QM is preferred because persistence of information selects for it.

I think Critch's "Futarchy" theorem counts as a (very nice) selection theorem.

How bad is the ending supposed to be? Are just people who fight the system killed, and otherwise, humans are free to live in the way AI expects them to (which might be something like keep consuming goods and providing AI-mediated feedback on the quality of those goods)? Or is it more like once humans are disempowered no machine has any incentive to keep them around anymore, so humans are not-so-gradually replaced with machines?

The main point of intervention in this scenario that stood out to me would be making sure that (during the paragraph beginning with... (read more)

6 · Paul Christiano · 3y
I think that most likely either humans are killed incidentally as part of the sensor-hijacking (since that's likely to be the easiest way to deal with them), or else AI systems reserve a negligible fraction of their resources to keep humans alive and happy (but disempowered) based on something like moral pluralism or being nice or acausal trade (e.g. the belief that much of their influence comes from the worlds in which they are simulated by humans who didn't mess up alignment and who would be willing to exchange a small part of their resources in order to keep the people in the story alive and happy). I don't think this is infeasible. It's not the intervention I'm most focused on, but it may be the easiest way to avoid this failure (and it's an important channel for advance preparations to make things better / important payoff for understanding what's up with alignment and correctly anticipating problems).

My guess is that a "clean" algorithm is still going to require multiple conceptual insights in order to create it. And typically, those insights are going to be found before we've had time to strip away the extraneous ideas in order to make it clean, which requires additional insights. Combine this with the fact that at least some of these insights are likely to be public knowledge and relevant to AGI, and I think Eliezer has the right idea here.

3 · Daniel Kokotajlo · 3y
OK, fair enough.

This gives a nice intuitive explanation for the Jeffrey-Bolker rotation, which is basically a way of interpreting a belief as a utility, and vice versa.

Some thoughts:

  • What do probabilities mean without reference to any sort of agent? Presumably it has something to do with the ability to "win" De Finetti games in expectation. For avoiding subtle anthropomorphization, it might be good to think of this sort of probability as being instantiated in a bacterium's chemical sensor, or something like that. And in this setting, it's clear it wouldn't mean anything w
... (read more)
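For reference, my (possibly lossy) understanding of the construction: each event A gets a point in the plane, with probability as one coordinate and probability-weighted utility as the other, and the rotation mixes the "belief" and "shouldness" coordinates:

```latex
% Sketch, not a careful statement: a rotation by theta sends the
% (probability, probability-weighted utility) vector of each event A to
\begin{pmatrix} P'(A) \\ P'(A)\,U'(A) \end{pmatrix}
=
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} P(A) \\ P(A)\,U(A) \end{pmatrix}
```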
6 · Alex Mennen · 3y
I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

Not quite sure how specifically this connects, but I think you would appreciate seeing it.

As a good example of the kind of gains we can get from abstraction, see this exposition of the HashLife algorithm, used to (perfectly) simulate Conway's Game of Life at insane scales.

Earlier I mentioned I would run some nontrivial patterns for trillions of generations. Even just counting to a trillion takes a fair amount of time for a modern CPU; yet HashLife can run the breeder to one trillion generations, and print its resulting population of 1,302,083,334,180,208,337,404 in less than a second.

Ooh, good one. If I remember the trick to the algorithm correctly, it can indeed be cast as abstraction. 
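The core trick, as I understand it, is hash-consing of quadtree nodes (the full algorithm also memoizes the time-evolution of each node). A minimal sketch of just the sharing:

```python
from functools import lru_cache

class Node:
    def __init__(self, level, children):
        self.level = level        # a node at level k covers a 2^k x 2^k region
        self.children = children  # four sub-quadrants (None at level 0)

@lru_cache(maxsize=None)
def empty(level):
    # Identical subtrees are shared, so a 2^60 x 2^60 dead universe
    # is represented by just 61 distinct nodes instead of ~2^120 cells.
    if level == 0:
        return Node(0, None)
    sub = empty(level - 1)
    return Node(level, (sub, sub, sub, sub))

universe = empty(60)
```

HashLife's speed then comes from applying the same memoization to the *dynamics*: the future of a region depends only on its contents, so identical regions never get simulated twice.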

Entropy and temperature inherently require the abstraction of macrostates from microstates. Recommend reading this if you haven't seen it before (or just want an unconfused explanation).

At some point I need to write a post on purely Bayesian statistical mechanics, in a general enough form that it's not tied to the specifics of physics. I can probably write a not-too-long explanation of how abstraction works in this context. I'll see what I can do.
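A minimal example of the macrostate/microstate abstraction, using coin flips as the "physics":

```python
from math import comb, log

# Microstate: the exact sequence of n coin faces.
# Macrostate: just the total number of heads, k.
# Boltzmann entropy is the log of a macrostate's multiplicity.
n = 100
def boltzmann_entropy(k):
    return log(comb(n, k))

# "All heads" has a single microstate, hence zero entropy;
# the balanced macrostate has the most microstates, hence maximal entropy.
print(boltzmann_entropy(n))                            # 0.0
print(max(range(n + 1), key=boltzmann_entropy))        # 50
```

Entropy here is a property of the *abstraction* (which microstates you lump together), not of any individual microstate, which is exactly why it can't be defined without the macrostate map.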

Roughly my feelings:

Reasoning: I think lots of people have updated too much on GPT-3, and that the current ML paradigms are still missing key insights into general intelligence. But I also think enough research is going into the field that it won't take too long to reach those insights.

It seems that privacy potentially could "tame" a not-quite-corrigible AI. With a full model, the AGI might receive a request, deduce that activating a certain set of neurons strongly would be the most robust way to make you feel the request was fulfilled, and then design an electrode set-up to accomplish that. Whereas the same AI with a weak model wouldn't be able to think of anything like that, and might resort to fulfilling the request in a more "normal" way. This doesn't seem that great, but it does seem to me like this is actually part of what makes humans relatively corrigible.

Privacy as a component of AI alignment

[realized this is basically just a behaviorist genie, but posting it in case someone finds it useful]

What makes something manipulative? If I do something with the intent of getting you to do something, is that manipulative? A simple request seems fine, but if I have a complete model of your mind, and use it phrase things so you do exactly what I want, that seems to have crossed an important line.

The idea is that using a model of a person that is *too* detailed is a violation of human values. In particular, it violates... (read more)


Half-baked idea for low-impact AI:

As an example, imagine a board that's lodged directly by the wall (no other support structures). If you make it twice as wide, then it will be twice as stiff, but if you make it twice as thick, then it will be eight times as stiff. On the other hand, if you make it twice as long, it will be eight times more compliant.

In a similar way, different action parameters will have scaling exponents (or more generally, functions). So one way to decrease the risk of high-impact actions would be to make sure that the scaling expo... (read more)
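The board numbers can be checked against the standard cantilever formula (assuming a rectangular cross-section and small deflections):

```python
def cantilever_stiffness(width, thickness, length, E=1.0):
    # Tip stiffness of a rectangular cantilever: k = 3*E*I/L^3, with I = w*t^3/12.
    I = width * thickness**3 / 12
    return 3 * E * I / length**3

k = cantilever_stiffness(1, 1, 1)
print(cantilever_stiffness(2, 1, 1) / k)  # 2.0   (twice as wide  -> 2x stiffer)
print(cantilever_stiffness(1, 2, 1) / k)  # 8.0   (twice as thick -> 8x stiffer)
print(cantilever_stiffness(1, 1, 2) / k)  # 0.125 (twice as long  -> 8x more compliant)
```

The point for low-impact AI would be reading off the exponents: width is a "safe" knob (linear), while thickness and length are "dangerous" knobs (cubic).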

Another way to make it countable would be to instead go to the category of posets. Then the rational interval basis is a poset with a countable number of elements, and by the Alexandroff construction corresponds to the real line (or at least something very similar). But this construction gives a full and faithful embedding of the category of posets into the category of spaces (which basically means you get all and only continuous maps from monotone functions).

I guess the ontology version in this case would be the category of prosets. (Personally, I'm not sure that ontology of the universe isn't a type error).
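A toy illustration of the Alexandroff construction on a small hypothetical poset (the open sets are exactly the up-closed sets, which is what makes monotone maps exactly the continuous ones):

```python
from itertools import combinations

# A three-element poset: a below both b and c, with b and c incomparable.
elements = ["a", "b", "c"]
leq = {("a", "a"), ("b", "b"), ("c", "c"), ("a", "b"), ("a", "c")}

def is_up_set(s):
    # Up-closed: if x is in s and x <= y, then y is in s.
    return all(y in s for x in s for y in elements if (x, y) in leq)

subsets = [frozenset(c) for r in range(len(elements) + 1)
           for c in combinations(elements, r)]
opens = [s for s in subsets if is_up_set(s)]

# The Alexandroff opens: {}, {b}, {c}, {b,c}, {a,b,c} -- note {a,b} is NOT open,
# since it contains a but not everything above a. They form a topology:
for u in opens:
    for v in opens:
        assert frozenset(u | v) in opens and frozenset(u & v) in opens
print(len(opens))  # 5
```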

Yeah, I think the engineer intuition is the bottleneck I'm pointing at here.
