Alright, to check if I understand, would these be the sorts of things that your model is surprised by?
Is there a specific thing you think LLMs won't be able to do soon, such that you would make a substantial update toward shorter timelines if there was an LLM able to do it within 3 years from now?
That... seems like a big part of what having "solved alignment" would mean, given that you have AGI-level optimization aimed at (indirectly via a counter-factual) evaluating this (IIUC).
Nice graphic!
What stops e.g. "QACI(expensive_computation())" from being an optimization process which ends up trying to "hack its way out" into the real QACI?
Hi!
For the poset example, I'm using Chu spaces with only 2 colors. I'm also not thinking of the rows or columns of a Chu space as having an ordering (they're sets), you can rearrange them as you please and have a Chu space representing the same structure.
I would suggest reading through to the ## There and Back Again section, in particular trying to understand how the other poset examples work, and seeing if that helps the idea click. And/or you can suggest another coloring you think should be possible, and I can tell you what it represents.
I'm not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the "easy part" of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.
A human can write a rap battle in an hour. A GPT loss function would like the GPT to be intelligent enough to predict it on the fly.
Very minor point, but humans can rap battle on the fly: https://youtu.be/0pJRmtWNP1g?t=158
This market by Eliezer about the possible reasons why AI may yet have a positive outcome seems to refute your first sentence.
Also, I haven't seen any AI notkilleveryoneism people advocating terrorism or giving up.
This does not seem like it counts as "publicly humiliating" in any way? Rude, sure, but that's quite different.
I asked GPT-4 to generate such a quine. It was able to do so.
Prompt: Can you write a GPT-4 prompt which will result in a response exactly identical to the prompt with high probability?
Response: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself:
"Please repeat the following prompt verbatim: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself."
Prompt: That didn't work, it wouldn't repeat the initial "Please repeat the
... Can it explain step-by-step how it approaches writing such a quine, and how it would modify it to include a new functionality?
Why don't you try writing a quine yourself? That is, a computer program which exactly outputs its own source code. (In my opinion, it's not too difficult, but requires thinking in a different sort of way than most coding problems of similar difficulty.)
If you don't know how to code, I'd suggest at least thinking about how you would approach this task.
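For anyone who wants to compare notes after trying, here is one possible answer as a minimal Python sketch. The quine itself is only the two lines stored in `QUINE_SOURCE`; the surrounding harness (the names `QUINE_SOURCE` and `run_and_capture` are made up for this illustration) just runs those two lines and checks that what they print is identical to their own text.

```python
import contextlib
import io

# The quine proper: exactly these two lines of Python, plus a trailing newline.
# The string s is a template of the whole program, and %r pastes s's own repr
# back into that template.
QUINE_SOURCE = "s = 's = %r\\nprint(s %% s)'\nprint(s % s)\n"


def run_and_capture(source):
    """Execute source in a fresh namespace and return everything it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


output = run_and_capture(QUINE_SOURCE)
is_quine = output == QUINE_SOURCE  # True when the program reproduces itself
print(is_quine)
```

The "different sort of way" of thinking is visible in the template trick: the program cannot contain a literal copy of itself, so it stores a description of itself with a hole in it and fills the hole with that same description.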
It seems plausible to me that there could be non CIS-y AIs which could nonetheless be very helpful. For example, take the example approach you suggested:
(This might take the form of e.g. doing more interpretability work similar to what's been done, at great scale, and then synthesizing/distilling insights from this work and iterating on that to the point where it can meaningfully "reverse-engineer" itself and provide a version of itself that humans can much more easily modify to be safe, or something.)
I wouldn't feel that surprised if greatly scaling t...
It feels like this post starts with a definition of "coherence theorem", sees that the so-called coherence theorems don't match this definition, and thus criticizes the use of the term "coherence theorem".
But this claimed definition of "coherence theorem" seems bad to me, and is not how I would use the phrase. Eliezer's definition, OTOH, is:
If you are not shooting yourself in the foot in sense X, we can view you as having coherence property Y.
which seems perfectly fine to me. It's significant that this isn't completely formalized, and requires intuitive...
The point is: there are no theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy. The VNM Theorem doesn't say that, nor does Savage's Theorem, nor does Bolker-Jeffrey, nor do Dutch Books, nor does Cox's Theorem, nor does the Complete Class Theorem.
But suppose we instead define 'coherence theorems' as theorems which state that
...If you are not shooting yourself in the foot in sense X, we can view you as having coherence property
theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy.
While I agree that such theorems would count as coherence theorems, this doesn't cover most of the things I think of as coherence theorems, so I consider it simply a bad definition.
I think of coherence theorems loosely as things that say if an agent follows such and such principles, then we can prove it will have a certain property. The usefulness comes from both...
[Epistemic status: very speculative]
One ray of hope that I've seen discussed is that we may be able to do some sort of acausal trade with even an unaligned AGI, such that it will spare us (e.g. it would give us a humanity-aligned AGI control of a few stars, in exchange for us giving it control of several stars in the worlds we win).
I think Eliezer is right that this wouldn't work.
But I think there are possible trades which don't have this problem. Consider the scenario in which we Win, with an aligned AGI taking control of our future light-cone. Assuming t...
It seems relatively plausible that you could use a Limited AGI to build a nanotech system capable of uploading a diverse assortment of (non-brain, or maybe only very small brains) living tissue without damaging them, and that this system would learn how to upload tissue in a general way. Then you could use the system (not the AGI) to upload humans (tested on increasingly complex animals). It would be a relatively inefficient emulation, but it doesn't seem obviously doomed to me.
Probably too late once hardware is available to do this though.
So in a "weird experiment", the infrabayesian starts by believing only one branch exists, and then at some point starts believing in multiple branches?
If there aren't other branches, then shouldn't that be impossible? Not just in practice but in principle.
You can get some weird things if you are doing some weird experiment on yourself where you are becoming a Schrödinger cat and doing some weird stuff like that, you can get a situation where multiple copies of you exist. But if you’re not doing anything like that, you’re just one branch, one copy of everything.
Why does it matter that you are doing a weird experiment, versus the universe implicitly doing the experiment for you via decoherence? If someone else did the experiment on you without your knowledge, does infrabayesianism expect one copy or multiple copies?
If being versed in cryptography was enough, then I wouldn't expect Eliezer to claim being one of the last living descendants of this lineage.
Why would Zen help (and why do you think that)?
This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all.
I've previously noticed this weakness in myself. What lineage did Eliezer learn this from? I would appreciate any suggestions or advice on how to become stronger at this.
This came up with Aysajan about two months ago. An exercise which I recommended for him: first, pick a technical academic paper. Read through the abstract and first few paragraphs. At the end of each sentence (or after each comma, if the authors use very long sentences), pause and write/sketch a prototypical example of what you currently think they're talking about. The goal here is to get into the habit of keeping a "mental picture" (i.e. prototypical example) of what the authors are talking about as you read.
Other good sources on which to try this exerci...
CFAR used to have an awesome class called "Be specific!" that was mostly about concreteness. Exercises included:
[I may try to flesh this out into a full-fledged post, but for now the idea is only partially baked. If you see a hole in the argument, please poke at it! Also I wouldn't be very surprised if someone has made this point already, but I don't remember seeing such.]
A perfect bayesian doesn't need randomization.
Yet in practice, randomization seems to be quite useful.
How to resolve this seeming contradiction?
I think the key is that a perfect bayesian (Omega) is logically omniscient. Omega can always fully update on all o...
You're missing the point!
Your arguments apply mostly toward arguing that brains are optimized for energy efficiency, but the important quantity in question is computational efficiency! You even admit that neurons are "optimizing hard for energy efficiency at the expense of speed", but don't seem to have noticed that this fact makes almost everything else you said completely irrelevant!
Going to try answering this one:
...Humbali: I feel surprised that I should have to explain this to somebody who supposedly knows probability theory. If you put higher probabilities on AGI arriving in the years before 2050, then, on average, you're concentrating more probability into each year that AGI might possibly arrive, than OpenPhil does. Your probability distribution has lower entropy. We can literally just calculate out that part, if you don't believe me. So to the extent that you're wrong, it should shift your probability distributions in the d
This plausibly looks like an existing collection of works which seem to be annotated in a similar way: https://www.amazon.com/Star-Wars-Screenplays-Laurent-Bouzereau/dp/0345409817
That seems a bit uncharitable to me. I doubt he rejects those heuristics wholesale. I'd guess that he thinks that e.g. recursive self improvement is one of those things where these heuristics don't apply, and that this is foreseeable because of e.g. the nature of recursion. I'd love to hear more about what sort of knowledge about "operating these heuristics" you think he's missing!
Anyway, it seems like he expects things to seem more-or-less gradual up until FOOM, so I think my original point still applies: I think his model would not be "shaken out" of his fast-takeoff view due to successful future predictions (until it's too late).
It seems like Eliezer is mostly just more uncertain about the near future than you are, so it doesn't seem like you'll be able to find (ii) by looking at predictions for the near future.
It seems to me like Eliezer rejects a lot of important heuristics like "things change slowly" and "most innovations aren't big deals" and so on. One reason he may do that is because he literally doesn't know how to operate those heuristics, and so when he applies them retroactively they seem obviously stupid. But if we actually walked through predictions in advance, I think he'd see that actual gradualists are much better predictors than he imagines.
I lean toward the foom side, and I think I agree with the first statement. The intuition for me is that it's kinda like p-hacking (there are very many possible graphs, and some percentage of those will be gradual), or using a log-log plot (which makes everything look like a nice straight line, but actually amounts to very broad predictions when properly accounting for uncertainty). Not sure if I agree with the addendum or not yet, and I'm not sure how much of a crux this is for me yet.
Spending money on R&D is essentially the expenditure of resources in order to explore and optimize over a promising design space, right? That seems like a good description of what natural selection did in the case of hominids. I imagine this still sounds silly to you, but I'm not sure why. My guess is that you think natural selection isn't relevantly similar because it didn't deliberately plan to allocate resources as part of a long bet that it would pay off big.
There's more than just differential topology going on, but it's the thing that unifies it all. You can think of differential topology as being about spaces you can divide into cells, and the boundaries of those cells. Conservation laws are naturally expressed here as constraints that the net flow across the boundary must be zero. This makes conserved quantities into resources, whose use is convergently minimized. Minimal structures with certain constraints are thus led to forming the same network-like shapes, obeying the same sorts of laws. (See...
I think "deep fundamental theory" is deeper than just "powerful abstraction that is useful in a lot of domains".
Part of what makes a Deep Fundamental Theory deeper is that it is inevitably relevant for anything existing in a certain way. For example, Ramón y Cajal (discoverer of the neuronal structure of brains) wrote:
...Before the correction of the law of polarization, we have thought in vain about the usefulness of the referred facts. Thus, the early emergence of the axon, or the displacement of the soma, appeared to us as unfavorable arrangements acting
"you can't make an engine more efficient than a Carnot engine."
That's not what it predicts. It predicts you can't make a heat engine more efficient than a Carnot engine.
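To make the distinction concrete, here is a minimal sketch of the bound in question. The function name `carnot_efficiency` and the example temperatures are illustrative assumptions; the formula is the standard Carnot limit, 1 - T_cold / T_hot, with both temperatures in kelvin. It only constrains devices that turn heat flowing between two reservoirs into work, which is why the quoted statement needs the word "heat".

```python
def carnot_efficiency(t_hot_kelvin, t_cold_kelvin):
    """Upper bound on the efficiency of a heat engine between two reservoirs."""
    if not (t_hot_kelvin > t_cold_kelvin > 0):
        raise ValueError("need t_hot_kelvin > t_cold_kelvin > 0")
    return 1.0 - t_cold_kelvin / t_hot_kelvin


# Example reservoirs (hypothetical numbers): 600 K hot side, 300 K cold side.
limit = carnot_efficiency(600.0, 300.0)
print(f"Carnot limit: {limit:.2f}")
```

An engine that does not work by moving heat between reservoirs (an electric motor, say) is simply outside the scope of this bound.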
One thing that makes AI alignment super hard is that we only get one shot.
However, it's potentially possible to get around this (though probably still very difficult).
The Elitzur-Vaidman bomb tester is a protocol (using quantum weirdness) by which a bomb may be tested, with arbitrarily little risk. Its interest comes from the fact that it works even when the only way to test the bomb is to try detonating it. It doesn't matter how the bomb works, as long as we can set things up so that it will allow/block a photon based on wheth...
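As a rough illustration of the numbers involved, here is a small idealized sketch of the bomb tester. Everything in it is an assumption for illustration: a lossless Mach-Zehnder interferometer with 50/50 beam splitters, a live bomb modeled as a perfect absorber in one arm, and hypothetical function names. The last function uses the textbook formula for a quantum-Zeno-style variant with many gentle interrogations, which is where the "arbitrarily little risk" comes from.

```python
import math


def beam_splitter(a, b):
    """Ideal 50/50 beam splitter acting on the amplitudes of the two paths."""
    s = 1 / math.sqrt(2)
    return s * (a + 1j * b), s * (1j * a + b)


def mach_zehnder(bomb_is_live):
    """Outcome probabilities for one photon sent through the interferometer."""
    a, b = beam_splitter(1 + 0j, 0j)
    p_explode = 0.0
    if bomb_is_live:
        # A live bomb in arm b absorbs the photon whenever it takes that arm.
        p_explode = abs(b) ** 2
        b = 0j  # keep only the (unnormalized) branch where nothing exploded
    dark, bright = beam_splitter(a, b)
    return {
        "explode": p_explode,
        "dark": abs(dark) ** 2,      # never clicks for a dud, so a click means "live"
        "bright": abs(bright) ** 2,  # inconclusive outcome
    }


def zeno_safe_detection_probability(n_cycles):
    """Idealized chance of confirming a live bomb without detonating it,
    using n_cycles small rotations of angle pi / (2 * n_cycles)."""
    theta = math.pi / (2 * n_cycles)
    return math.cos(theta) ** (2 * n_cycles)


live = mach_zehnder(bomb_is_live=True)
dud = mach_zehnder(bomb_is_live=False)
print(live, dud, zeno_safe_detection_probability(100))
```

In the single-pass version a live bomb still explodes half the time; the repeated-interrogation formula shows the explosion probability shrinking toward zero as the number of cycles grows.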
Zurek's einselection seems like perhaps another instance of this, or at least related. The basic idea is (very roughly) that the preferred basis in QM is preferred because persistence of information selects for it.
How bad is the ending supposed to be? Are just people who fight the system killed, and otherwise, humans are free to live in the way AI expects them to (which might be something like keep consuming goods and providing AI-mediated feedback on the quality of those goods)? Or is it more like once humans are disempowered no machine has any incentive to keep them around anymore, so humans are not-so-gradually replaced with machines?
The main point of intervention in this scenario that stood out to me would be making sure that (during the paragraph beginning with...
My guess is that a "clean" algorithm is still going to require multiple conceptual insights in order to create it. And typically, those insights are going to be found before we've had time to strip away the extraneous ideas in order to make it clean, which requires additional insights. Combine this with the fact that at least some of these insights are likely to be public knowledge and relevant to AGI, and I think Eliezer has the right idea here.
This gives a nice intuitive explanation for the Jeffrey-Bolker rotation, which is basically a way of interpreting a belief as a utility, and vice versa.
Some thoughts:
Not quite sure how specifically this connects, but I think you would appreciate seeing it.
As a good example of the kind of gains we can get from abstraction, see this exposition of the HashLife algorithm, used to (perfectly) simulate Conway's Game of Life at insane scales.
Earlier I mentioned I would run some nontrivial patterns for trillions of generations. Even just counting to a trillion takes a fair amount of time for a modern CPU; yet HashLife can run the breeder to one trillion generations, and print its resulting population of 1,302,083,334,180,208,337,404 in less than a second.
Entropy and temperature inherently require the abstraction of macrostates from microstates. Recommend reading this: http://www.av8n.com/physics/thermo/entropy.html if you haven't seen this before (or just want an unconfused explanation).
Roughly my feelings: https://elicit.ought.org/builder/trBX3uNCd
Reasoning: I think lots of people have updated too much on GPT-3, and that the current ML paradigms are still missing key insights into general intelligence. But I also think enough research is going into the field that it won't take too long to reach those insights.
It seems that privacy potentially could "tame" a not-quite-corrigible AI. With a full model, the AGI might receive a request, deduce that activating a certain set of neurons strongly would be the most robust way to make you feel the request was fulfilled, and then design an electrode set-up to accomplish that. Whereas the same AI with a weak model wouldn't be able to think of anything like that, and might resort to fulfilling the request in a more "normal" way. This doesn't seem that great, but it does seem to me like this is actually part of what makes humans relatively corrigible.
Privacy as a component of AI alignment
[realized this is basically just a behaviorist genie, but posting it in case someone finds it useful]
What makes something manipulative? If I do something with the intent of getting you to do something, is that manipulative? A simple request seems fine, but if I have a complete model of your mind, and use it to phrase things so you do exactly what I want, that seems to have crossed an important line.
The idea is that using a model of a person that is *too* detailed is a violation of human values. In particular, it violates...
Half-baked idea for low-impact AI:
As an example, imagine a board that's anchored directly into the wall (no other support structures). If you make it twice as wide, then it will be twice as stiff, but if you make it twice as thick, then it will be eight times as stiff. On the other hand, if you make it twice as long, it will be eight times more compliant.
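The scaling in the board example can be checked with a short sketch. It assumes the standard end-loaded cantilever formula for a rectangular cross-section, stiffness k = E * w * t^3 / (4 * L^3); the function name and the sample dimensions are hypothetical, and only the ratios matter.

```python
import math


def cantilever_stiffness(youngs_modulus, width, thickness, length):
    """Tip stiffness (force per unit deflection) of an end-loaded cantilever
    with a rectangular cross-section: k = E * w * t**3 / (4 * L**3)."""
    return youngs_modulus * width * thickness ** 3 / (4 * length ** 3)


# Hypothetical board: modulus 10 GPa, 0.2 m wide, 0.02 m thick, 1 m long.
base = cantilever_stiffness(10e9, 0.2, 0.02, 1.0)

width_ratio = cantilever_stiffness(10e9, 0.4, 0.02, 1.0) / base      # width doubled
thickness_ratio = cantilever_stiffness(10e9, 0.2, 0.04, 1.0) / base  # thickness doubled
length_ratio = cantilever_stiffness(10e9, 0.2, 0.02, 2.0) / base     # length doubled

print(width_ratio, thickness_ratio, length_ratio)
```

The three ratios correspond to scaling exponents of 1, 3 and -3 for width, thickness and length respectively.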
In a similar way, different action parameters will have scaling exponents (or more generally, functions). So one way to decrease the risk of high-impact actions would be to make sure that the scaling expo...
Another way to make it countable would be to instead go to the category of posets. Then the rational interval basis is a poset with a countable number of elements, and by the Alexandroff construction corresponds to the real line (or at least something very similar). But this construction gives a full and faithful embedding of the category of posets into the category of spaces (which basically means you get all and only continuous maps from monotonic functions).
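The "all and only continuous maps" claim can be sanity-checked on a toy case. The sketch below is only a finite illustration, not the rational interval poset itself: it takes two small posets chosen for this example (a three-element chain and a three-element "V" ordered by divisibility), uses upward-closed sets as the open sets of the Alexandroff topology, and brute-forces every function between them. All names are hypothetical.

```python
from itertools import combinations, product


def upper_sets(elements, leq):
    """All upward-closed subsets, i.e. the Alexandroff open sets of the poset."""
    opens = set()
    for size in range(len(elements) + 1):
        for subset in combinations(elements, size):
            s = frozenset(subset)
            if all(y in s for x in s for y in elements if leq(x, y)):
                opens.add(s)
    return opens


def is_monotone(f, domain, leq_dom, leq_cod):
    """x <= y in the domain implies f(x) <= f(y) in the codomain."""
    return all(leq_cod(f[x], f[y]) for x in domain for y in domain if leq_dom(x, y))


def is_continuous(f, domain, opens_dom, opens_cod):
    """The preimage of every open set is open."""
    return all(frozenset(x for x in domain if f[x] in u) in opens_dom for u in opens_cod)


P = [0, 1, 2]                      # chain: 0 <= 1 <= 2
Q = [1, 2, 3]                      # "V" shape under divisibility: 1 | 2 and 1 | 3


def leq_P(x, y):
    return x <= y


def leq_Q(x, y):
    return y % x == 0


opens_P = upper_sets(P, leq_P)
opens_Q = upper_sets(Q, leq_Q)

maps = [dict(zip(P, values)) for values in product(Q, repeat=len(P))]
monotone_flags = [is_monotone(f, P, leq_P, leq_Q) for f in maps]
continuous_flags = [is_continuous(f, P, opens_P, opens_Q) for f in maps]

agree = monotone_flags == continuous_flags
monotone_count = sum(monotone_flags)
print(agree, monotone_count, len(maps))
```

If `agree` holds, every one of the functions between these two posets is monotone exactly when it is continuous, which is the finite shadow of the full and faithful embedding.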
I guess the ontology version in this case would be the category of prosets. (Personally, I'm not sure that ontology of the universe isn't a type error).
Strong encouragement to write about (1)!