Did you end up running it through your internal infohazard review and if so what was the result?

1y1

With help from David Manheim, this post has now been turned into a paper. Thanks to everyone who commented on the post!

2y8

Rob, are you able to disclose why people at Open Phil are interested in learning more decision theory? It seems a little far away from the AI strategy reports they've been publishing in recent years, and it also seemed like they were happy to keep funding MIRI (via their Committee for Effective Altruism Support) despite disagreements about the value of HRAD research, so the sudden interest in decision theory is intriguing.

2y3

I was in the chat and don't have anything especially to "disclose". Joe and Nick are both academic philosophers who've studied at Oxford and been at FHI, with a wide range of interests. And Abram and Scott are naturally great people to chat about decision theory with when they're available.

2y4

I was reading parts of *Superintelligence* recently for something unrelated and noticed that Bostrom makes many of the same points as this post:

...If the frontrunner is an AI system, it could have attributes that make it easier for it to expand its capabilities while reducing the rate of diffusion. In human-run organizations, economies of scale are counteracted by bureaucratic inefficiencies and agency problems, including difficulties in keeping trade secrets. These problems would presumably limit the growth of a machine intelligence project so long as it is op

So the existence of this interface implies that A is “weaker” in a sense than A’.

Should that say B instead of A', or have I misunderstood? (I haven't read most of the sequence.)

22y

It should, good catch, thanks!

3y9

Does anyone know how Brian Christian came to be interested in AI alignment and why he decided to write this book instead of a book about a different topic? (I haven't read the book; I looked at the Amazon preview but couldn't find the answer there.)

3y2

HCH is the result of a potentially infinite exponential process (see figure 1) and thereby computationally intractable. In reality, we cannot break down any task into its smallest parts and solve these subtasks one after another because that would take too much computation. This is why we need to iterate distillation and amplification and cannot just amplify.

In general your post talks about amplification (and HCH) as increasing the capability of the system and distillation as saving on computation/making things more efficient. But my understanding, based...

3y1

I still don't understand how corrigibility and intent alignment are different. If neither implies the other (as Paul says in his comment starting with "*I don't really think this is true*"), then there must be examples of AI systems that have one property but not the other. What would a corrigible but not-intent-aligned AI system look like?

I also had the thought that the implicative structure (between corrigibility and intent alignment) seems to depend on how the AI is used, i.e. on the particulars of the user/overseer. For example if you have an intent-alig...

43y

Suppose that I think you know me well and I want you to act autonomously on my
behalf using your best guesses. Then you can be intent aligned without being
corrigible. Indeed, I may even prefer that you be incorrigible, e.g. if I want
your behavior to be predictable to others. If the agent knows that I have such a
preference then it can't be both corrigible and intent aligned.

13y

+1

3y2

IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it's running.

I don't see how the "we can still correct once it’s running" part can be true given this footnote:

...However, I think at some point we will probably have the AI system autonomously execute the distillation and amplification steps or otherwise get outcompeted. And even before that point we might find some other way to train the AI in breaking down tasks that doesn’t involve h

12y

One interpretation is that even if the AI is autonomously executing
distillation/amplification steps, the relevant people are still able to say
"hold on, we need to modify your algorithm" and have the AI actually let us
correct it.

- Rohin Shah's talk gives one taxonomy of approaches to AI alignment: https://youtu.be/AMSKIDEbjLY?t=1643 (During Q&A Rohin also mentions some other stuff)
- This podcast episode also talks about similar things: https://futureoflife.org/2019/04/11/an-overview-of-technical-ai-alignment-with-rohin-shah-part-1/

- Wei Dai's success stories post is another way to organize the various approaches: https://www.lesswrong.com/posts/bnY3L48TtDrKTzGRb/ai-safety-success-stories
- I started trying to organize AI alignment agendas myself a while back, but never got far:

23y

Thanks. Your post specifically is pretty helpful because it helps with one of
the things that was tripping me up, which is what standard names people call
different methods. Your names do a better job of capturing them than mine did.

I'm confused about the tradeoff you're describing. Why is the first bullet point "Generating better ground truth data"? It would make more sense to me if it said instead something like "Generating large amounts of non-ground-truth data". In other words, the thing that amplification seems to be providing is access to more data (even if that data isn't the ground truth that is provided by the original human).

Also in the second bullet point, by "increasing the amount of data that you train on" I think you mean increasing the amount of data from the original h

...43y

By "ground truth" I just mean "the data that the agent is trained on", feel free
to just ignore that part of the phrase.
But it is important that it is better data. The point of amplification is that
Amplify(M) is more competent than M, e.g. it is a better speech writer, it has a
higher ELO rating for chess, etc. This is because Amplify(M) is supposed to
approximate "M thinking for a longer time".
Yes, that's right.
Paul's posts often do talk about this, e.g. An unaligned benchmark
[https://ai-alignment.com/an-unaligned-benchmark-b49ad992940b], and the
competitiveness desideratum in Directions and desiderata for AI alignment
[https://ai-alignment.com/directions-and-desiderata-for-ai-control-b60fca0da8f4].
I agree though that it's hard to realize this since the posts are quite
scattered.
I suspect Paul would say that it is plausibly competitive relative to training a
system using RL with a fixed reward function (because the additional
human-in-the-loop effort could be a small fraction of that, as long as we do
semi-supervised RL well).
However, maybe we'll train systems in some completely different way (e.g. GPT-2
style language models); it's very hard to predict right now how IDA would
compare to that.

The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).

I think this is the crux of my confusion, so I would appreciate if you could elaborate on this. (Everything else in your answer makes sense to me.) In Evans et al., during the distillation step, the model learns to solve the difficult tasks directly by using example solutions from the amplification step. But if c

...33y

You could do this, but it's expensive. In practice, from the perspective of
distillation, there's always a tradeoff between:
* Generating better ground truth data (which you can do by amplifying the agent
that generates the ground truth data)
* Improving the accuracy of the distilled model (which you can do by increasing
the amount of data that you train on, and other ML tricks)
You could get to an Issa-level model using just the second method for long
enough, but it's going to be much more efficient to get to an Issa-level model
by alternating the two methods.
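To make the tradeoff concrete, here is a toy sketch of alternating the two methods (my own illustrative construction, not an actual IDA implementation): the "distilled model" is just a memo table, the base agent stands in for the human, and amplification splits a task in half and delegates the halves to the model.

```python
# Toy sketch of the amplify/distill alternation (illustrative only, not
# Paul Christiano's actual proposal). The task is summing a list; the base
# agent (standing in for the human H) only handles length <= 2 directly.

def base_agent(xs):
    return sum(xs)  # assume H solves small instances exactly

def amplify(model, xs):
    """Amplify(M): break the task in half and delegate the halves to M."""
    if len(xs) <= 2:
        return base_agent(xs)
    mid = len(xs) // 2
    return model(xs[:mid]) + model(xs[mid:])

def train(tasks, rounds):
    table = {}  # the "distilled model" is just a memo table here

    def model(xs):
        if tuple(xs) in table:
            return table[tuple(xs)]
        return base_agent(xs) if len(xs) <= 2 else 0  # untrained guess

    for _ in range(rounds):
        # Distillation step: record the amplified system's answers so the
        # fast model can reproduce them directly in the next round.
        table.update({tuple(xs): amplify(model, xs) for xs in tasks})
    return model

tasks = [[1, 2], [3, 4], [1, 2, 3, 4], [5, 6, 7, 8], list(range(1, 9))]
model = train(tasks, rounds=3)
print(model(list(range(1, 9))))  # 36: the distilled model handles length 8
```

After each round the amplified system can solve tasks one "level" larger than the distilled model could, which is the alternation being described.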

It seems like "agricultural revolution" is used to mean both the beginning of agriculture ("First Agricultural Revolution") and the 18th century agricultural revolution ("Second Agricultural Revolution").

I have only a very vague idea of what you mean. Could you give an example of how one would do this?

Just to make sure I understand, the first few expansions of the second one are:

- f(n)
- f(n+1)
- f((n+1) + 1)
- f(((n+1) + 1) + 1)
- f((((n+1) + 1) + 1) + 1)

Is that right? If so, wouldn't the infinite expansion look like f((((...) + 1) + 1) + 1) instead of what you wrote?

13y

Yes, that's correct. I'd view "f((((...) + 1) + 1) + 1)" as an equivalent way of
writing it as a string (along with the definition of f as f(n) = f(n + 1)).
"...(((((...) + 1) + 1) + 1) + 1)..." just emphasizes that the expression tree
does not have a root - it goes to infinity in both directions. By contrast, the
expression tree for f(n) = f(n) + 1 does have a root; it would expand to
(((((...) + 1) + 1) + 1) + 1).
Does that make sense?
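For concreteness, the expansions listed earlier in the thread can be generated mechanically (a quick illustrative script, nothing more):

```python
# Generating the first few expansions of f(n) = f(n + 1) as strings.

def expand(steps):
    arg = "n"
    for i in range(steps):
        # each expansion step wraps the argument in another "+ 1"
        arg = f"{arg} + 1" if i == 0 else f"({arg}) + 1"
    return f"f({arg})"

for k in range(5):
    print(expand(k))
# f(n)
# f(n + 1)
# f((n + 1) + 1)
# f(((n + 1) + 1) + 1)
# f((((n + 1) + 1) + 1) + 1)
```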

I read the post and parts of the paper. Here is my understanding: conditions similar to those in Theorem 2 above don't exist, because Alex's paper doesn't take an arbitrary utility function and prove instrumental convergence; instead, the idea is to set the rewards for the MDP randomly (by sampling i.i.d. from some distribution) and then show that in most cases, the agent seeks "power" (states which allow the agent to obtain high rewards in the future). So it avoids the twitching robot not by saying that it can't make use of additional resources, but by sa

...23y

That's right; that would prove too much.
Yeah, although note that I proved asymptotic instrumental convergence for
typical functions under iid reward sampling assumptions at each state, so I
think there's wiggle room to say "but the reward functions we provide aren't
drawn from this distribution!". I personally think this doesn't matter much,
because the work still tells us a lot about the underlying optimization
pressures.
The result is also true in the general case of an arbitrary reward function
distribution, you just don't know in advance which terminal states the
distribution prefers.

Can you say more about Alex Turner's formalism? For example, are there conditions in his paper or post similar to the conditions I named for Theorem 2 above? If so, what do they say and where can I find them in the paper or post? If not, how does the paper avoid the twitching robot from seeking convergent instrumental goals?

33y

Sure, I can say more about Alex Turner's formalism! The theorems show that, with
respect to some distribution of reward functions and in the limit of
farsightedness (as the discount rate goes to 1), the optimal policies under this
distribution tend to steer towards parts of the future which give the agent
access to more terminal states.
Of course, there exist reward functions for which twitching or doing nothing is
optimal. The theorems say that most reward functions aren't like this.
I encourage you to read the post and/or paper; it's quite different from the one
you cited in that it shows how instrumental convergence and power-seeking arise
from first principles. Rather than assuming "resources" exist, whatever that
means, resource acquisition is explained as a special case of power-seeking.
ETA: Also, my recently completed sequence focuses on formally explaining and
deeply understanding why catastrophic behavior seems to be incentivized. In
particular, see The Catastrophic Convergence Conjecture
[https://www.lesswrong.com/posts/w6BtMqKRLxG9bNLMr/the-catastrophic-convergence-conjecture].
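To see the flavour of the claim, here is a toy start-state choice (my own construction, not an example from the paper): "left" reaches a single terminal state, while "right" reaches a hub with access to three terminal states. Sampling rewards i.i.d. uniform over terminal states, the optimal policy usually goes to the option-rich side.

```python
# Toy illustration of "optimal policies tend toward states with access to
# more terminal states" under i.i.d. reward sampling (my own construction).
import random

random.seed(0)

def optimal_action(reward):
    # In the farsighted limit, the agent goes wherever its best reachable
    # terminal state lies: "left" reaches only A; "right" reaches B, C, D.
    best_right = max(reward["B"], reward["C"], reward["D"])
    return "left" if reward["A"] > best_right else "right"

trials = 100_000
right = sum(
    optimal_action({s: random.random() for s in "ABCD"}) == "right"
    for _ in range(trials)
)
print(right / trials)  # ~0.75: most sampled rewards favour the hub
```

Of course some sampled rewards make "left" optimal; the point is just that most don't.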

3y4

One additional source that I found helpful to look at is the paper "Formalizing Convergent Instrumental Goals" by Tsvi Benson-Tilsen and Nate Soares, which tries to formalize Omohundro's instrumental convergence idea using math. I read the paper quickly and skipped the proofs, so I might have misunderstood something, but here is my current interpretation.

The key assumptions seem to appear in the statement of Theorem 2; these assumptions state that using additional resources will allow the agent to implement a strategy that gives it strictly higher utility

...23y

Yeah, that upshot sounds pretty reasonable to me. (Though idk if it's reasonable
to think of that as endorsed by "all of MIRI".)
Note that this requires the utility function to be completely indifferent to
humans (or actively against them).

33y

See also Alex Turner's work
[https://www.lesswrong.com/posts/6DuJxY8X45Sco4bS2/seeking-power-is-instrumentally-convergent-in-mdps]
on formalizing instrumentally convergent goals, and his walkthrough
[https://www.lesswrong.com/posts/KXMqckn9avvY4Zo9W/walkthrough-of-formalizing-convergent-instrumental-goals]
of the MIRI paper.

3y4

I'm confused about what it means for a hypothesis to "want" to score better, to change its predictions to get a better score, to print manipulative messages, and so forth. In probability theory each hypothesis is just an event, so is static, cannot perform actions, etc. I'm guessing you have some other formalism in mind but I can't tell what it is.

33y

Yeah, in probability theory you don't have to worry about how everything is
implemented. But for implementations of Bayesian modeling with a rich hypothesis
class, each hypothesis could be something like a blob of code which actually
does a variety of things.
As for "want", sorry for using that without unpacking it. What it specifically
means is that hypotheses like that will have a tendency to get more probability
weight in the system, so if we look at the weighty (and thus influential)
hypotheses, they are more likely to implement strategies which achieve those
ends.

43y

I interpreted it as an ensemble of expert models, weighted in a Bayesian fashion
based on past performance. But because of the diagnostic logs, the type
signature is a little different; the models output both probability
distributions over queries / events and arbitrary text in some designated place.
Then there's a move that I think of as the 'intentional stance move', where you
look at a system that rewards behavior of a particular type (when updating the
weights based on past success, you favor predictions that thought an event was
more likely than its competitors did), and so pretend that the things in the
system "want" to do the behavior that's rewarded. [Like, even in this paragraph,
'reward' is this sort of mental shorthand; it's not like any of the models have
an interior preference to have high weight in the ensemble, it's just that the
ensemble's predictions are eventually more like the predictions of the models
that did things that happened to lead to having higher weight.]
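A minimal sketch of that weighting picture (hypothetical models; the diagnostic-text channel is omitted): multiply each model's weight by the likelihood it assigned the observed outcome, then renormalise, so the model whose predictions fit the stream ends up dominating the ensemble.

```python
# Bayesian weighting of an ensemble by past predictive performance
# (hypothetical models, not from any real system).

def update(weights, probs, outcome):
    """Multiply each model's weight by the likelihood it assigned the
    outcome, then renormalise."""
    likelihoods = [p if outcome else 1 - p for p in probs]
    new = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.5, 0.5]  # two models, equal prior weight
for outcome in [True, True, False, True, True]:
    probs = [0.8 if outcome else 0.2,  # model 0 tracks the event stream well
             0.3]                      # model 1 always predicts 0.3
    weights = update(weights, probs, outcome)

print(weights)  # model 0 now carries almost all the weight
```

There is no "wanting" anywhere in this code; the shift in weights is just the mechanical consequence of the update rule, which is the intentional-stance point above.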

To me, it seems like the two distinctions *are* different. There seem to be three levels to distinguish:

- The reward (in the reinforcement learning sense) or the base objective (example: inclusive genetic fitness for humans)
- A mechanism in the brain that dispenses pleasure or provides a proxy for the reward (example: pleasure in humans)
- The actual goal/utility that the agent ends up pursuing (example: a reflective equilibrium for some human's values, which might have nothing to do with pleasure or inclusive genetic fitness)

The base objective vs mesa-object

...Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:

- time horizon and time discounting: how far in the future is the preference about? More generally, how much weight do we place on the present vs the future?
- act-based ("short") vs goal-based ("long"): using the human's (or more generally, the human-plus-AI-assistants'; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization

4y5

By "short" I mean short in sense (1) and (2). "Short" doesn't imply anything about senses (3), (4), (5), or (6) (and "short" and "long" don't seem like good words to describe those axes, though I'll keep using them in this comment for consistency).

By "preferences-on-reflection" I mean long in sense (3) and neither in sense (6). There is a hypothesis that "humans with AI help" is a reasonable way to capture preferences-on-reflection, but they aren't defined to be the same. I don...

64y

(BTW Paul, if you're reading this, Issa and I and a few others have been
chatting about this on MIRIxDiscord. I'm sure you're more than welcome to join
if you're interested, but I figured you probably don't have time for it. PM me
if you do want an invite.)
Issa, I think my current understanding of what Paul means is roughly the same as
yours, and I also share your confusion about “the user-on-reflection might be
happy with the level of corrigibility, but the user themselves might be
unhappy”.
To summarize my own understanding (quoting myself from the Discord), what Paul
means by "satisfying short-term preferences-on-reflection" seems to cash out as
"do the action for which the AI can produce an explanation such that a
hypothetical human would evaluate it as good (possibly using other AI
assistants), with the evaluation procedure itself being the result of a
hypothetical deliberation which is controlled by the
preferences-for-deliberation that the AI learned/inferred from a real human."
(I still have other confusions around this. For example is the "hypothetical
human" here (the human being predicted in Issa's 3) a hypothetical end user
evaluating the action based on what they themselves want, or is it a
hypothetical overseer evaluating the action based on what the overseer thinks
the end user wants? Or is the "hypothetical human" just a metaphor for some
abstract, distributed, or not recognizably-human deliberative/evaluative process
at this point?)
I think maybe it would make sense to further break (6) down into 2
sub-dimensions: (6a) understandable vs evaluable and (6b) how much AI
assistance. "Understandable" means the human achieves an understanding of the
(outer/main) AI's rationale for action within their own brain, with or without
(other) AI assistance (which can for example answer questions for the human or
give video lectures, etc.). And "evaluable" means the human runs or participates
in a procedure that returns a score for how good the action is, but

Thanks. It looks like all the realistic examples I had of weak HCH are actually examples of strong HCH after all, so I'm looking for some examples of weak HCH to help my understanding. I can see how weak HCH would compute the answer to a "naturally linear recursive" problem (like computing factorials) but how would weak HCH answer a question like "Should I get laser eye surgery?" (to take an example from here). The natural way to decompose a problem like this seems to use branching.
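To illustrate the distinction I have in mind between "naturally linear recursive" and branching decompositions (toy stand-ins, not actual HCH transcripts):

```python
# A "naturally linear recursive" question spawns one subquestion per step,
# while a branching decomposition spawns several.

def linear_factorial(n):
    """Linear chain: each copy asks exactly one subquestion."""
    return 1 if n == 0 else n * linear_factorial(n - 1)

def branching_sum(xs):
    """Branching: each copy splits its question into two subquestions."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return branching_sum(xs[:mid]) + branching_sum(xs[mid:])

print(linear_factorial(5))                      # 120
print(branching_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```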

Also, I just looked again at Alex Zhu's FAQ for Paul's agenda, and Alex's e

...44y

I just looked at both that and the original strong HCH post, and I think Alex's
explanation is right and I was mistaken: both weak and strong HCH use tree
recursion. Edited the original answer to not talk about linear / tree recursion.

Thanks! I found this answer really useful.

I have some follow-up questions that I'm hoping you can answer:

- I didn't realize that weak HCH uses linear recursion. On the original HCH post (which is talking about weak HCH), Paul talks in comments about "branching factor", and Vaniver says things like "So he asks HCH to separately solve A, B, and C". Are Paul/Vaniver talking about strong HCH here, or am I wrong to think that branching implies tree recursion? If Paul/Vaniver are talking about weak HCH, and branching does imply tree recursion, then it seems like

44y

ETA: This paragraph is wrong, see rest of comment thread. I'm pretty sure
they're talking about strong HCH. At the time those comments were written (and
now), HCH basically always referred to strong HCH, and it was rare for people to
talk about weak HCH. (This is mentioned at the top of the strong HCH post
[https://ai-alignment.com/strong-hch-bedb0dc08d4e].)
That's correct.
There are two different meanings of the word "agent" here. First, there's the
BaseAgent, the one which we start with that's already human-level (both HCH and
IDA assume access to such a BaseAgent). Then, there's the PowerfulAgent, which
is the system as a whole, which can be superhuman. With HCH, the entire infinite
tree of BaseAgents together makes up the PowerfulAgent. With IDA / recursive
reward modeling, either the neural net (output of distillation) or the one-level
tree (output of amplification) could be considered the PowerfulAgent. Initially,
the PowerfulAgent is only as capable as the BaseAgent, but as training
progresses, it becomes more capable, reaching the HCH-PowerfulAgent in the
limit.
For all methods, the BaseAgent is human level, and the PowerfulAgent is
(eventually) superhuman.
Yes, my bad. (Technically, I was talking about both the annotated functional
programming and the level of indirection, I think.)
I was restricting attention to imitation-based IDA, which is the canonical
example, and what various
[https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd/p/HqLxuZ4LhaFhmAHWk]
explainer
[https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd/p/DFkGStzvj3jgXibFG]
articles
[https://www.alignmentforum.org/out?url=https%3A%2F%2Farxiv.org%2Fpdf%2F1810.08575.pdf]
are explaining.

4y2

The link no longer works (I get "This project has not yet been moved into the new version of Overleaf. You will need to log in and move it in order to continue working on it.") Would you be willing to re-post it or move it so that it is visible?

5y5

My solution for #3:

Define $g\colon [0,1] \to \mathbb{R}$ by $g(x) = f(x) - x$. We know that $g$ is continuous because $f$ and the identity map both are, and by the limit laws. Since $f$ maps $[0,1]$ into $[0,1]$, we have $g(0) = f(0) \ge 0$ and $g(1) = f(1) - 1 \le 0$, so applying the intermediate value theorem (problem #2) we see that there exists $c \in [0,1]$ such that $g(c) = 0$. But this means $f(c) = c$, so we are done.

Counterexample for the open interval: consider $f\colon (0,1) \to (0,1)$ defined by $f(x) = x/2$. First, we can verify that if $x \in (0,1)$ then $f(x) \in (0, 1/2) \subseteq (0,1)$, so $f$ indeed maps $(0,1)$ to $(0,1)$. To see that there is no fixed point, note that the only solution to $x/2 = x$ in $\mathbb{R}$ is $x = 0$, which is not in $(0,1)$.
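As a numerical sanity check of the proof idea (using f = cos, an arbitrary pick of a continuous map of [0, 1] into itself), bisecting on the sign of f(x) − x homes in on a fixed point:

```python
# Bisect on the sign of g(x) = f(x) - x to locate a fixed point of f,
# mirroring the intermediate-value argument above.
import math

def bisect_fixed_point(f, tol=1e-9):
    lo, hi = 0.0, 1.0  # g(lo) >= 0 and g(hi) <= 0, as in the proof
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) - mid >= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = bisect_fixed_point(math.cos)
print(round(x, 6))  # 0.739085, where cos(x) = x
```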

5y3

Yeah, I did the same thing :)

Putting it right after #2 was highly suggestive - I wonder if this means there's some very different route I would have thought of instead, absent the framing.

5y3

EDIT: I've got another framing that I thought would be more useful for later problems, but I was wrong. I still think there is some value in understanding this proof as well.

In particular, look at this diagram on Wikipedia. It would be better if the whole upper triangle were blue and the whole lower triangle were red instead of just one side (you can arbitrarily decide whether to paint the rest of the diagonal blue or red). If x=0 and x=1 aren't fixed points, then they must be blue and red respectively. If we split [0,1] into n components of size ...

Here is my attempt, based on Hoagy's proof.

Let $n \ge 1$ be an integer. We are given that $f(0) < 0$ and $f(1) > 0$. Now consider the points $0, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}$ in the interval $[0,1]$. By 1-D Sperner's lemma, there are an odd number of $i$ such that $f(\frac{i}{n}) < 0$ and $f(\frac{i+1}{n}) > 0$ (i.e. an odd number of "segments" that begin below zero and end up above zero). In particular, $0$ is an even number, so there must be at least one such number $i$. Choose the smallest such $i$ and call this number $i_n$.

Now consider the sequence $\frac{i_1}{1}, \frac{i_2}{2}, \frac{i_3}{3}, \ldots$. Since this sequence takes values in

I'm having trouble understanding why we can't just fix $n = 2$ in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case).

Also, if we are only given that $f$ is continuous, does it make sense to talk about the gradient?

5y3

Yeah agreed, in fact I don't think you even need to continually bisect; you can just increase n indefinitely. Iterating becomes more dangerous as you move to higher dimensions, because an n-dimensional simplex with n+1 colours that has been coloured according to analogous rules doesn't necessarily contain the point that maps to zero.

On the second point, yes, I'd been assuming that a bounded function has a bounded gradient, which certainly isn't true for, say, sin(x^2). The final step needs more work; I like the way you did it in the proof below.

5y4

"I'm having trouble understanding why we can't just fix n=2 in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case)." - You're right, you can prove this without using the full power of Sperner's lemma. I think it becomes more useful for the multi-dimensional case.

I didn't log the time I spent on the original blog post, and it's kinda hard to assign hours to this since most of the reading and thinking for the post happened while working on the modeling aspects of the MTAIR project. If I count just the time I sat down to write the blog post, I would guess maybe less than 20 hours.

As for the "convert the post to paper" part, I did log that time and it came out to 89 hours, so David's estimate of "perhaps another 100 hours" is fairly accurate.