All of Mark Xu's Comments + Replies

AMA: Paul Christiano, alignment researcher

How would you teach someone how to get better at the engine game?

2 Paul Christiano 12d: No idea other than playing a bunch of games (might as well play the current version; old dailies are probably best) and maybe looking at solutions when you get stuck. Might also just run through a bunch of games and highlight the main important interactions and themes for each of them, e.g. Innovation + Public Works + Reverberate or Hatchery + Till. I think on any given board (and for the game in general) it's best to work backwards from win conditions, then midgames, and then openings.
1 Neel Nanda 14d: What's the engine game?
AMA: Paul Christiano, alignment researcher

You've written multiple outer alignment failure stories. However, you've also commented that these aren't your best predictions. If you condition on humanity going extinct because of AI, why did it happen?

I think my best guess is kind of like this story, but:

  1. People aren't even really deploying best practices.
  2. ML systems generalize kind of pathologically over long time horizons, and so e.g. long-term predictions don't correctly reflect the probability of systemic collapse.
  3. As a result there's no complicated "take over the sensors" moment; it's just that everything is going totally off the rails and everyone is yelling about it, but it just keeps gradually drifting off the rails.
  4. Maybe the biggest distinction is that e.g. "watchdogs" can actually give pretty good argume
... (read more)
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

I'm curious what "put it in my SuperMemo" means. Quick googling only yielded SuperMemo as a language learning tool.

2 Alex Turner 1mo: It's a spaced repetition system that focuses on incremental reading. It's like Anki, but instead of hosting flashcards separately from your reading, you extract text while reading documents and PDFs. You later refine extracts into ever-smaller chunks of knowledge, at which point you create the "flashcard" (usually 'clozes', demonstrated below). Here's a Wikipedia article I pasted into SuperMemo. Blue bits are the extracts, which it'll remind me to refine into flashcards later. A cloze deletion flashcard. It's easy to make a lot of these. I like them. Incremental reading is nice because you can come back to information over time as you learn more, instead of having to understand enough to make an Anki card right away. In the context of this post, I'm reading some of the papers, making extracts, making flashcards from the extracts, and retaining at least one or two key points from each paper. Way better than retaining 1-2 points from all 70 summaries!
Transparency Trichotomy

I agree it's sort of the same problem under the hood, but I think knowing how you're going to go from "understanding understanding" to producing an understandable model controls what type of understanding you're looking for.

I also agree that this post makes ~0 progress on solving the "hard problem" of transparency, I just think it provides a potentially useful framing and creates a reference for me/others to link to in the future.

Open Problems with Myopia

One way of looking at DDT is "keeping it dumb in various ways." I think another way of thinking about it is just designing a different sort of agent, which is "dumb" according to us but not really dumb in an intrinsic sense. You can imagine this DDT agent looking at agents that do do acausal trade and thinking they're just sacrificing utility for no reason.

There is some slight awkwardness in that the decision problems agents in this universe actually encounter mean that UDT agents will get higher utility than DDT agents.

I agree that the maximum a posteriori world doesn't help that much, but I think there is some sense in which "having uncertainty" might be undesirable.

Open Problems with Myopia

has been changed to imitation, as suggested by Evan.

Open Problems with Myopia

Yeah, you're right that it's obviously unsafe. The words "in theory" were meant to gesture at that, but it could be much better worded. Changed to "A prototypical example is a time-limited myopic approval-maximizing agent. In theory, such an agent has some desirable safety properties because a human would only approve safe actions (although we still would consider it unsafe)."

Open Problems with Myopia

Yep - I switched the setup at some point and forgot to switch this sentence. Thanks.

Defusing AGI Danger

My opposite intuition is suggested by the fact that if you're trying to guess correctly a series of random digits with 80% "1" and 20% "0", then you should always guess "1".
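
The arithmetic behind this intuition can be checked directly. The following is a minimal sketch, assuming the digits are independent draws with an 80% chance of "1"; the function name `expected_accuracy` is made up for illustration. It compares always guessing "1" against probability matching (guessing "1" 80% of the time).

```python
def expected_accuracy(p_one: float, guess_one_rate: float) -> float:
    """Chance of a correct guess when the digit is "1" with probability p_one
    and we independently guess "1" with probability guess_one_rate."""
    return p_one * guess_one_rate + (1 - p_one) * (1 - guess_one_rate)


# Assumed setup: 80% of digits are "1", 20% are "0".
always_one = expected_accuracy(0.8, 1.0)      # put all guesses on "1"
prob_matching = expected_accuracy(0.8, 0.8)   # split guesses 80/20
print(always_one, prob_matching)
```

Always guessing "1" is right about 80% of the time, while splitting guesses 80/20 is right only about 68% of the time, because each correct guess is worth the same regardless of how many guesses already went to that digit.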

I don't quite know how to model cross-pollination and diminishing returns. I think working on both for the information value is likely going to be very good. It seems hard to imagine a scenario where you're robustly confident that one project is 80% better taking diminishing returns into account without being able to create a 3rd project with the best features of both, but if yo... (read more)

2 Daniel Kokotajlo 5mo: Here's a way to model diminishing returns: The first hour of research on strategy X produces as much value as the next two hours, which produces as much value as the next four hours, etc. Value = log_2(hours). If this is true, then you should split your hours such that log_2(hourstowards80project)*0.8 + log_2(hourstoward20project)*0.2 is maximized, which I think means that you should distribute your hours across projects proportional to their probability. (I don't know much math so I'm not confident I'm doing this right.) Value of information I hadn't even considered, but maybe we can bundle it up with diminishing returns and say it's part of the reason returns diminish.
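
The proportional-split claim can be checked numerically. This is a rough sketch under the log-returns model described above, with total hours normalized to 1 (an assumption; rescaling the total only adds a constant to the objective). The names `split_value` and `candidates` are hypothetical.

```python
import math


def split_value(x: float, p: float = 0.8) -> float:
    """Expected value when a fraction x of hours goes to the project that is
    most important with probability p, and 1 - x goes to the other project,
    assuming value = log_2(hours) for each project."""
    return p * math.log2(x) + (1 - p) * math.log2(1 - x)


# Grid search over splits strictly between 0 and 1 (log_2(0) is undefined).
candidates = [i / 1000 for i in range(1, 1000)]
best = max(candidates, key=split_value)
print(best)
```

Setting the derivative to zero gives 0.8/x = 0.2/(1 - x), so x = 0.8, which agrees with the grid search. Under linear rather than logarithmic returns the same calculation would instead favor putting everything on the 80% project.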
Defusing AGI Danger

I absolutely agree that I'm not arguing for "safety by default".

I don't quite agree that you should split effort between strategies, i.e. it seems likely that if you think 80% disaster by default, you should dedicate 100% of your efforts to that world.

3 Daniel Kokotajlo 5mo: OK, interesting. Well, here's my argument for effort-splitting then: There are probably diminishing returns to pursuing each strategy. In research in general, ideas and questions tend to cross-pollinate, etc. And if you are 20% confident that research project X is the most important, and 80% that research project Y is most important, and they are both on a similar topic, this seems like a classic case where you should do both (but with more effort towards Y). This is more of an intuition than an argument, I guess. But what do you think?
Operationalizing compatibility with strategy-stealing

Using the perspective from The ground of optimization means you can get rid of the action space and just say "given some prior and some utility function, what percentile of the distribution does this system tend to evolve towards?" (where the optimization power is again the log of this percentile)

We might then say that an optimizing system is compatible with strategy stealing if it's retargetable for a wide set of utility functions in a way that produces an optimizing system that has the same amount of optimization power.
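
One way to make "the log of this percentile" concrete is sketched below. It assumes a finite, uniformly weighted sample from the prior, base-2 logarithms, and the sign convention that more power is a larger positive number; none of these choices are fixed by the comment above, and the function name `optimization_power` is hypothetical.

```python
import math
from typing import Callable, Sequence


def optimization_power(prior_samples: Sequence[float],
                       utility: Callable[[float], float],
                       reached_state: float) -> float:
    """Bits of optimization: -log_2 of the fraction of prior samples whose
    utility is at least that of the state the system evolved towards.
    Assumes reached_state lies in the prior's support, so the fraction is > 0."""
    reached = utility(reached_state)
    at_least_as_good = sum(1 for s in prior_samples if utility(s) >= reached)
    return -math.log2(at_least_as_good / len(prior_samples))


states = list(range(1024))  # toy uniform prior over 1024 states

# The system pushes towards high states under utility u(s) = s.
power = optimization_power(states, lambda s: s, reached_state=1020)

# Retargeted to u(s) = -s, it pushes equally hard towards low states.
retargeted_power = optimization_power(states, lambda s: -s, reached_state=3)
print(power, retargeted_power)
```

In this toy case both runs land in the top 4 of 1024 states under their respective utility functions, so both score 8 bits, which is the kind of equality the retargetability condition asks for.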

An AI that is compatible with strat... (read more)

Defusing AGI Danger

Thanks! Also, oops - fixed.

Understanding “Deep Double Descent”

This post gave a slightly better understanding of the dynamics happening inside SGD. I think deep double descent is strong evidence that something like a simplicity prior exists in SGD, which might have actively bad generalization properties, e.g. by incentivizing deceptive alignment. I remain cautiously optimistic that approaches like Learning the Prior can circumnavigate this problem.

A space of proposals for building safe advanced AI

I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards.

This seems like a reasonable way to think of debate.

I think, in practice (if this even means anything), the power of debate is quite bounded by the power of the human, so some other technique is needed to make the human capable of supervising complex debates, e.g. imitative amplification.

A space of proposals for building safe advanced AI

Debate: train M* to win debates against Amp(M).

I think Debate is closer to "train M* to win debates against itself as judged by Amp(M)".

2 Richard Ngo 6mo: Wouldn't it just be "train M* to win debates against itself as judged by H"? Since in the original formulation of debate a human inspects the debate transcript without assistance. Anyway, I agree that something like this is also a reasonable way to view debate. In this case, I was trying to emphasise the similarities between Debate and the other techniques: I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards. Maybe an easier way to visualise this is that, given some question, M* answers that question, and then Amp(M) tries to identify any flaws in the argument by interrogating M*, and rewards M* if no flaws can be found.
Does SGD Produce Deceptive Alignment?

Yep. Meant to say "if a model knew that it was in its last training episode and it wasn't going to be deployed." Should be fixed.

Introduction to Cartesian Frames

This is very exciting. Looking forward to the rest of the sequence.

As I was reading, I found myself reframing a lot of things in terms of the rows and columns of the matrix. Here's my loose attempt to rederive most of the properties under this view.

  • The world is a set of states. One way to think about these states is by putting them in a matrix, which we call a "Cartesian frame." In this frame, the rows of the matrix are possible "agents" and the columns are possible "environments".
    • Note that you don't have to put all the states in the matrix.
  • Ensurable
... (read more)
Introduction to Cartesian Frames

In 4.1:

Given a0 and a1, since S∈Obs(C), there exists an a2∈A such that for all e∈E, we have a2∈if(S,a0,a1). Then, since T∈Obs(C), there exists an a3∈A such that for all e∈E, we have a3∈if(S,a0,a2). Unpacking and combining these, we get for all e∈E, a3∈if(S∪T,a0,a1). Since we could construct such an a3 from an arbitrary a0,a1∈A, we know that S∪T∈Obs(C). □

I think there's a typo here. Should be a3∈if(T,a0,a2), not a3∈if(S,a0,a2).

(also not sure how to copy latex properly).

1 Scott Garrabrant 7mo: Yep. Fixed. Thanks.
The Solomonoff Prior is Malign

I personally see no fundamental difference between direct and indirect ways of influence, except in so far as they relate to stuff like expected value.

I agree that given the amount of expected influence, other universes are not high on my priority list, but they are still on my priority list. I expect the same for consequentialists in other universes. I also expect consequentialist beings that control most of their universe to get around to most of the things on their priority list, hence I expect them to influence the Solomonoff prior.

Understanding “Deep Double Descent” gives a theoretical argument that suggests SGD will converge to a point that is very close in L2 norm to the initialization. Since NNs are often initialized with extremely small weights, this amounts to implicit L2 regularization. 
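
A toy linear analogue of "converging close to the initialization" is sketched below. It is not the argument from the referenced post: it only illustrates the standard fact that plain gradient descent on an underdetermined least-squares problem ends at the solution nearest (in L2 norm) to where it started. The problem instance and the name `gradient_descent` are made up for illustration.

```python
def gradient_descent(a, b, w0, lr=0.05, steps=2000):
    """Minimize 0.5 * (a . w - b)^2 with plain gradient descent from w0."""
    w = list(w0)
    for _ in range(steps):
        residual = sum(ai * wi for ai, wi in zip(a, w)) - b
        # The gradient is residual * a, so w only ever moves along a.
        w = [wi - lr * residual * ai for ai, wi in zip(a, w)]
    return w


# One equation, two unknowns: w1 + 2*w2 = 5 has infinitely many solutions.
a, b = [1.0, 2.0], 5.0
w_zero = gradient_descent(a, b, w0=[0.0, 0.0])   # start at the origin
w_far = gradient_descent(a, b, w0=[3.0, -1.0])   # start elsewhere
print(w_zero, w_far)
```

From a zero initialization the run ends near (1, 2), the minimum-norm solution, which is what an explicit L2 penalty would also favor. From (3, -1) it ends near (3.8, 0.6), the solution closest to that starting point, so the implicit regularization is towards the initialization rather than towards zero as such.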

Forecasting Thread: AI Timelines

My rough take:

3 buckets, similar to Ben Pace's 

  1. 5% chance that current techniques just get us all the way there, e.g. something like GPT-6 is basically AGI
  2. 10% chance AGI doesn't happen this century, e.g. humanity sort of starts taking this seriously and decides we ought to hold off + the problem being technically difficult enough that small groups can't really make AGI themselves
  3. 50% chance that something like current techniques and some number of new insights gets us to AGI. 

If I thought about this ... (read more)