All of danieldewey's Comments + Replies

This caused me to find your substack! Sorry I missed it earlier, looking forward to catching up.

FWIW, I found the Strawberry Appendix especially helpful for understanding how this approach to ELK could solve (some form of) outer alignment.

Other readers, consider looking at the appendix even if you don't feel like you fully understand the main body of the post!

1Victoria Krakovna1y
+1. This section follows naturally from the rest of the article, and I don't see why it's labeled as an appendix - this seems like it would unnecessarily discourage people from reading it.

Nice post! I see where you're coming from here.

(ETA: I think what I'm saying here is basically "3.5.3 and 3.5.4 seem to me like they deserve more consideration, at least as backup plans -- I think they're less crazy than you make them sound." So I don't think you missed these strategies, just that maybe we disagree about how crazy they look.)

I haven't thought this through all the way yet, and don't necessarily endorse these strategies without more thought, but: 

It seems like there could be a category of strategies for players with "good" AGIs to prepa... (read more)

Thanks for the post, I found it helpful! The "competent catastrophes" direction sounds particularly interesting.

This is extremely cool -- thank you, Peter and Owen! I haven't read most of it yet, let alone the papers, but I have high hopes that this will be a useful resource for me.

5Alex Turner3y
I agree. I've put it in my SuperMemo and very much look forward to going through it. Thanks Peter & Owen!

Thanks for the post! FWIW, I found this quote particularly useful:

Well, on my reading of history, that means that all sorts of crazy things will be happening, analogous to the colonialist conquests and their accompanying reshaping of the world economy, before GWP growth noticeably accelerates!

The fact that it showed up right before an eye-catching image probably helped :)

1Daniel Kokotajlo3y
Thanks! Self-nitpick: Now that you draw my attention to the meme again, I notice that for the first few panels AGI comes before Industry but in the last panel Conquistadors comes before Persuasion Tools. This bugs me. Does it bug anyone else? Should I redo it with Industry first throughout?

This may be out-of-scope for the writeup, but I would love to get more detail on how this might be an important problem for IDA.

3Beth Barnes3y
Yep, planning to put up a post about that soon. The short argument is something like: the equivalent of an obfuscated argument for IDA is a decomposition that includes questions the model doesn't know how to answer. We can't always tell the difference between an IDA tree that uses an obfuscated decomposition and gets the wrong answer, vs. an IDA tree that uses a good decomposition and gets the right answer, without unpacking the entire tree.

Thanks for the writeup! This Google doc (linked near "raised this general problem" above) appears to be private.

This seems like a useful lens -- thanks for taking the time to post it!

Thanks for writing this -- I think it's a helpful kind of reflection for people to do!

Ah, gotcha. I'll think about those points -- I don't have a good response. (Actually adding "think about"+(link to this discussion) to my todo list.)

It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.

Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?

1Wei Dai7y
Interesting question. I guess it depends on the form that the metaphilosophical knowledge arrives in, but it's currently hard to see what that could be. I can only think of two possibilities, but neither seems highly plausible.

  1. It comes as a set of instructions that humans (or emulations/models of humans) can use to safely and correctly process philosophical arguments, along with justifications for why those instructions are safe/correct. Kind of like a detailed design for meta-execution along with theory/evidence for why it works. But natural language is fuzzy and imprecise, and humans are full of unknown security holes, so it's hard to see how such instructions could possibly make us safe/correct, or what kind of information could possibly make us confident of that.
  2. It comes as a set of algorithms for reasoning about philosophical problems in a formal language, along with instructions/algorithms for how to translate natural-language philosophical problems/arguments into this formal language, and justifications for why these are all safe/correct. But this kind of result seems very far from any of our current knowledge bases, nor does it seem compatible with any of the current trends in AI design (including things like deep learning, decision-theory-based ideas, and Paul's kinds of designs).

So I'm not very optimistic that a metaphilosophical approach will succeed either. If it ultimately does, it seems like maybe there will have to be some future insights whose form I can't foresee. (Edit: Either that or a lot of time free from arms-race pressure to develop the necessary knowledge base and compatible AI design for 2.)

These objections are all reasonable, and 3 is especially interesting to me -- it seems like the biggest objection to the structure of the argument I gave. Thanks.

I'm afraid that the point I was trying to make didn't come across, or that I'm not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul's are not amenable to any kind of argument for confidence, and we will only ever be able to say "well, I ran out of ideas for how to break it", so I wanted to sketch an argument structure to exp

... (read more)
0Wei Dai7y
I guess my point was that any argument for confidence will likely be subject to the kinds of problems I listed, and I don't see a realistic plan on Paul's (or anyone else's) part to deal with them. It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level. Without that, I don't see how we can reason about a system that can encounter philosophical arguments, and make strong conclusions about whether it's able to process them correctly. This seems intuitively obvious to me, but I don't totally rule out that there is some sort of counterintuitive approach that could somehow work out.

"naturally occurring" means "could be inputs to this AI system from the rest of the world"; naturally occurring inputs don't need to be recognized, they're here as a base case for the induction. Does that make sense?

If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren't, I'd guess that possible input single pages of text aren't value-corrupting in an hour. (I would certainly want a much better answer than "I guess it's f

... (read more)
2Wei Dai7y
Yes, that clarifies what you were intending to say. Paul typically assumes a need to compete with other AIs with comparable resources, so I wasn't expecting "naturally occurring" to mean coming from an environment with no other powerful reasoners. I think if we're in an actual non-competitive scenario (and hence can tolerate large inefficiencies in our AI design), then some sort of argument like this can possibly be made to work, but it will probably be much trickier than you seem to suggest. Here are some problems I can see aside from the ones you've already acknowledged.

  1. We probably don't need "powerful reasoners" to produce value-corrupting single pages of text, just a model of Som and relatively crude optimization techniques. In other words, for this system to be safe you probably need more of a tech lead/monopoly than you had in mind. Exactly how much is needed seems hard to know, so how do you achieve high confidence of safety?
  2. Presumably you're building this system to be at least as capable as a human but at less risk of value drift. In order to do work comparable to a human contemplating the input query over a few days or years, the system needs some way of transmitting information between the short-lived Soms. Given that, what is preventing the overall system from undergoing value drift because, for example, one of the Soms has an (incorrect and value-corrupting) philosophical epiphany and transmits it to other Soms through this communications channel? Nothing in your argument seems to adequately deal with this. If this happens by pure chance then you can try to wash it out by averaging over different "noise seeds", but what if Som has a tendency towards certain incorrect lines of argument, either naturally or because you placed him in a weird environment like this one?
  3. The output of this system isn't "naturally occurring", so subsequent inputs to it won't be either. If we're to use this system a second time in a way that preserves your "non-cor

My comment, for the record:

I'm glad to see people critiquing Paul's work -- it seems very promising to me relative to other alignment approaches, so I put high value on finding out about problems with it. By your definition of "benign", I don't think humans are benign, so I'm not going to argue with that. Instead, I'll say what I think about building aligned AIs out of simulated human judgement.

I agree with you that listing and solving problems with such systems until we can't think of more problems is unsatisfying, and that we should have positive argumen

... (read more)
0Wei Dai7y
>(e.g. almost no naturally occurring single pages of text are value-corrupting in an hour) I don't see what "naturally occurring" here could mean (even informally) that would both make this statement true and make it useful to try to design a system that could safely process "naturally occurring single pages of text". And how would a system like this know whether a given input is "naturally occurring" and hence safe to process? Please explain?

I also commented there last week and am awaiting moderation. Maybe we should post our replies here soon?

If I read Paul's post correctly, ALBA is supposed to do this in theory -- I don't understand the theory/practice distinction you're making.

0Stuart Armstrong7y
I disagree. I'm arguing that the concept of "aligned at a certain capacity" makes little sense, and this is key to ALBA in theory.

I'm not sure you've gotten quite ALBA right here, and I think that causes a problem for your objection. Relevant writeups: most recent and original ALBA.

As I understand it, ALBA proposes the following process:

  1. H trains A to choose actions that would get the best immediate feedback from H. A is benign (assuming that H could give not-catastrophic immediate feedback for all actions and that the learning process is robust). H defines the feedback, and so A doesn't make decisions that are more effective at anything than H is; A is just faster.
  2. A (and possibly
... (read more)
0Stuart Armstrong7y
This is roughly how I would run ALBA in practice, and why I said it was better in practice than in theory. I'd be working with considerations I mentioned in this post and try and formalise how to extend utilities/rewards to new settings.

FWIW, this also reminded me of some discussion in Paul's post on capability amplification, where Paul asks whether we can even define good behavior in some parts of capability-space, e.g.:

The next step would be to ask: can we sensibly define “good behavior” for policies in the inaccessible part H? I suspect this will help focus our attention on the most philosophically fraught aspects of value alignment.

I'm not sure if that's relevant to your point, but it seemed like you might be interested.

Discussed briefly in Concrete Problems, FYI.

This is a neat idea! I'd be interested to hear why you don't think it's satisfying from a safety point of view, if you have thoughts on that.

Yes, as Owen points out, there are general problems with reduced impact that apply to this idea, i.e. measuring long-term impacts.
I was mostly a gut-feeling when I posted, but let me try and articulate a few: 1. It relies on having a good representation. Small problems with the representation might make it unworkable. Learning a good enough representation and verifying that you've done so doesn't seem very feasible. Impact may be missed if the representation doesn't properly capture unobserved things and long-term dependencies. Things like the creation of sub-agents seem likely to crop up in subtle, hard to learn, ways. 2. I haven't looked into it, but ATM I have no theory about when this scheme could be expected to recover the "correct" model (I don't even know how that would be defined... I'm trying to "learn" my way around the problem :P) To put #1 another way, I'm not sure that I've gained anything compared with proposals to penalize impact in the input space, or some learned representation space (with the learning not directed towards discovering impact). On the other hand, I was inspired to consider this idea when thinking about Yoshua's proposal about causal disentangling mentioned at the end of his Asilomar talk here: This (and maybe some other similar work, e.g. on empowerment) seem to provide a way to direct an agent's learning towards maximizing its influence, which might help... although having an agent learn based on maximizing its influence seems like a bad idea... but I guess you might be able to then add a conflicting objective (like a regularizer) to actually limit the impact... So then you'd end up with some sort of adversarial-ish set-up, where the agent is trying to both: 1. maximize potential impact (i.e. by understanding its ability to influence the world) 2. minimize actual impact (i.e. by refraining from taking actions which turn out (eventually) to have a large impact). Having just finished typing this, I feel more optimistic about this last proposal than the original idea :D We want an agent to learn about how to
2Owen Cotton-Barratt7y
Seems to me like there are a bunch of challenges. For example you need extra structure on your space to add things or tell what's small; and you really want to keep track of long-term impact not just at the next time-step. Particularly the long-term one seems thorny (for low-impact in general, not just for this). Nevertheless I think this idea looks promising enough to explore further, would also like to hear David's reasons.

Thanks for writing this, Jessica -- I expect to find it helpful when I read it more carefully!

Thanks. I agree that these are problems. It seems to me that the root of these problems is logical uncertainty / Vingean reflection (which seem like two sides of the same coin); I find myself less confused when I think about self-modeling as being basically an application of "figuring out how to think about big / self-like hypotheses". Is that how you think of it, or are there aspects of the problem that you think are missed by this framing?

1Jessica Taylor8y
Yes, this is also how I think about it. I don't know anything specific that doesn't fit into this framing.

Thanks Jessica. This was helpful, and I think I see more what the problem is.

Re point 1: I see what you mean. The intuition behind my post is that it seems like it should be possible to make a bounded system that can eventually come to hold any computable hypothesis given enough evidence, including a hypothesis including a model of itself of arbitrary precision (which is different from Solomonoff, which can clearly never think about systems like itself). It's clearly not possible for the system to hold and update infinitely many hypotheses the way Solomono

... (read more)
1Jessica Taylor8y
Re point 1: Suppose the agent considers all hypotheses of length up to l bits that run in up to t time. Then the agent takes 2^l·t time to run. For an individual hypothesis to reason about the agent, it must use t computation time to reason about a computation of size 2^l·t. A theoretical understanding of how this works would solve a large part of the logical uncertainty / naturalized induction / Vingean reflection problem. Maybe it's possible for this to work without having a theoretical understanding of why it works, but the theoretical understanding is useful too (it seems like you agree with this). I think there are some indications that naive solutions won't automatically work; see e.g. this post.

Re point 2: It seems like this is learning a model from the state and action to state, and a model from state to state that ignores the agent. But it isn't learning a model that e.g. reasons about the agent's source code to predict the next state. An integrated model should be able to do reasoning like this.

Re point 3: I think you still have a Vingean reflection problem if a hypothesis that runs in t time predicts a computation of size 2^l·t. Reflective Solomonoff induction solves a problem with an unrealistic computation model, and doesn't translate to a solution with a finite (but large) amount of computing resources. The main part not solved is the general issue of predicting aspects of large computations using a small amount of computing power.
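A toy cost accounting for the brute-force enumeration in point 1; the specific numbers here are my own illustration, not from the comment:

```python
# An agent that runs every hypothesis of length <= l bits for t steps each
# does on the order of 2**l * t work per prediction, while any single
# hypothesis only gets a budget of t steps.

def agent_runtime(l, t):
    """Total work for brute-force enumeration of all <= l-bit hypotheses."""
    return (2 ** l) * t

l, t = 20, 1000
total_work = agent_runtime(l, t)   # work the whole agent does
hypothesis_budget = t              # work any single hypothesis gets

# A hypothesis trying to reason about the agent that contains it must model
# a computation 2**l times larger than its own budget -- the Vingean gap:
gap = total_work // hypothesis_budget
print(gap)  # 1048576 (= 2**20)
```

The point of the sketch is just the ratio: the gap grows as 2^l, so no amount of clever constant-factor engineering lets a hypothesis simulate the agent directly; it would need some abstract (Vingean) way of reasoning about the larger computation.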

Thanks, Paul -- I missed this response earlier, and I think you've pointed out some of the major disagreements here.

I agree that there's something somewhat consequentialist going on during all kinds of complex computation. I'm skeptical that we need better decision theory to do this reliably -- are there reasons or intuition-pumps you know of that have a bearing on this?

1Paul Christiano8y
I mentioned two (which I don't find persuasive):

  1. Different decision theories / priors / etc. are reflectively consistent, so you may want to make sure to choose the right ones the first time. (I think that the act-based approach basically avoids this.)
  2. We have encountered some surprising possible failure modes, like blackmail by distant superintelligences, and might be concerned that we will run into new surprises if we don't understand consequentialism well.

I guess there is one more:

  3. If we want to understand what our agents are doing, we need to have a pretty good understanding of how effective decision-making ought to work. Otherwise algorithms whose consequentialism we understand will tend to be beaten out by algorithms whose consequentialism we don't understand. This may make alignment way harder.

Thanks Jessica, I think we're on similar pages -- I'm also interested in how to ensure that predictions of humans are accurate and non-adversarial, and I think there are probably a lot of interesting problems there.

Thanks Jessica -- sorry I misunderstood about hijacking. A couple of questions:

  • Is there a difference between "safe" and "accurate" predictors? I'm now thinking that you're worried about NTMs basically making inaccurate predictions, and that accurate predictors of planning will require us to understand planning.

  • My feeling is that today's understanding of planning -- if I run this computation, I will get the result, and if I run it again, I'll get the same one -- is sufficient for harder prediction tasks. Are there particular aspects of planni

... (read more)
1Jessica Taylor8y
* A very accurate predictor will be safe. A predictor that is somewhat accurate but not very accurate could be unsafe. So yes, I'm concerned that with a realistic amount of computing resources, NTMs might make dangerous partially-accurate predictions, even though they would make safe accurate predictions with a very large amount of computing resources. This seems like it will be true if the NTM is predicting the human's actions by trying to infer the human's goal and then outputting a plan towards this goal, though perhaps there are other strategies for efficiently predicting a human. (I think some of the things I said previously were confusing -- I said that it seems likely that an NTM can learn to plan well unsafely, which seems confusing since it can only be unsafe by making bad predictions. As an extreme example, perhaps the NTM essentially implements a consequentialist utility maximizer that decides what predictions to output; these predictions will be correct sometimes and incorrect whenever it is in the consequentialist utility maximizer's interest).
* It seems like current understanding of planning is already running into bottlenecks -- e.g. see the people working on attention for neural networks. If the NTM is predicting a human making plans by inferring the human's goal and planning towards this goal, then there needs to be some story for what e.g. decision theory and logical uncertainty it's using in making these plans. For it to be the right decision theory, it must have this decision theory in its hypothesis space somewhere. In situations where the problem of planning out what computations to do to predict a human is as complicated as the human's actual planning, and the human's planning involves complex decision theory (e.g. the human is writing a paper on decision theory), this might be a problem. So you might need to understand some amount of decision theory / logical uncertainty to make this predictor. (Note that I'm not completely sold on this

I agree with paragraphs 1, 2, and 3. To recap, the question we're discussing is "do you need to understand consequentialist reasoning to build a predictor that can predict consequentialist reasoners?"

A couple of notes on paragraph 4:

  • I'm not claiming that neural nets or NTMs are sufficient, just that they represent the kind of thing I expect to increasingly succeed at modeling human decisions (and many other things of interest): model classes that are efficiently learnable, and that don't include built-in planning faculties.
  • You are bringing up understand
... (read more)
0Jessica Taylor8y
Regarding paragraph 4: I see more now what you're saying about NTMs. In some sense NTMs don't have "built-in" planning capabilities; to the extent that they plan well, it's because they learned that transition functions that make plans work better to predict some things. I think it's likely that you can get planning capabilities in this manner, without actually understanding how the planning works internally. So it seems like there isn't actually disagreement on this point (sorry for misinterpreting the question). The more controversial point is that you need to understand planning to train safe predictors of humans making plans.

I don't think I was bringing up consequentialist hypotheses hijacking the system in this paragraph. I was noting the danger of having a system (which is in some sense just trying to predict humans well) output a plan it thinks a human would produce after thinking a very long time, given that it is good at predicting plans toward an objective but bad at predicting the humans' objective.

Regarding paragraph 5: I was trying to say that you probably only need primitive planning abilities for a lot of prediction tasks, in some cases ones we already understand today. For example, you might use a deep neural net for deciding which weather simulations are worth running, and reinforce the deep neural net on the extent to which running the weather simulation changed the system's accuracy. This is probably sufficient for a lot of applications.

Thanks, Jessica. This argument still doesn't seem right to me -- let me try to explain why.

It seems to me like something more tractable than Solomonoff induction, like an approximate cognitive-level model of a human or the other kinds of models that are being produced now (or will be produced in the future) in machine learning (neural nets, NTMs, other etc.), could be used to approximately predict the actions of humans making plans. This is how I expect most kinds of modeling and inference to work, about humans and about other systems of interest in the w

... (read more)
2Jessica Taylor8y
It seems like an important part of how humans make plans is that we use some computations to decide what other computations are worth performing. Roughly, we use shallow pattern recognition on a question to determine what strategy to use to think further thoughts, and after thinking those thoughts use shallow pattern recognition to figure out what thought to have after that, eventually leading to answering the question. (I expect the brain's actual algorithm to be much more complicated than this simple model, but to share some aspects of it.)

A system predicting what a human would do would presumably also have to figure out which further thoughts are worth thinking, upon being asked to predict how a human answers a question. For example, if I'm answering a complex math question that I have to break into parts to solve it, then for the system to predict my (presumably correct) answer, it might also break the problem into pieces and solve each piece. If it's bad at determining which thoughts are worth thinking to predict the human's answer (e.g. it chooses to break the problem into unhelpful pieces), then it will think thoughts that are not very useful for predicting the answer, so it will not be very effective without a huge amount of hardware. I think this is clear when the human is thinking for a long time (e.g. 2 weeks) and less clear for much shorter time periods (e.g. 1 minute, which you might be able to do with shallow pattern recognition in some cases?).

At the point where the system is able to figure out what thoughts to think in order to predict the human well, its planning to determine which thoughts to think looks at least as competent as a human's planning to answer the question, without necessarily using similar intermediate steps in the plan.

It seems like ordinary neural nets can't decide what to think about (they can only recognize shallow patterns), and perhaps NTMs can. But if an NTM could predict how I answer some questions well (because it's able t

"Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place."

I've had this conversation with Nate before, and I don't understand why I should think it's true. Presumably we think we will eventually be able to make predictors that predict a wide variety of systems without us understanding every interesting subset ahead of time, right? Why are consequentialists different?

3Paul Christiano8y
Here is my understanding of the argument: (Warning, very long post, partly thinking out loud. But I endorse the summary. I would be most interested in Eliezer's response.)

* Something vaguely "consequentialist" is an important part of how humans reason about hard cognitive problems of all kinds (e.g. we must decide what cognitive strategy to use, what to focus our attention on, what topics to explore and which to ignore).
* It's not clear what prediction problems require this kind of consequentialism and what kinds of prediction problems can be solved directly by a brute force search for predictors. (I think Ilya has suggested that the cutoff is something like "anything a human can do in 100ms, you can train directly.")
* However, the behavior of an intelligent agent is in some sense a "universal" example of a hard-to-predict-without-consequentialism phenomenon.
* If someone claims to have a solution that "just" requires a predictor, then they haven't necessarily reduced the complexity of the problem, given that a good predictor depends on something consequentialist. If the predictor only needs to apply in some domain, then maybe the domain is easy and you can attack it more directly. But if that domain includes predicting intelligent agents, then it's obviously not easy.
* Actually building an agent that solves these hard prediction problems will probably require building some kind of consequentialism. So it offers just as much opportunity to kill yourself as the general AI problem.
* And if you don't explicitly build in consequentialism, then you've just made the situation even worse. There is still probably consequentialism somewhere inside your model, you just don't even understand how it works because it was produced by a brute force search.

I think that this argument is mostly right. I also think that many thoughtful ML researchers would agree with the substantive claims, though they might disagree about language. We aren't going to be able to directly
1Jessica Taylor8y
Here's the argument as I understand it (paraphrasing Nate).

If we have a system predict a human making plans, then we need some story for why it can do this effectively. One story is that, like Solomonoff induction, it's learning a physical model of the human and simulating the human this way. However, in practice, this is unlikely to be the reason an actual prediction engine predicts humans well (it's too computationally difficult). So we need some other argument for why the predictor might work.

Here's one argument: perhaps it's looking at a human making plans, figuring out what humans are planning towards, and using its own planning capabilities (towards the same goal) to predict what plans the human will make. But it seems like, to be confident that this will work, you need to have some understanding of how the predictor's planning capabilities work. In particular, humans trying to study correct planning run into some theoretical problems including decision theory, and it seems like a system would need to answer some of these same questions in order to predict humans well.

I'm not sure what to think of this argument. Paul's current proposal contains reinforcement learning agents who plan towards an objective defined by a more powerful agent, so it is leaning on the reinforcement learner's ability to plan towards desirable goals. Rather than understand how the reinforcement learner works internally, Paul proposes giving the reinforcement learner a good enough objective (defined by the powerful agent) such that optimizing this objective is equivalent to optimizing what the humans want. This raises some problems, so probably some additional ingredient is necessary. I suspect I'll have better opinions on this after thinking about the informed oversight problem some more.

I also asked Nate about the analogy between computer vision and learning to predict a human making plans. It seems like computer vision is an easier problem for a few reasons: it doesn't require

Cool, thanks; sounds like I have about the same picture. One missing ingredient for me that was resolved by your answer, and by going back and looking at the papers again, was the distinction between consistency and soundness (on the natural numbers), which is not a distinction I think about often.

In case it's useful, I'll note that the procrastination paradox is hard for me to take seriously on an intuitive level, because some part of me thinks that requiring correct answers in infinite decision problems is unreasonable; so many reasoning systems fail on

... (read more)

I don't (confidently) understand why the procrastination paradox indicates a problem to be solved. Could you clarify that for me, or point me to a clarification?

First off, it doesn't seem like this kind of infinite buck-passing could happen in real life; is there a real-life (finite?) setting where this type of procrastination leads to bad actions? Second, it seems to me that similar paradoxes often come up in other situations where agents have infinite time horizons and can wait as long as they want -- does the problem come from the infinity, or from some

... (read more)
0Jessica Taylor9y
It is definitely a problem with infinite buck-passing. It is probably possible to prove optimality if we have a continuous utility function (e.g. we're using discounting). I think we might actually want a continuous utility function, but maybe not. Is there any time t such that you would consider it almost as good for a wonderful human civilization to exist for t steps and then die, compared to existing indefinitely?

The way I would express the procrastination paradox is something like:

  1. There's the tiling agents problem: we want AIs to construct successors that they trust to make correct decisions.
  2. It would be desirable to have a system where an infinite sequence of AIs each trust the next one. If it worked, this would solve the tiling agents problem.
  3. But, if we have something like this, then it will be unsound: it will prove that the button will eventually get pressed, even though it will never actually get pressed. We can construct things that do press the button, but they don't have the property of trusting successors that is desirable in some ways.

Due to their handling of recursion, Paul's logic and reflective oracles are both candidates for solving the tiling agents problem; however, they both fail the procrastination paradox (when it's set up this way).
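A minimal numeric sketch (my own toy model, not from the comment) of why a discounted utility dissolves the infinite buck-passing: with the discontinuous "button eventually pressed" utility, deferring one more step never costs anything, so "always defer" looks fine at every finite stage while being worst in the limit; with discount factor gamma < 1, pressing at step n is worth gamma^n, so any policy that defers forever gets value 0 and pressing immediately is strictly optimal.

```python
# Toy model of the procrastination paradox (assumed setup for illustration).
# Utility is 1 if the button is ever pressed, 0 otherwise.

def undiscounted_value(press_step):
    """press_step = None means 'defer forever'."""
    return 0.0 if press_step is None else 1.0

# At every finite stage, deferring one more step loses nothing...
assert undiscounted_value(10) == undiscounted_value(10 ** 6) == 1.0
# ...yet the limit policy "always defer" gets nothing:
assert undiscounted_value(None) == 0.0

# With a discounted (continuous) utility, procrastination is strictly penalized:
def discounted_value(press_step, gamma=0.9):
    return 0.0 if press_step is None else gamma ** press_step

assert discounted_value(0) > discounted_value(1) > discounted_value(None)
```

This also illustrates the cost Jessica mentions: the discounted utility is the one that answers "yes" to her question about a civilization existing for t steps versus indefinitely, once gamma^t is small enough.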

It seems that if it is desired, the overseer could also set their behaviour and intentions so that the approval-directed agent acts as we would want an oracle or tool to act. This is a nice feature.

I think Nick Bostrom and Stuart Armstrong would also be interested in this, and might have good feedback for you.

High-level feedback: this is a really interesting proposal, and looks like a promising direction to me! Most of my inline comments on Medium are more critical, but that doesn't reflect my overall assessment.

I would be curious to see more thoughts on this from people who have thought more than I have about stable/reliable self-improvement/tiling. Broadly speaking, I am also somewhat skeptical that it's the best problem to be working on now. However, here are some considerations in favor:

It seems plausible to me that an AI will be doing most of the design work before it is a "human-level reasoner" in your sense. The scenario I have in mind is a self-improvement cycle by a machine specialized in CS and math, which is either better than humans at these things, or

... (read more)
0Eliezer Yudkowsky9y
I'll very quickly remark that I think that the competence gap is indeed the main issue. Imagine an AI built to a level where it was as smart as all the mathematicians who could work on the problem in advance, but able to do the same work faster, which didn't use any self-improvement along the way, and which was otherwise within a Friendliness framework that well-decided its preferences among whatever decision framework would control whatever stability framework it invented. Then clearly there's no advantage to trying to do the work in advance. But I think the competence gap is much larger than that zero level.

I wonder if this example can be used to help pin down desiderata for decisions or decision counterfactuals. What axiom(s) for decisions would avoid this general class of exploits?

Hm, I don't know what the definition is either. In my head, it means "can get an arbitrary amount of money from", e.g. by taking it around a preference loop as many times as you like. In any case, glad the feedback was helpful.

Nice example! I think I understood better why this picks out the particular weakness of EDT (and why it's not a general exploit that can be used against any DT) when I thought of it less as a money-pump and more as "Not only does EDT want to manage the news, you can get it to pay you a lot for the privilege".

2Abram Demski9y
There is a nuance that needs to be mentioned here. If the EDT agent is aware of the researcher's ploys ahead of time, it will set things up so that emails from the researcher go straight to the spam folder, block the researcher's calls, and so on. It is not actually happy to pay the researcher for managing the news! This is less pathological than listening to the researcher and paying up, but it's still an odd news-management strategy that results from EDT.
0Benya Fallenstein9y
Thanks! I didn't really think at all about whether or not "money-pump" was the appropriate word (I'm not sure what the exact definition is); have now changed "way to money-pump EDT agents" into "way to get EDT agents to pay you for managing the news for them".
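The general mechanism here — an agent paying for actions that are good *news* rather than good *causes* — can be sketched with toy numbers. (These numbers and the common-cause setup are my own illustration, not the construction from the original post.) EDT ranks actions by E[U | action] under a joint distribution, so an action that merely correlates with a good state, without causing it, looks worth paying a fee for.

```python
# Toy illustration of EDT paying for "good news": the action "signal"
# is correlated with the good state via a common cause, but does not
# cause it. EDT still pays a fee to take it.

# Joint P(state, action): in good states the agent tends to signal,
# in bad states it tends to stay quiet.
joint = {
    ("good", "signal"): 0.4, ("good", "quiet"): 0.1,
    ("bad",  "signal"): 0.1, ("bad",  "quiet"): 0.4,
}
utility = {"good": 100.0, "bad": 0.0}
fee = 10.0  # what the "researcher" charges for the signalling action

def edt_value(action):
    """EDT's score: E[U | action], minus the fee for signalling."""
    p_action = sum(p for (s, a), p in joint.items() if a == action)
    eu = sum(p * utility[s] for (s, a), p in joint.items()
             if a == action) / p_action
    return eu - (fee if action == "signal" else 0.0)

# Conditioning on "signal" is good news, so EDT pays the fee even
# though signalling changes nothing about the underlying state.
print(edt_value("signal"), edt_value("quiet"))  # 70.0 20.0
assert edt_value("signal") > edt_value("quiet")
```

A causal decision theorist, holding the state distribution fixed across actions, would see both actions as delivering the same expected utility and would refuse to pay the fee — which is the contrast the "pay you for the privilege of managing the news" framing is pointing at.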

I'd love to see what this approach looked like worked out in more detail. I think there would be a lot of benefit to making this kind of result accessible to computer scientists without strong math backgrounds.

Thanks Benja, Elliott, and Vladimir for creating the site -- it looks great.