All of adamShimi's Comments + Replies

Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers

I've been wanting to try SuperMemo for a while, especially given the difficulty that you mention with making Anki cards. But it doesn't run natively on linux AFAIK, and I can't be bothered for the moment to make it work using wine.

2Alex Turner2hApparently VMs are the way to go for pdf support on linux.
Identifiability Problem for Superrational Decision Theories

As outlined in the last paragraph of the post. I want to convince people that TDT-like decision theories won't give a "neat" game theory, by giving an example where they're even less neat than classical game theory.

Hum, then I'm not sure I understand in what way classical game theory is neater here?

I think you're thinking about a realistic case (same algorithm, similar environment) rather than the perfect symmetry used in the argument. A communication channel is of no use there because you could just ask yourself what you would send, if you had one, and th

... (read more)
Identifiability Problem for Superrational Decision Theories

Well, if I understand the post correctly, you're saying that these two problems are fundamentally the same problem, and so rationality should be able to solve them both if it can solve one. I disagree with that, because from the perspective of distributed computing (which I'm used to), these two problems are exactly the two kinds of problems that are fundamentally distinct in a distributed setting: agreement and symmetry-breaking.

Communication won't make a difference if you're playing with a copy.

Actually it could. Basically all of distributed computing as... (read more)

1Bunthut3dNo. I think: As outlined in the last paragraph of the post. I want to convince people that TDT-like decision theories won't give a "neat" game theory, by giving an example where they're even less neat than classical game theory. I think you're thinking about a realistic case (same algorithm, similar environment) rather than the perfect symmetry used in the argument. A communication channel is of no use there because you could just ask yourself what you would send, if you had one, and then you know you would have just gotten that message from the copy as well. I'd be interested. I think even just more solved examples of the reasoning we want are useful currently.
Identifiability Problem for Superrational Decision Theories

I don't see how the two problems are the same. They are basically the agreement and symmetry breaking problems of distributed computing, and those two are not equivalent in all models. What you're saying is simply that in the no-communication model (where the same algorithm is used on two processes that can't communicate), these two problems are not equivalent. But they are asking for fundamentally different properties, and are not equivalent in many models that actually allow communication. 

1Bunthut3d"The same" in what sense? Are you saying that what I described in the context of game theory is not surprising, or outlining a way to explain it in retrospect? Communication won't make a difference if you're playing with a copy.
Phylactery Decision Theory

I feel like doing a better job of motivating why we should care about this specific problem might help get you more feedback.

If we want to alter a decision theory to learn its set of inputs and outputs, your proposal makes sense to me at first glance. But I'm not sure why I should particularly care, or why there is even a problem to begin with solution. The link you provide doesn't help me much after skimming it, and I (and I assume many people) almost never read something that requires me to read other posts without even a summary of the references. I mad... (read more)

Testing The Natural Abstraction Hypothesis: Project Intro

This project looks great! I especially like the focus on a more experimental kind of research, while still focused and informed on the specific concepts you want to investigate.

If you need some feedback on this work, don't hesitate to send me a message. ;)

Vanessa Kosoy's Shortform

Oh, right, that makes a lot of sense.

So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

1Vanessa Kosoy17dYes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is ϵ and your quantilization fraction is ϕ then the AI's probability of corruption is bounded by ϵϕ. Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn't specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
Review of "Fun with +12 OOMs of Compute"

About the update

You're right, that's what would happen with an update.

I think that the model I have in mind (although I hadn't explicitly thought about it until know), is something like a distribution over ways to reach TAI (capturing how probable it is that they're the first way to reach AGI), and each option comes with its own distribution (let's say over years). Obviously you can compress that into a single distribution over years, but then you lose the ability to do fine grained updating.

For example, I imagine that someone with relatively low probabili... (read more)

Review of "Fun with +12 OOMs of Compute"

Let me try to make an analogy with your argument.

Say we want to make X. What you're saying is "with 10^12 dollars, we could do it that way". Why on earth would I update at all whether it can be done with 10^6 dollars? If your scenario works with that amount, then you should have described it using only that much money. If it doesn't, then you're not providing evidence for the cheaper case.

Similarly here, if someone starts with a low credence on prosaic AGI, I can see how your arguments would make them put a bunch of probability mass close to +10^12 compute... (read more)

I'm not sure, but I think that's not how updating works? If you have a bunch of hypotheses (e.g. "It'll take 1 more OOM," "It'll take 2 more OOMs," etc.) and you learn that some of them are false or unlikely (only 10% chance of it taking more than 12" then you should redistribute the mass over all your remaining hypotheses, preserving their relative strengths. And yes I have the same intuition about analogical arguments too. For example, let's say you overhear me talking about a bridge being built near my h... (read more)

Review of "Fun with +12 OOMs of Compute"

You're welcome!

To put it another way: I don't actually believe we will get to +12 OOMs of compute, or anywhere close, anytime soon. Instead, I think that if we had +12 OOMs, we would very likely get TAI very quickly, and then I infer from that fact that the probability of getting TAI in the next 6 OOMs is higher than it would otherwise be (if I thought that +12 OOMs probably wasn't enough, then my credence in the next 6 OOMs would be correspondingly lower).

To some extent this reply also partly addresses the concerns you raised about memory and bandwidth--I

... (read more)
4Daniel Kokotajlo18dThanks! Well, I agree that I didn't really do anything in my post to say how the "within 12 OOMs" credence should be distributed. I just said: If you distribute it like Ajeya does except that it totals to 80% instead of 50%, you should have short timelines. There's a lot I could say about why I think within 6 OOMs should have significant probability mass (in fact, I think it should have about as much mass as the 7-12 OOM range). But for now I'll just say this: If you agree with me re Question Two, and put (say) 80%+ probability mass by +12 OOMs, but you also disagree with me about what the next 6 OOMs should look like and think that it is (say) only 20%, then that means your distribution must look something like this: Probability distribution over how many extra OOMs of compute we need given current ideasEDIT to explain: Each square on this graph is a point of probability mass. The far-left square represents 1% credence in the hypothesis "It'll take 1 more OOM." The second-from the left represents "It'll take 2 more OOM." The third-from-the-left is a 3% chance it'll take 3 more OOM, and so on. The red region is the region containing the 7-12 OOM hypotheses. Note that I'm trying to be as charitable as I can when drawing this! I only put 2% mass on the far right (representing "not even recapitulating evolution would work!"). This is what I think the probability distribution of someone who answered 80% to my Question Two should look like if they really really don't want to believe in short timelines. Even on this distribution, there's a 20% chance of 6 or fewer OOMs being enough given current ideas/algorithms/etc. (And hence, about a 20% chance of AGI/TAI/etc. by 2030, depending on how fast you think we'll scale up and how much algorithmic progress we'll make.) And even this distribution looks pretty silly to me. Like, why is it so much more confident that 11 OOMs will be how much we need, than 13 OOMs? Given our current state of ignorance about AI, I think the slo
Vanessa Kosoy's Shortform

However, it can do much better than that, by short-term quantilizing w.r.t. the user's reported success probability (with the user's policy serving as baseline). When quantilizing the short-term policy, we can upper bound the probability of corruption via the user's reported probability of short-term failure (which we assume to be low, i.e. we assume the malign AI is not imminent). This allows the AI to find parameters under which quantilization is guaranteed to improve things in expectation.

I don't understand what you mean here by quantilizing. The meanin... (read more)

2Vanessa Kosoy17dThe distribution is the user's policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report), in the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it, for example I did it [] for MDPs.
Generalizing Power to multi-agent games

Glad to be helpful!

I go into more detail in my answer to Alex, but what I want to say here is that I don't feel like you use the power-scarcity idea enough in the post itself. As you said, it's one of three final notes, and without any emphasis on it.

So while I agree that the power-scarcity is an important research question, it would be helpful IMO if this post put more emphasis on that connection.

Generalizing Power to multi-agent games

Thanks for the detailed reply!

I want to go a bit deeper into the fine points, but my general reaction is "I wanted that in the post". You make a pretty good case for a way to come around at this definition that makes it particularly exciting. On the other hand, I don't think that stating a definition and proving a single theorem that has the "obvious" quality (whether or not it is actually obvious, mind you) is that convincing.

The best way to describe my interpretation is that I feel that you two went for the "scientific paper" style, but the current state... (read more)

Generalizing Power to multi-agent games

Ok, that's fair. It's hard to know which notation is common knowledge, but I think that adding a sentence explaining this one will help readers who haven't studied game theory formally.

Maybe making all vector profiles bold (like for the action profile) would help to see at a glance the type of the parameter. If I had seen it was a strategy profile, I would have inferred immediately what it meant.

Generalizing Power to multi-agent games

Exciting to see new people tackling AI Alignment research questions! (and I'm already excited by what Alex is doing, so him having more people work in his kind of research feels like a good thing).

That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero (or constant sum) games means that the other players have less margin to get what they want. I don't feel that either the formalizatio... (read more)

4Daniel Kokotajlo17d"I disagree. The whole point of a zero-sum game (or even constant sum game) is that not everyone can win. So playing better means quite intuitively that the others can be less sure of accomplishing their own goals." IMO, the unintuitive and potentially problematic thing is not that in a zero-sum game playing better makes things worse for everybody else. That part is fine. The unintuitive and potentially problematic thing is that, according to this formalism, the total collective Power is greater the worse everybody plays. This seems adjacent to saying that everybody would be better off if everyone played poorly, which is true in some games (maybe) but definitely not true in zero-sum games. (Right? This isn't my area of expertise) EDIT: Currently I suppose what you'd say is that power =/= utility, and so even though we'd all have more power if we were all less competent, we wouldn't actually be better off. But perhaps a better way forward would be to define a new concept of "Useful power" or something like that, which equals your share of the total power in a zero-sum game. Then we could say that everyone getting less competent wouldn't result in everyone becoming more usefully-powerful, which seems like an important thing to be able to say. Ideally we could just redefine power that way instead of inventing a new concept of useful power, but maybe that would screw up some of your earlier theorems?
7Jacob Stavrianos22dThank you so much for the comments! I'm pretty new to the platform (and to EA research in general), so feedback is useful for getting a broader perspective on our work. To add to TurnTrout's comments about power-scarcity and the CCC [], I'd say that the broader vision of the multi-agent formulation is to establish a general notion of power-scarcity as a function of "similarity" between players' reward functions (I mention this in the post's final notes). In this paradigm, the constant-sum case is one limiting case of "general power-scarcity", which I see as the "big idea". As a simple example, general power-scarcity would provide a direct motivation for fearing robustly instrumental goals, since we'd have reason to believe an AI with goals orthogonal(ish) from human goals would be incentivized to compete with humanity for Power. We're planning to continue investigating multi-agent Power and power-scarcity, so hopefully we'll have a more fleshed-out notion of general power-scarcity in the months to come. Also, re: "as players' strategies improve, their collective Power tends to decrease", I think your intuition is correct? Upon reflection, the effect can be explained reasonably well by "improving your actions has no effect on your Power, but a negative effect on opponents' Power".

Thanks so much for your comment! I'm going to speak for myself here, and not for Jacob.

That being said, I'm a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero (or constant sum) games means that the other players have less margin to get what they want. I don't feel that either the formalization of power nor the theorem bring me any new insight, and so I have trouble getting interested. Maybe I'm just

... (read more)
4Alex Turner22dProbably going to reply to the rest later (and midco can as well, of course), but regarding: Using "σ−i" to mean "the strategy profile of everyone but playeri" is common notation; I remember it being used in 2-3 game theory textbooks I read, and you can see its prominence by consulting the Wikipedia page for Nash equilibrium []. Do I agree this is horrible notation? Meh. I don't know. But it's not a convention we pioneered in this work.
Against evolution as an analogy for how humans will create AGI

Just wanted to say that this comment made me add a lot of things on my reading list, so thanks for that (but I'm clearly not well-read enough to go into the discussion).

2gwern22dFurther reading: [] []
My research methodology

Thanks for writing this! I'm quite excited by learning more about your meta-agenda and your research process, and this reading stimulated me about my own research process.

But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

So you don't think that we could have a result of the sort "with these empirical facts, egregious misalignment is either certain or very hard to defend against, and so we should push towards not building AIs th... (read more)

6Paul Christiano23dYou get to iterate fast until you find an algorithm where it's hard to think of failure stories. And you get to work on toy cases until you find an algorithm that actually works in all the toy cases. I think we're a long way from meeting those bars, so that we'll get to iterate fast for a while. After we meet those bars, it's an open question how close we'd be to something that actually works. My suspicion is that we'd have the right basic shape of an algorithm (especially if we are good at thinking of possible failures). I feel like these distinctions aren't important until we get to an algorithm for which we can't think of a failure story (which feels a long way off). At that point the game kind of flips around, and we try to come up with a good story for why it's impossible to come up with a failure story. Maybe that gives you a strong security argument. If not, then you have to keep trying on one side or the other, though I think you should definitely be starting to prioritize applied work more.
Behavioral Sufficient Statistics for Goal-Directedness

To people reading this thread: we had a private conversation with John (faster and easier), which resulted in me agreeing with you.

The summary is that you can see the arguments made and constraints invoked as a set of equations, such that the adequate formalization is a solution of this set. But if the set has more than one solution (maybe a lot), then it's misleading to call that the solution.

So I've been working these last few days at arguing for the properties (generalization, explainability, efficiency) in such a way that the corresponding set of equations only has one solution.

2johnswentworth1moI'm working on writing it up properly, should have a post at some point. EDIT: it's up [] .
Epistemological Framing for AI Alignment Research

Thanks for the feedback!

Who? It would be helpful to have some links so I can go read what they said.

That was one of my big frustrations when writing this post: I only saw this topic pop up in personal conversation, not really in published posts. And so I didn't want to give names of people who just discussed that with me on a zoom call or in a chat. But I totally feel you -- I'm always annoyed by posts that pretend to answer a criticism without pointing to it.

On this more complicated (but IMO more accurate) model, your post is itself an attempt to make AI

... (read more)
Suggestions of posts on the AF to review

If we do only one, which one do you think matters the most?

2Daniel Kokotajlo1moI'm more interested in feedback on the +12 OOMs one because it's more decision-relevant. It's more of a fuzzy thing, not crunchy logic like the first one I recommended, and therefore less suitable for your purposes (or so I thought when I first answered your question, now I am not sure)
Behavioral Sufficient Statistics for Goal-Directedness

Thanks for commenting on your reaction to this post!

That being said, I'm a bit confused by your comment. You seem to write off approaches which attempt to provide a computational model of mind, but my approach is literally the opposite: looking only at the behavior (but all the behavior), extract relevant statistics to study questions related to goal-directedness.

Can you maybe give more details?

Behavioral Sufficient Statistics for Goal-Directedness

Thanks for the spot-on pushback!

I do understand what a sufficient statistics is -- which probably means I'm even more guilty of what you're accusing me of. And I agree completely that I don't defend correctly that the statistics I provide are really sufficient.

If I try to explain myself, what I want to say in this post is probably something like

  • Knowing these intuitive properties about  and the goals seems sufficient to express and address basically any question we have related to goals and goal-directedness. (in a very vague intuitive way that I
... (read more)

I still feel like you're missing something important here.

For instance... in the explainability factor, you measure "the average deviation of  from the actions favored by the action-value function  of ", using the formula


. But why this particular formula? Why not take the log of  first, or use  in the denominator? Indeed, there's a strong argument to be made this formula is a bad choice: the value function  is... (read more)

Towards a Mechanistic Understanding of Goal-Directedness

Nice post! Surprisingly, I'm interested in the topic. ^^

Funny too that you focus on an idea I am writing a post about (albeit from a different angle). I think I broadly agree with your conjectures, for sufficient competence and generalization at least.

Most discussion about goal-directed behavior has focused on a behavioral understanding, which can roughly be described as using the intentional stance to predict behavior.

I'm not sure I agree with that. Our lit review shows that there are both behavioral and mechanistic approaches (Richard's goal-directed age... (read more)

Book review: "A Thousand Brains" by Jeff Hawkins

Thanks for the nice review! It's great to have the reading of someone who understand enough the current state of neuroscience to point to aspects of the book at odds with neuroscience consensus. My big takeaway is that I should look a bit more into neuroscience based approaches to AGI, because they might be important, and require different alignment approaches.

On a more rhetorical level, I'm impressed by how you manage to make me ask a question (okay, but what evidence is there for this uniformity of the neocortex) and then points to some previous work you... (read more)

1Steve Byrnes1moThanks! My opinion is: I think if you want to figure out the gory details of the neocortical algorithm, and you want to pick ten authors to read, then Jeff Hawkins should be one of them. If you're only going to pick one author, I'd go with Dileep George. I'm happy to chat more offline. Well there's an inside-view argument that it's human-legible because "It basically works like, blah blah blah, and that algorithm is human-legible because I'm a human and I just legibled it." I guess that's what Jeff would say. (Me too.) Then there's an outside-view argument that goes "most of the action is happening within a "cortical mini-column", which consists of about 100 neurons mostly connected to each other. Are you really going to tell me that 100 neurons implements an algorithm that is so complicated that it's forever beyond human comprehension? Then again, BB(5) [] is still unknown, so circuits with a small number of components can be quite complicated. So I guess that's not all that compelling an argument on its own. I think a better outside-view argument is that if one algorithm is really going to learn how to parse visual scenes, put on a shoe, and design a rocket engine ... then such an algorithm really has to work by simple, general principles —things like “if you’ve seen something, it’s likely that you’ll see it again”, and “things are often composed of other things”, and "things tend to be localized in time and space", and TD learning, etc. Also, GPT-3 shows that human-legible learning algorithms are at least up to the task of learning language syntax and semantics, plus learning quite a bit of knowledge about how the world works. For common sense, my take is that it's plausible that a neocortex-like AGI will wind up with some of the same concepts as humans, in certain areas and under certain conditions. That's a hard thing to guarantee a priori, and therefore I'm not quite sure what that buys you. For morals, there is
The case for aligning narrowly superhuman models

Thanks for the very in-depth case you're making! I especially liked the parts about the objections, and your take on some AI Alignment researcher's opinions of this proposal.

Personally, I'm enthusiastic about it with caveats expanded below. If I try to interpret your proposal according to the lines of my recent epistemological framing of AI Alignment research, you're pushing for a specific kind of work on the Solving part of the field, where you assume a definition of the terms of the problem (what AIs will we build and what do we want). My caveats can be ... (read more)

The case for aligning narrowly superhuman models

Well, Paul's original post presents HCH as the specification of a human enlightened judgement.

For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.

And if we follow the links to Paul's previous post about this concept, he does describe his ideal implementation of considered judgement (what will become HCH) using the intuition of thinking for decent amount of time.

To define my considered judgment about a question Q, suppose I am told Q and spend

... (read more)
Full-time AGI Safety!

Welcome in the (for now) small family of people funded by Beth! Your research looks pretty cool, and I'm quite excited when seeing how different it is from mine. So Beth is funding quite a wide range of researchers, which is what makes most sense to me. :)

Behavioral Sufficient Statistics for Goal-Directedness

Thanks for telling me! I've changed that.

It might be because I copied and pasted the first sentence to each subsection.

Behavioral Sufficient Statistics for Goal-Directedness

Thanks for taking the time to give feedback!

Technical comment on the above post

So if I understand this correctly. then  is a metric of goal-directedness. However, I am somewhat puzzled because  only measures directedness to the single goal .

But to get close to the concept of goal-directedness introduced by Rohin, don't you need then do an operation over all possible values of ?

That's not what I had in mind, but it's probably on me for not explaining it clearly enough.

  • First, for a fixed goal , the whole focus
... (read more)
1Koen Holtman1moI was not trying to summarize the entire sequence, only summarizing my impressions of some things he said in the first post of the sequence. Those impressions are that Rohin was developing his intuitive notion of goal-directedness in a very different direction than you have been doing, given the examples he provides. Which would be fine, but it does lead to questions of how much your approach differs. My gut feeling is that the difference in directions might be much larger than can be expressed by the mere adjective 'behavioral'. On a more technical note, if your goal is to search for metrics related to "less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shutdown", then the metrics that have been most productive in my opinion are, first, 'indifference', in the meaning where it is synonymous with 'not having a control incentive []'. Other very relevant metrics are 'myopia' or 'short planning horizons' (see for example here [] ) and 'power' (see my discussion in the post Creating AGI Safety Interlocks []). (My paper counterfactual planning [] has a definition of 'indifference' which I designed to be more accessible than the `not having a control incentive []' definition, i.e. more accessible for people not familiar with Pearl's math.) None of the above metrics look very much like 'non-goal-directedness' to me, with the possible exception of myopia.
adamShimi's Shortform

Thanks for the idea! I agree that it probably helps, and it solves my issue with the state of knowledge of the other.

That being said, I don't feel like this solves my main problem: it still feel to me as pushing too hard. Here the reason is that I post on a small venue (rarely more than a few posts per day) that I know the people I'm asking feedback too read regularly. So if I send them such a message at the moment I publish, it feels a bit like I'm saying that they wouldn't read and comment it without that, which is a bit of a problem.

(I'm interested to k... (read more)

adamShimi's Shortform

curious for more detail on “what feels wrong about explicitly asking individuals for feedback after posting on AF” similar to how you might ask for feedback on a gDoc?

My main reason is steve's first point:

  1. Maybe there's a sense in which everyone has already implicitly declared that they don't want to give feedback, because they could have if they wanted to, so it feels like more of an imposition.

Asking someone for feedback on work posted somewhere I know they read feels like I'm whining about not having feedback (and maybe whining about them not giving me f... (read more)

4Vaughn Papenhausen2moCould this be solved just by posting your work and then immediately sharing the link with people you specifically want feedback from? That way there's no expectation that they would have already seen it. (Granted, this is slightly different from a gdoc in that you can share a gdoc with one person, get their feedback, then share with another person, while what I suggested requires asking everyone you want feedback from all at once.)
adamShimi's Shortform

Right now, the incentives to get useful feedback on my research push me to go into the opposite policy that I would like: publish on the AF as late as I can allow.

Ideally, I would want to use the AF as my main source of feedback, as it's public, is read by more researchers that I know personally, and I feel that publishing there helps the field grow.

But I'm forced to admit that publishing anything on the AF means I can't really send it to people anymore (because the ones I ask for feedback read the AF, so that's feels wrong socially), and yet I don't get a... (read more)

I think there are a number for features LW could build to improve this situation, but first curious for more detail on “what feels wrong about explicitly asking individuals for feedback after posting on AF” similar to how you might ask for feedback on a gDoc?

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

But you didn't actually give an argument for that -- you simply stated it. As a matter of fact, I disagree: it seems really easy for an AGI to misunderstand what I mean when I use english words. To go back to the "fusion power generator", maybe it has a very deep model of such generators that abstracts away most of the c... (read more)

Formal Solution to the Inner Alignment Problem

Thanks for sharing this work!

Here's my short summary after reading the slides and scanning the paper.

Because human demonstrator are safe (in the sense of almost never doing catastrophic actions), a model that imitates closely enough the demonstrator should be safe. The algorithm in this paper does that by keeping multiple models of the demonstrator, sampling the top models according to a parameter, and following what the sampled model does (or querying the demonstrator if the sample is "empty"). The probability that this algorithm does a very unlikely acti

... (read more)
1Vanessa Kosoy2moI don't think this is a lethal problem. The setting is not one-shot, it's imitation over some duration of time. IDA just increases the effective duration of time, so you only need to tune how cautious the learning is (which I think is controlled by α in this work) accordingly: there is a cost, but it's bounded. You also need to deal with non-realizability (after enough amplifications the system is too complex for exact simulation, even if it wasn't to begin with), but this should be doable using infra-Bayesianism (I already have some notion how that would work). Another problem with imitation-based IDA is that external unaligned AI might leak into the system either from the future or from counterfactual scenarios in which such an AI is instantiated. This is not an issue with amplifying by parallelism (like in the presentation) but at the cost of requiring parallelizability.
1Charlie Steiner2moI think this looks fine for IDA - the two problems remain the practical one of implementing Bayesian reasoning in a complicated world, and the philosophical one that probably IDA on human imitations doesn't work because humans have bad safety properties.
2michaelcohen2moEdited to clarify. Thank you for this comment. 100% agree. Intelligent agency can be broken into intelligent prediction and intelligent planning. This work introduces a method for intelligent prediction that avoids an inner alignment failure. The original concern about inner alignment was that an idealized prediction algorithm (Bayesian reasoning) could be commandeered by mesa-optimizers. Idealized planning, on the other hand is an expectimax tree, and I don't think anyone has claimed mesa-optimizers could be introduced by a perfect planner. I'm not sure what it would even mean. There is nothing internal in the expectimax algorithm that could make the output something other than what the prediction algorithm would agree is the best plan. Expectimax, by definition, produces a policy perfectly aligned with the "goals" of the prediction algorithm. Tl;dr: I think that in theory, the inner alignment problem regards prediction, not planning, so that's the place to test solutions. If you want to see the inner alignment problem neutralized in the full RL setup, you can see we use a similar approach in this [] agent's prediction subroutine. So you can maybe say that work solved the inner alignment problem. But we didn't prove finite error bounds the way we have here, and I think RL setup obscures the extent to which rogue predictive models are dismissed, so it's a little harder to see than it is here. Not exactly. No models are ever sampled. The top models are collected, and they all can contribute to the estimated probabilities of actions. Then an action is sampled according to those probabilities, which sum to less than one, and queries the demonstrator if it comes up empty. Yes, that's right. The bounds should chain together, I think, but they would definitely grow.
Suggestions of posts on the AF to review

Thanks for the suggestion! It's great to have some methodological posts!

We'll consider it. :)

Suggestions of posts on the AF to review

Thanks for the suggestion!

I didn't know about this post. We'll consider it. :)

Suggestions of posts on the AF to review

Thanks for the suggestion!

We want to go through the different research agendas (and I already knew about yours), as they give different views/paradigms on AI Alignment. Yet I'm not sure how relevant a review of such posts are. In a sense, the "reviewable" part is the actual research that underlies the agenda, right?

1Joe_Collman2moI don't see a good reason to exclude agenda-style posts, but I do think it'd be important to treat them differently from more here-is-a-specific-technical-result posts. Broadly, we'd want to be improving the top-level collective AI alignment research 'algorithm'. With that in mind, I don't see an area where more feedback/clarification/critique of some kind wouldn't be helpful. The questions seem to be: What form should feedback/review... take in a given context? Where is it most efficient to focus our efforts? Productive feedback/clarification on high-level agendas seems potentially quite efficient. My worry would be to avoid excessive selection pressure towards paths that are clear and simply justified. However, where an agenda does use specific assumptions and arguments to motivate its direction, early 'review' seems useful.
Suggestions of posts on the AF to review

I was indeed expecting you to suggest one of your post. But that's one of the valid reasons I listed, and I didn't know about this one, so it's great!

We'll consider it. :)

1Daniel Kokotajlo1moInsofar as you want to do others of mine, my top recommendation would be this one [] since it got less feedback than I expected and is my most important timelines-related post of all time IMO.
Suggestions of posts on the AF to review

But sometimes, you want to be like "come at me bro". You've got something that you're pretty highly confident is right, and you want people to really try to shoot it down (partly as a social mechanism to demonstrate that the idea is in fact as solid and useful as you think it is). This isn't something I'd want to be the default kind of feedback, but I'd like for authors to be able to say "come at me bro" when they're ready for it, and I'd like for posts which survive such a review to be perceived as more epistemically-solid/useful.

Yeah, when I think about ... (read more)

Tournesol, YouTube and AI Risk

If the main source of revenue is people buying stuff after seeing an ad on YouTube, then I agree with your point in the middle of the comment, that it seems hardly possible for the revenue to go 1.5 OOMs more by only 2OOMs on model size. I bet that there would be a big discontinuity here, where you need massive investment to actually see any significant improvement.

On the other hand, if the main source of revenue is money payed for the number of views of ads, then I believe a better model could improve that relatively smoothly. In part because just giving people interesting stuff to see makes them look at more ads.

1Daniel Kokotajlo2moIsn't there a close connection between money payed for number of views of ads and people buying stuff after seeing an ad on YouTub? I thought that the situation is something like this: People see ads and buy stuff --> Data is collected on how much extra money the ad brought in --> youtube charges advertisers accordingly. The only way for youtube to charge advertisers significantly more is first for people to buy significantly more stuff as a result of seeing ads.
Tournesol, YouTube and AI Risk

I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also tends to polarize society as a whole and gets them to spend more time on Youtube on a macro

... (read more)
1John Maxwell2moMakes sense. I think it might be useful to distinguish between being aware of oneself in a literal sense, and the term "self-aware" as it is used colloquially / the connotations the term sneaks in. Some animals, if put in front of a mirror, will understand that there is some kind of moving animalish thing in front of them. The ones that pass the mirror test are the ones that realize that moving animalish thing is them. There is a lot of content on YouTube about YouTube, so the system will likely become aware of itself in a literal sense. That's not the same as our colloquial notion of "self-awareness". IMO, it'd be useful to understand the circumstances under which the first one leads to the second one. My guess is that it works something like this. In order to survive and reproduce, evolution has endowed most animals with an inborn sense of self, to achieve self-preservation. (This sense of self isn't necessary for cognition--if you trip on psychedelics and experience ego death, your brain can still think. Occasionally people will hurt themselves in this state since their self-preservation instincts aren't functioning as normal.) Colloquial "self-awareness" occurs when an animal looking in the mirror realizes that the thing in the mirror and its inborn sense of self are actually the same thing. Similar to Benjamin Franklin realizing that lightning and electricity are actually the same thing. If this story is correct, we need not worry much about the average ML system developing "self-awareness" in the colloquial sense, since we aren't planning to endow it with an inborn sense of self. That doesn't necessarily mean I think Predict-O-Matic is totally safe. See this post I wrote [] for instance.
Tournesol, YouTube and AI Risk

Thanks for the feedback.

Your argument as I understand it is: the economic incentive to make the model bigger might disappear if the cost of computing the recommendation outweighs the gain of  having "better" recommendations.

I think this is definitely relevant, but I don't feel like I have enough information to decide if the argument holds or not. Notably, it goes back to the parameter that we discussed in a call: whether increasing the model size/compute/dataset size improves the performance for the real world task until AGI is reached, or whether the... (read more)

2Daniel Kokotajlo2moYes, we care about what YouTube makes, not what youtubers make. My brief google didn't turn up anything about what YouTube makes but I assume it's not more than a few times greater than what youtubers make... but I might be wrong about that! I agree we don't have enough information to decide if the argument holds or not. I think that even if bigger models are always qualitatively better, the issue is whether the monetary returns outweigh the increasing costs. I suspect they won't, at least in the case of the youtube algo. Here's my argument I guess, in more detail: 1. Suppose that currently the cost of compute for the algo is within an OOM of the revenue generated by it. (Seems plausible to me but I don't actually know) 2. Then to profitably scale up the algo by, say, 2 ooms, the money generated by the algo would have to go up by, like, 1.5 ooms. 3. But it's implausible that a 2-oom increase in size of algo would result in that much increase in revenue. Like, yeah, the ads will be better targeted, people will be spending more, etc. But 1.5 OOMs more? When I imagine a world where Youtube viewers spend 10x more money as a result of youtube ads, I imagine those ads being so incredibly appealing that people go to youtube just to see the ads because they are so relevant and interesting. And I feel like that's possible, but it's implausible that making the model 2 ooms bigger would yield that result. ... you know now that I write it out, I'm no longer so sure! GPT-3 was a lot better than GPT-2, and it was 2 OOMs bigger. Maybe youtube really could make 1.5 OOMs more revenue by making their model 2 OOMs bigger. And then maybe they could increase revenue even further by making it bigger still, etc. on up to AGI.
Epistemology of HCH

I think it's a good summary. Thanks!

Distinguishing claims about training vs deployment

This looked exciting when you mentioned it, and it doesn't disappoint.

To check that I get it, here is my own summary:

Because ML looks like the most promising approach to AGI at the moment, we should adapt and/or instantiate the classical arguments for AI risks to a ML context. The main differences are the separation of a training and a deployment phase and the form taken by the objective function (mix of human and automated feedback from data instead of hardcoded function).

  • (Orthogonality thesis) Even if any combination of goal and intelligence can exist in
... (read more)
2Richard Ngo2moThanks for the feedback! Some responses: I don't really know what "model-based" means in the context of AGI. Any sufficiently intelligent system will model the world somehow, even if it's not trained in a way that distinguishes between a "model" and a "policy". (E.g. humans weren't.) I'll steal Ben Garfinkel's response to this. Suppose I said that "almost all possible ways you might put together a car don't have a steering wheel". Even if this is true, it tells us very little about what the cars we actually build might look like, because the process of building things picks out a small subset of all possibilities. (Also, note that the instrumental convergence thesis doesn't say "almost all goals", just a "wide range" of them. Edit: oops, this was wrong; although the statement of the thesis given by Bostrom doesn't say that, he says "almost all" in the previous paragrah.)
A Critique of Non-Obstruction

I'm not Alex, but here's my two cents.

I think your point 2 is far less obvious too me, especially without a clear-cut answer to the correctness of the strategy-stealing assumption. Because I agree that we might optimize the wrong goals, but I don't see why we would optimize some necessarily more than others. So each goal in S might have a spike (for a natural set of goals that are all similarly difficult to specify) and the resulting landscape would be flat.

That being said, I think you're pointing towards an interesting fact about the original post: in it,... (read more)

1Joe_Collman2moOh it's possible to add up a load of spikes [ETA suboptimal optimisations], many of which hit the wrong target, but miraculously cancel out to produce a flat landscape [ETA "spikes" was just wrong; what I mean here is that you could e.g. optimise for A, accidentally hit B, and only get 70% of the ideal value for A... and counterfactually optimise for B, accidentally hit C, and only get 70% of the ideal value for B... and counterfactually aim for C, hit D etc. etc. so things end up miraculously flat; this seems silly because there's no reason to expect all misses to be of similar 'magnitude', or to have the same impact on value]. It's just hugely unlikely. To expect this would seem silly. [ETA My point is that in practice we'll make mistakes, that the kind/number/severity of our mistakes will be P dependent, and that a pol which assumes away such mistakes isn't useful (at least I don't see how it'd be useful). Throughout I'm assuming pol(P) isn't near-optimal for all P - see my response above [] for details] For non-spikiness, you don't just need a world where we never use powerful AI: you need a world where powerful [optimisers for some goal in S] of any kind don't occur. It's not clear to me how you cleanly/coherently define such a world. The counterfactual where "this system is off" may not be easy to calculate, but it's conceptually simple. The counterfactual where "no powerful optimiser for any P in S ever exists" is not. In particular, it's far from clear that iterated improvements of biological humans with increased connectivity don't get you an extremely powerful optimiser - which could (perhaps mistakenly) optimise for something spikey. Ruling everything like this out doesn't seem to land you anywhere natural or cleanly defined. Then you have the problem of continuing non-obstruction once many other AIs already exist: You build a non-obstructiv
Counterfactual control incentives

This post gives two distinct (but related) "pieces of knowledge".

  • A counterexample to the "counterfactual incentive algorithm" described in section 5.2 of The Incentives that Shape Behaviour. Moreover, this failure seems to generalize to any causal diagram where all paths from the decision node to the utility node contain a control incentive, and where the controlled variables have mutual information that forbid applying the counterfactual only to some.
  • A concrete failure mode for the task of ensuring that a causal diagram fits a concrete situation: arrows w
... (read more)
1Koen Holtman2moI think you are using some mental model where 'paths with nodes' vs. 'paths without nodes' produces a real-world difference in outcomes. This is the wrong model to use when analysing CIDs. A path in a diagram -->[node]--> can always be replaced by a single arrow --> to produce a model that makes equivalent predictions, and the opposite operation is also possible. So the number of nodes on a path better read as a choice about levels of abstraction in the model, not as something that tells us anything about the real world. The comment I just posted with the alternative development of the game model may be useful for you here, it offers a more specific illustration of adding nodes.
Against the Backward Approach to Goal-Directedness

Thanks for both your careful response and the pointer to Conceptual Engineering!

I believe I am usually thinking in terms of defining properties for their use, but it's important to keep that in mind. The post on Conceptual Engineering lead me to this follow up interview, which contains a great formulation of my position:

Livengood: Yes. The best example I can give is work by Joseph Halpern, a computer scientist at Cornell. He's got a couple really interesting books, one on knowledge one on causation, and big parts of what he's doing are informed by the long

... (read more)
Literature Review on Goal-Directedness

I think the disagreement left is whether we should first find a definition of goal-directedness then study how it appears through training (my position), or if we should instead define goal-directedness according to the kind of training processes that generate similar properties and risks (what I take to be your position).

Does that make sense to you?

2Richard Ngo3moKinda, but I think both of these approaches are incomplete. In practice finding a definition and studying examples of it need to be interwoven, and you'll have a gradual process where you start with a tentative definition, identify examples and counterexamples, adjust the definition, and so on. And insofar as our examples should focus on things which are actually possible to build (rather than weird thought experiments like blockhead or the chinese room) then it seems like what I'm proposing has aspects of both of the approaches you suggest. My guess is that it's more productive to continue discussing this on my response to your other post [] , where I make this argument in a more comprehensive way.
Literature Review on Goal-Directedness

Thanks for the inclusion in the newsletter and the opinion! (And sorry for taking so long to answer)

This literature review on goal-directedness identifies five different properties that should be true for a system to be described as goal-directed:

It's implicit, but I think it should be made explicit that the properties/tests are what we extract from the literature, not what we say is fundamental. More specifically, we don't say they should be true per se, we just extract and articulate them to "force" a discussion of them when defining goal-directedness.


... (read more)
6Rohin Shah3moChanged to "This post extracts five different concepts that have been identified in the literature as properties of goal-directed systems:". Deleted that sentence.
Load More