All of DanielFilan's Comments + Replies

Reward is not the optimization target

Here is an example of a story I wrote (that is somewhat edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. Just to be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views, just a plausible-to-TurnTrout story for how agents won't end up wanting to game human approval:

  • Agent gets trained on a reward
... (read more)
Reward is not the optimization target

dopamine or RPE or that-which-gets-discounted-and-summed-to-produce-the-return

Those are three pretty different things - the first is a chemical, the second I guess stands for 'reward prediction error', and the third is a mathematical quantity! Like, you also can't talk about the expected sum of dopamine, because dopamine is a chemical, not a number!

Here's how I interpret the paper: stuff in the world is associated with 'rewards', which are real numbers that represent how good the stuff is. Then the 'return' of some period of time is the discounted sum o... (read more)
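(For concreteness, the standard textbook definition of return that I have in mind here, not a quote from the paper, is

    G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1,

i.e. the discounted sum of the future reward numbers along a trajectory.)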

Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

(see also this shortform, which makes a rudimentary version of the arguments in the first two subsections)

Reward is not the optimization target

Here's my general view on this topic:

  • Agents are reinforced by some reward function.
  • They then get more likely to do stuff that the reward function rewards.
  • This process, iterated a bunch, produces agents that are 'on-distribution optimal'.
  • In particular, in states that are 'easily reached' during training, the agent will do things that approximately maximize reward.
  • Some states aren't 'easily reached', e.g. states where there's a valid bitcoin blockchain of length 20,000,000 (current length as I write is 748,728), or states where you have messed around w
... (read more)
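As a toy sketch of the picture in the bullets above (my own illustration, with a made-up two-action reward function, not something from the original discussion):

    import random

    actions = ["a", "b"]
    reward = {"a": 1.0, "b": 0.0}   # assumed toy reward function
    prefs = {"a": 0.0, "b": 0.0}    # action preferences (softmax logits)

    def sample_action():
        weights = [2.718 ** prefs[act] for act in actions]
        return random.choices(actions, weights=weights)[0]

    for _ in range(1000):
        act = sample_action()
        # Reinforce: nudge the preference for the taken action in proportion to its reward.
        prefs[act] += 0.01 * reward[act]

    print(prefs)  # "a" ends up strongly preferred on the visited (trivial) distribution

The point being illustrated is just that the update rule pushes the agent toward reward-maximizing behaviour on the states it actually visits, which says nothing about states it rarely or never reaches.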
Reward is not the optimization target

I'm not saying "These statements can make sense", I'm saying they do make sense and are correct under their most plain reading.

Re: a possible goal of animals being to optimize the expected sum of future rewards, in the cited paper "rewards" appears to refer to stuff like eating tasty food or mating, where it's assumed the animal can trade those off against each other consistently:

Decision-making environments are characterized by a few key concepts: a state space..., a set of actions..., and affectively important outcomes (finding cheese, obtaining water,

... (read more)
2Alex Turner3d
Yup, strong disagree with that. If that were true, that would definitely be a good counterpoint and mean I misread it. If so, I'd retract my original complaint with that passage. But I'm not convinced that it's true. The previous paragraph just describes finding cheese as an "affectively important outcome." Then, later, "outcomes are assumed to have numerical... utilities." So they're talking about utility now, OK. But then they talk about rewards. Is this utility? It's not outcomes (like finding cheese), because you can't take the expected sum of future finding-cheeses -- type error! When I ctrl+F "rewards" and scroll through, it sure seems like they're talking about dopamine or RPE or that-which-gets-discounted-and-summed-to-produce-the-return, which lines up with my interpretation.
Reward is not the optimization target

I think the quotes cited under "The field of RL thinks reward=optimization target" are all correct. One by one:

The agent's job is to find a policy… that maximizes some long-run measure of reinforcement.

Yes, that is the agent's job in RL, in the sense that if the training algorithm didn't do that we'd get another training algorithm (if we thought it was feasible for another algorithm to maximize reward). Basically, the field of RL uses a separation of concerns, where they design a reward function to incentivize good behaviour, and the agent maximizes th... (read more)

2Alex Turner11d
I perceive you as saying "These statements can make sense." If so, the point isn't that they can't be viewed as correct in some sense—that no one sane could possibly emit such statements. The point is that these quotes are indicative of misunderstanding the points of this essay: if someone says a point as quoted, that's unfavorable evidence on this question. I wasn't implying they're impossible; I was implying that this is somewhat misguided. Animals learn to achieve goals like "optimizing... the expected sum of future rewards"? That's exactly what I'm arguing against as improbable.
Law-Following AI 4: Don't Rely on Vicarious Liability

It looks like this is the 4th post in a sequence - any chance you can link to the earlier posts? (Or perhaps use LW's sequence feature)

3Cullen_OKeefe14d
Thanks, done. LW makes it harder than EAF to make sequences, so I didn't realize any community member could do so.
Coherence arguments do not entail goal-directed behavior

I have no idea why I responded 'low' to 2. Does anybody think that's reasonable and fits in with what I wrote here, or did I just mean high?

2Rohin Shah2mo
"random utility-maximizer" is pretty ambiguous; if you imagine the space of all possible utility functions over action-observation histories and you imagine a uniform distribution over them (suppose they're finite, so this is doable), then the answer is low [https://www.lesswrong.com/posts/hzeLSQ9nwDkPc4KNt/seeking-power-is-convergently-instrumental-in-a-broad-class#Instrumental_Convergence_Disappears_For_Utility_Functions_Over_Action_Observation_Histories] . Heh, looking at my comment [https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior?commentId=fNrPnZdkTqL3LH4ez] it turns out I said roughly the same thing 3 years ago.
Project Intro: Selection Theorems for Modularity

The method that is normally used for this in the biological literature (including the Kashtan & Alon paper mentioned above), and in papers by e.g. CHAI dealing with identifying modularity in deep modern networks, is taken from graph theory. It involves the measure Q, which is defined as follows:

FWIW I do not use this measure in my papers, but instead use a different graph-theoretic measure. (I also get the sense that Q is more of a network theory thing than a graph theory thing)
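(For readers without the original post open: the measure being referred to is presumably the standard Newman–Girvan modularity from network science; the following is the textbook statement, not a quote from the post.

    Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

Here A is the adjacency matrix, k_i is the degree of node i, m is the number of edges, and \delta(c_i, c_j) is 1 when nodes i and j are assigned to the same module and 0 otherwise.)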

rohinmshah's Shortform

I think it's more concerning in cases where you're getting all of your info from goal-oriented behaviour and solving the inverse planning problem

It's also not super clear what you algorithmically do instead - words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.

2Rohin Shah4mo
That's what future research is for!
rohinmshah's Shortform

One objection: an assistive agent doesn’t let you turn it off; how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide; that just seems obviously bad for the toddler.

I think this is way more worrying in the case where you're implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower.

Though [the claim that slightly wrong observation

... (read more)
2Rohin Shah4mo
I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point). I agree Boltzmann rationality (over the action space of, say, "muscle movements") is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info including "things that humans say", and the human can just tell you that hyperslavery is really bad. Obviously you can't trust everything that humans say, but it seems plausible that if we spent a bunch of time figuring out a good observation model that would then lead to okay outcomes. (Ideally you'd figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of "getting a good observation model" while you still have the ability to turn off the model. It's hard to say exactly what that would look like since I don't have a great sense of how you get AGI capabilities under the non-ML story.)
3DanielFilan4mo
It's also not super clear what you algorithmically do instead - words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document [https://www.lesswrong.com/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge] .
Job Offering: Help Communicate Infrabayesianism

A future episode might include a brief distillation of that episode ;)

2Steve Byrnes5mo
There was one paragraph from the podcast that I found especially enlightening—I excerpted it here (Section 3.2.3) [https://www.lesswrong.com/posts/SzrmsbkqydpZyPuEh/my-take-on-vanessa-kosoy-s-take-on-agi-safety#3_2_4_Infra_Bayesianism] .
Summary of the Acausal Attack Issue for AIXI

But wait, there can only be so many low-complexity universes, and if they're launching successful attacks, said attacks would be distributed amongst a far far far larger population of more-complex universes.

Can't you just condition on the input stream to affect all the more-complex universes, rather than targetting a single universe? Specifically: look at the input channel, run basically-Solomonoff-induction yourself, then figure out which universe you're being fed inputs of and pick outputs appropriately. You can't be incredibly powerful this way, sinc... (read more)

Introduction To The Infra-Bayesianism Sequence

That said, this sequence is tricky to understand and I'm bad at it! I look forward to brave souls helping to digest it for the community at large.

I interviewed Vanessa here in an attempt to make this more digestible: it hopefully acts as context for the sequence, rather than a replacement for reading it.

Shulman and Yudkowsky on AI progress

One thing Carl notes is that a variety of areas where AI could contribute a lot to the economy are currently pretty unregulated. But I think there's a not-crazy story where, once you're within striking range of making an area way more efficient with computers, the regulation hits. I'm not sure how to evaluate how right that is (e.g. I don't think it's the story of housing regulation), but I just wanted it said.

Soares, Tallinn, and Yudkowsky discuss AGI cognition

Anders Sandberg could tell us what fraction of the reachable universe is being lost per minute, which would tell us how much more surety it would need to expect to gain by waiting another minute before acting.

From Ord (2021):

Each year the affectable universe will only shrink in volume by about one part in 5 billion.

So, since there are about 5e5 minutes in a year, you lose about 1 part in 5e5 * 5e9 ≈ 3e15 every minute.
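As a sanity check on that arithmetic, a quick back-of-the-envelope script (my own, assuming Ord's one-part-in-5-billion-per-year figure):

    minutes_per_year = 365.25 * 24 * 60   # about 5.3e5
    yearly_loss = 1 / 5e9                 # Ord (2021): one part in 5 billion per year
    per_minute_loss = yearly_loss / minutes_per_year
    print(f"about 1 part in {1 / per_minute_loss:.1e} per minute")  # about 1 part in 2.6e15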

5Rob Bensinger7mo
I think the intended visualization is simply that you create a very small self-replicating machine, and have it replicate enough times in the atmosphere that every human-sized organism on the planet will on average contain many copies of it. One of my co-workers at MIRI comments: Regarding the idea of diamondoid nanotechnology, Drexler's Nanosystems and http://www.molecularassembler.com/Nanofactory/index.htm talk about the general concept.
Soares, Tallinn, and Yudkowsky discuss AGI cognition

Then, in my lower-bound concretely-visualized strategy for how I would do it, the AI either proliferates or activates already-proliferated tiny diamondoid bacteria and everybody immediately falls over dead during the same 1-second period

Dumb question: how do you get some substance into every human's body within the same 1 second period? Aren't a bunch of people e.g. in the middle of some national park, away from convenient air vents? Is the substance somehow everywhere in the atmosphere all at once?

(I wouldn't normally ask these sorts of questions since... (read more)

3DanielFilan7mo
Also: what is a diamondoid bacterium?
What exactly is GPT-3's base objective?

Expected return in a particular environment/distribution? Or not? If not, then you may be in a deployment context where you aren't updating the weights anymore and so there is no expected return

I think you might be misunderstanding this? My take is that "return" is just the discounted sum of future rewards, which you can (in an idealized setting) think of as a mathematical function of the future trajectory of the system. So it's still well-defined even when you aren't updating weights.
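As a minimal sketch of what I mean (my own illustration, with a made-up reward sequence), return is just a function of the trajectory's rewards, and nothing in it depends on whether weights are still being updated:

    def discounted_return(rewards, gamma=0.99):
        """Discounted sum of a trajectory's rewards: sum_t gamma**t * r_t."""
        return sum(gamma**t * r for t, r in enumerate(rewards))

    print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99**2 * 2 = 2.9602...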

What exactly is GPT-3's base objective?

I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it's talking about. The problem is just that it's not general enough to handle all possible ways of training a model using machine learning.

GPT-3 was trained using self-supervised learning, which I would have thought was a pretty standard way of training a model using machine learning. What training scenarios do you think the Risks from Learned Optimization terminology can handle, and what's the difference between those and the way GPT-3 was trained?

4Evan Hubinger9mo
First, the problem is only with outer/inner alignment—the concept of unintended mesa-optimization is still quite relevant and works just fine. Second, the problems with applying Risks from Learned Optimization terminology to GPT-3 have nothing to do with the training scenario, the fact that you're doing unsupervised learning, etc. The place where I think you run into problems is that, for cases where mesa-optimization is intended in GPT-style training setups, inner alignment in the Risks from Learned Optimization sense is usually not the goal. Most of the optimism about large language models is hoping that they'll learn to generalize in particular ways that are better than just learning to optimize for something like cross entropy/predictive accuracy. Thus, just saying “if the model is an optimizer, it won't just learn to optimize for cross entropy/predictive accuracy/whatever else it was trained on,” while true, is unhelpful. What I like about training stories is that it explicitly asks what sort of model you want to get—rather than assuming that you want something which is optimizing for your training objective—and then asks how likely we are to actually get it (as opposed to some sort of mesa-optimizer, a deceptive model, or anything else).
AMA: Paul Christiano, alignment researcher

What changed your mind about Chaitin's constant?

3Paul Christiano10mo
I hadn't appreciated how hard and special it is to be algorithmically random.
Emergent modularity and safety

It's true! Altho I think of putting something up on arXiv as a somewhat lower bar than 'publication' - that paper has a bit of work left.

Welcome & FAQ!

I really like the art!

Finite Factored Sets: Orthogonality and Time

OK I think this is a typo, from the proof of prop 10 where you deal with condition 5:

Thus .

I think this should be χFC(x,s)⊆x.

2Scott Garrabrant1y
Fixed, thanks.
Finite Factored Sets: Orthogonality and Time

From def 16:

... if for all

Should I take this to mean "if for all and "?

[EDIT: no, I shouldn't, since and are both subsets of ]

1DanielFilan1y
OK I think this is a typo, from the proof of prop 10 where you deal with condition 5: I think this should be χFC(x,s)⊆x.
A simple example of conditional orthogonality in finite factored sets

Seems right. I still think it's funky that X_1 and X_2 are conditionally non-orthogonal even when the range of the variables is unbounded.

4Scott Garrabrant1y
Yeah, this is the point that orthogonality is a stronger notion than just all values being mutually compatible. Any x1 and x2 values are mutually compatible, but we don't call them orthogonal. This is similar to how we don't want to say that x1 and (the level sets of) x1+x2 are orthogonal. The coordinate system has a collection of surgeries: you can take a point and change the x1 value without changing the other values. When you condition on E, that surgery is no longer well defined. However, the surgery of only changing the x4 value is still well defined, and the surgery of changing x1, x2, and x3 simultaneously is still well defined (provided you change them to something compatible with E). We could define a surgery that says that when you increase x1, you decrease x2 by the same amount, but that is a new surgery that we invented, not one that comes from the original coordinate system.
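(A toy numeric version of Scott's point, my own illustration rather than part of his comment: condition on E = {x1 + x2 = 0}. The surgery "change x4 only" still stays inside E, and so does "change x1, x2, and x3 simultaneously to something with x1 + x2 = 0", but "change x1 only" takes you out of E unless you also adjust x2, so it is no longer one of the coordinate system's surgeries once you condition.)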
AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant

I'm glad to hear that the podcast is useful for people :)

Knowledge is not just mutual information

Seems like the solution should perhaps be that you should only take 'the system' to be the 'controllable' physical variables, or those variables that are relevant for 'consequential' behaviour? Hopefully, if one can provide good definitions for these, it will provide a foundation for saying what the abstractions should be that let us distinguish between 'high-level' and 'low-level' behaviour.

Challenge: know everything that the best go bot knows about go

Ah, understood. I think this is basically covered by talking about what the go bot knows at various points in time, a la this comment - it seems pretty sensible to me to talk about knowledge as a property of the actual computation rather than the algorithm as a whole. But from your response there it seems that you think that this sense isn't really well-defined.

1Richard Ngo1y
I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).
Challenge: know everything that the best go bot knows about go

Actually, hmm. My thoughts are not really in equilibrium here.

AXRP Episode 7 - Side Effects with Victoria Krakovna

Not sure what the actual sentence you wanted to write was. "are not absolutely necessary" maybe?

You're quite right, let me fix that.

1DanielFilan1y
And also thanks for your kind words :)
Challenge: know everything that the best go bot knows about go

(Also: such a rewrite would be a combination of 'what I really meant' and 'what the comments made me realize I should have really meant')

Challenge: know everything that the best go bot knows about go

OK, the parenthetical helped me understand where you're coming from. I think a re-write of this post should (in part) make clear that I think a massive heroic effort would be necessary to make this happen, but sometimes massive heroic efforts work, and I have no special private info that makes it seem more plausible than it looks a priori.

1DanielFilan1y
Actually, hmm. My thoughts are not really in equilibrium here.
1DanielFilan1y
(Also: such a rewrite would be a combination of 'what I really meant' and 'what the comments made me realize I should have really meant')
Challenge: know everything that the best go bot knows about go

In the parent, is your objection that the trained AlphaZero-like model plausibly knows nothing at all?

0Richard Ngo1y
The trained AlphaZero model knows lots of things about Go, in a comparable way to how a dog knows lots of things about running. But the algorithm that gives rise to that model can know arbitrarily few things. (After all, the laws of physics gave rise to us, but they know nothing at all.)
Challenge: know everything that the best go bot knows about go

Suppose you have a computer program that gets two neural networks, simulates a game of go between them, determines the winner, and uses the outcome to modify the neural networks. It seems to me that this program has a model of the 'go world', i.e. a simulator, and from that model you can fairly easily extract the rules and winning condition. Do you think that this is a model but not a mental model, or that it's too exact to count as a model, or something else?

2Richard Ngo1y
I'd say that this is too simple and programmatic to be usefully described as a mental model. The amount of structure encoded in the computer program you describe is very small, compared with the amount of structure encoded in the neural networks themselves. (I agree that you can have arbitrarily simple models of very simple phenomena, but those aren't the types of models I'm interested in here. I care about models which have some level of flexibility and generality, otherwise you can come up with dumb counterexamples like rocks "knowing" the laws of physics.) As another analogy: would you say that the quicksort algorithm "knows" how to sort lists? I wouldn't, because you can instead just say that the quicksort algorithm sorts lists, which conveys more information (because it avoids anthropomorphic implications). Similarly, the program you describe builds networks that are good at Go, and does so by making use of the rules of Go, but can't do the sort of additional processing with respect to those rules which would make me want to talk about its knowledge of Go.
Challenge: know everything that the best go bot knows about go

I think there's some communication failure where people are very skeptical of this for reasons that they think are obvious given what they're saying, but which are not obvious to me. Can people tell me which subset of the below claims they agree with, if any? Also if you come up with slight variants that you agree with that would be appreciated.

  1. It is approximately impossible to succeed at this challenge.
  2. It is possible to be confident that advanced AGI systems will not pose an existential threat without being able to succeed at this challenge.
  3. It is not
... (read more)
1Adam Shimi1y
My take is:
  • I think making this post was a good idea. I'm personally interested in deconfusing the topic of universality (which should basically capture what "learning everything the model knows" means), and you brought up a good "simple" example to try to build intuition on.
  • What I would call your mistake is mostly 8, but a bit of the related ones (so 3 and 4?). Phrasing it as "can we do that" is a mistake in my opinion because the topic is very confused (as shown by the comments). On the other hand, I think asking the question of what it would mean is a very exciting problem. It also gives a more concrete form to the problem of deconfusing universality, which is important AFAIK to Paul's approaches to alignment.
Challenge: know everything that the best go bot knows about go

I'd also be happy with an inexact description of what the bot will do in response to specified strategies that captured all the relevant details.

Challenge: know everything that the best go bot knows about go

I think that it isn't clear what constitutes "fully understanding" an algorithm.

That seems right.

Another obstacle to full understanding is memory. Suppose your go bot has memorized a huge list of "if you are in such and such situation move here" type rules.

I think there's reason to believe that SGD doesn't do exactly this (nets that memorize random data have different learning curves than normal nets iirc?), and better reason to think it's possible to train a top go bot that doesn't do this.

There is not in general a way to compute what an algorith

... (read more)
1DanielFilan1y
I'd also be happy with an inexact description of what the bot will do in response to specified strategies that captured all the relevant details.
Challenge: know everything that the best go bot knows about go

Hmmm. It does seem like I should probably rewrite this post. But to clarify things in the meantime:

  • it's not obvious to me that this is a realistic target, and I'd be surprised if it took fewer than 10 person-years to achieve.
  • I do think the knowledge should 'cover' all the athlete's ingrained instincts in your example, but I think the propositions are allowed to look like "it's a good idea to do x in case y".
2Richard Ngo1y
Perhaps I should instead have said: it'd be good to explain to people why this might be a useful/realistic target. Because if you need propositions that cover all the instincts, then it seems like you're basically asking for people to revive GOFAI. (I'm being unusually critical of your post because it seems that a number of safety research agendas lately have become very reliant on highly optimistic expectations about progress on interpretability, so I want to make sure that people are forced to defend that assumption rather than starting an information cascade.)
Challenge: know everything that the best go bot knows about go

On that definition, how does one train an AlphaZero-like algorithm without knowing the rules of the game and win condition?

0Richard Ngo1y
The human knows the rules and the win condition. The optimisation algorithm doesn't, for the same reason that evolution doesn't "know" what dying is: neither are the types of entities to which you should ascribe knowledge.
Challenge: know everything that the best go bot knows about go

Perhaps the bot knows different things at different times and your job is to figure out (a) what it always knows and (b) a way to quickly find out everything it knows at a certain point in time.

I think at this point you've pushed the word "know" to a point where it's not very well-defined; I'd encourage you to try to restate the original post while tabooing that word.

This seems particularly valuable because there are some versions of "know" for which the goal of knowing everything a complex model knows seems wildly unmanageable (for example, trying to convert a human athlete's ingrained instincts into a set of propositions). So before people start trying to do what you suggested, it'd be good to explain why it's actually a realistic target.

Challenge: know everything that the best go bot knows about go

Also it certainly knows the rules of go and the win condition.

0Richard Ngo1y
As an additional reason for the importance of tabooing "know", note that I disagree with all three of your claims about what the model "knows" in this comment and its parent. (The definition of "know" I'm using is something like "knowing X means possessing a mental model which corresponds fairly well to reality, from which X can be fairly easily extracted".)
Challenge: know everything that the best go bot knows about go

But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play."

I would say that bot knows what the trained AlphaZero-like model knows.

1DanielFilan1y
Also it certainly knows the rules of go and the win condition.
Challenge: know everything that the best go bot knows about go

Maybe it nearly suffices to get a go professional to know everything about go that the bot does? I bet they could.

3Adam Shimi1y
What does that mean though? If you give the go professional a massive transcript of the bot knowledge, it's probably unusable. I think what the go professional gives you is the knowledge of where to look/what to ask for/what to search.
Challenge: know everything that the best go bot knows about go

[D]oes understanding the go bot in your sense imply that you could play an even game against it?

I imagine so. One complication is that it can do more computation than you.

1ESRogs1y
But once you let it do more computation, then it doesn't have to know anything at all, right? Like, maybe the best go bot is, "Train an AlphaZero-like algorithm for a million years, and then use it to play." I know more about go than that bot starts out knowing, but less than it will know after it does computation. I wonder if, when you use the word "know", you mean some kind of distilled, compressed, easily explained knowledge?
Challenge: know everything that the best go bot knows about go

You could plausibly play an even game against a go bot without knowing everything it knows.

2weathersystems1y
Sure. But the question is can you know everything it knows and not be as good as it? That is, does understanding the go bot in your sense imply that you could play an even game against it?
Mundane solutions to exotic problems

FYI: I would find it useful if you said somewhere what 'epistemic competitiveness' means and linked to it when using the term.

3Adam Shimi1y
I assume the right pointer is ascription universality [https://ai-alignment.com/towards-formalizing-universality-409ab893a456].