rohinmshah's Shortform

by Rohin Shah, 18th Jan 2020, 26 comments

It's common for people to be worried about recommender systems being addictive or promoting filter bubbles etc, but as far as I can tell, they don't have very good arguments for these worries. Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.

I'll go through the articles I've read that argue for worrying about recommender systems, and explain why I find them unconvincing. I've only looked at the ones that are widely read; there are probably significantly better arguments that are much less widely read.

Aligning Recommender Systems as Cause Area. I responded briefly on the post. Their main arguments and my counterarguments are:

  1. A few sources say that it is bad + it has incredible scale + it should be super easy to solve. (I don't trust the sources and suspect the authors didn't check them; I agree there's huge scale; I don't see why it should be super easy to solve even if there is a problem, especially given that many of the supposed problems seem to have existed before recommender systems.)
  2. Maybe working on recommender systems would have spillover effects on AI alignment. (This seems dominated by just working directly on AI alignment. Also the core feature of AI alignment is that the AI system deliberately and intentionally does things, and creates plans in new situations that you hadn't seen before, which is not the case with recommender systems, so I don't expect many spillover effects.)

80K podcast with Tristan Harris. This was actively annoying for a variety of reasons:

  1. I don't know what the main claim was. Ostensibly it was meant to be "it is bad that companies have monetized human attention since this leads to lots of bad incentives and bad outcomes". But then so many specific things mentioned have nothing to do with this claim and instead seem to be a vague general "tech companies are bad". Most egregiously, in section Global effects [01:02:44], Rob argues "WhatsApp doesn't have ads / recommender systems, so it acts as a control group, but it too has bad outcomes, doesn't this mean the problem isn't ads / recommender systems?" and Tristan says "That's right, WhatsApp is terrible, it's causing mass lynchings" as though that supports his point.
  2. When Rob made some critique of the main argument, Tristan deflected with an example of tech doing bad things. But it's always vaguely related, so you think he's addressing the critique, even though he hasn't actually. (I'm reminded of the Zootopia strategy for press conferences.) See sections "The messy real world vs. an imagined idealised world [00:38:20]" (Rob: weren't negative things happening before social media? Tristan: it's easy to fake credibility in text), "The persuasion apocalypse [00:47:46]" (Rob: can't one-on-one conversations be persuasive too? Tristan: you can lie in political ads), "Revolt of the Public [00:56:48]" (Rob: doesn't the internet allow ordinary people to challenge established institutions in good ways? Tristan: Alex Jones has been recommended 15 billion times.) 

    US politics [01:13:32] is a rare counterexample, where Rob says "why aren't other countries getting polarized", and Tristan replies "since it's a positive feedback loop only countries with high initial polarization will see increasing polarization". It's not a particularly convincing response, but at least it's a response.
  3. Tristan seems to be very big on "the tech companies changed what they were doing, that proves we were right". I think it is just as consistent to say "we yelled at the companies a lot and got the public to yell at them too, and that caused a change, regardless of whether the problem was serious or not, or whether the solution was net positive or not".

The second half of the podcast focuses more on solutions. Given that I am unconvinced about the problem, I wasn't all that interested, but it seemed generally reasonable.

(This post responds to the object level claims, which I have not done because I don't know much about the object level.)

There's also the documentary "The Social Dilemma", but I expect it's focused entirely on problems, probably doesn't try to have good rigorous statistics, and surely will make no attempt at a cost-benefit analysis so I seriously doubt it would change my mind on anything. (And it is associated with Tristan Harris so I'd assume that most of the relevant details would have made it into the 80K podcast.)

Recommender systems are still influential, and you could want to work on them just because of their huge scale. I like Designing Recommender Systems to Depolarize as an example of what this might look like.

Thanks for this Rohin. I've been trying to raise awareness about the potential dangers of persuasion/propaganda tools, but you are totally right that I haven't actually done anything close to a rigorous analysis. I agree with what you say here that a lot of the typical claims being thrown around seem based more on armchair reasoning than hard data. I'd love to see someone really lay out the arguments and analyze them... My current take is that (some of) the armchair theories seem pretty plausible to me, such that I'd believe them unless the data contradicts. But I'm extremely uncertain about this.

I've been trying to raise awareness about the potential dangers of persuasion/propaganda tools

I should note that there's a big difference between "recommender systems cause polarization as a side effect of optimizing for engagement" and "we might design tools that explicitly aim at persuasion / propaganda". I'm confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be.

My current take is that (some of) the armchair theories seem pretty plausible to me, such that I'd believe them unless the data contradicts.

Usually, for any sufficiently complicated question (which automatically includes questions about the impact of technologies used by billions of people, since people are so diverse), I think an armchair theory is only slightly better than a monkey throwing darts, so I'm more in the position of "yup, sounds plausible, but that doesn't constrain my beliefs about what the data will show and medium quality data will trump the theory no matter how it comes out".

I should note that there's a big difference between "recommender systems cause polarization as a side effect of optimizing for engagement" and "we might design tools that explicitly aim at persuasion / propaganda". I'm confident we could (eventually) do the latter if we tried to; the question is primarily whether we will try to and, if we do, what its effects will be.

Oh, then maybe we don't actually disagree that much! I am not at all confident that optimizing for engagement has the side effect of increasing polarization. It seems plausible but it's also totally plausible that polarization is going up for some other reason(s). My concern (as illustrated in the vignette I wrote) is that we seem to be on a slippery slope to a world where persuasion/propaganda is more effective and widespread than it has been historically, thanks to new AI and big data methods. My model is: Ideologies and other entities have always been using propaganda of various kinds, and there's always been a race between improving propaganda tech and improving truth-finding tech, but we are currently in a big AI boom and in particular in a Big Data and Natural Language Processing boom, and this seems like it'll be a big boost to propaganda tech, and unfortunately I can't think of ways in which it will correspondingly boost truth-finding-ness across society, because while it can be used to make truth-finding tech maybe (e.g. prediction markets, fact-checkers, etc.) it seems like most people in practice just don't want to adopt truth-finding tech. It's true that we could design a different society/culture that used all this awesome new tech to be super truth-seeking and have a very epistemically healthy discourse, but it seems like we are not about to do that anytime soon, instead we are going in the opposite direction.

I think that story involves lots of assumptions I don't immediately believe (but don't disbelieve either):

  • People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
  • Such people will quickly realize that AI will be very useful for this
  • They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
  • The resulting AI system will in fact be very good at persuasion / propaganda
  • AI that fights persuasion / propaganda either won't be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can't keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won't be true with AI)

And probably there are a bunch of other assumptions I haven't even thought to question.

I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be "raise awareness", it should be "figure out whether the assumptions are justified".

I think it seems fine to raise the possibility and do more research (and for all I know CSET or GovAI has done this research) but at least under my beliefs the current action should not be "raise awareness", it should be "figure out whether the assumptions are justified".

That's all I'm trying to do at this point, to be clear. Perhaps "raise awareness" was the wrong choice of phrase.

Re: the object-level points: For how I see this going, see my vignette, and my reply to steve. The bullet points you put here make it seem like you have a different story in mind. [EDIT: But I agree with you that it's all super unclear and more research is needed to have confidence in any of this.]

That's all I'm trying to do at this point, to be clear.

Excellent :)

For how I see this going, see my vignette, and my reply to steve.

(Link is broken, but I found the comment.) After reading that reply I still feel like it involves the assumptions I mentioned above.

Maybe your point is that your story involves "silos" of Internet-space within which particular ideologies / propaganda reign supreme. I don't really see that as changing my object-level points very much but perhaps I'm missing something.

I was confusing, sorry -- what I meant was, technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is... loaded? Designed to make them seem implausible? idk, something like that, in a way that made me wonder if you had a different story in mind. Going through them one by one:

  • People are very deliberately building persuasion / propaganda tech (as opposed to e.g. people like to loudly state opinions and the persuasive ones rise to the top)
    • This is already happening in 2021 and previous, in my story it happens more.
  • Such people will quickly realize that AI will be very useful for this
    • Again, this is already happening.
  • They will actually try to build it (as opposed to e.g. raising a moral outcry and trying to get it banned)
    • Plenty of people are already raising a moral outcry. In my story these people don't succeed in getting it banned, but I agree the story could be wrong. I hope it is!
  • The resulting AI system will in fact be very good at persuasion / propaganda
    • Yep. I don't have hard evidence, but intuitively this feels like the sort of thing today's AI techniques would be good at, or at least good-enough-to-improve-on-the-state-of-the-art.
  • AI that fights persuasion / propaganda either won't be built or will be ineffective (my unreliable armchair reasoning suggests the opposite; it seems to me like right now human fact-checking labor can't keep up with human controversy-creating labor partly because humans enjoy the latter more than the former; this won't be true with AI)
    • I think it won't be built & deployed in such a way that collective epistemology is overall improved. Instead, the propaganda-fighting AIs will themselves have blind spots, to allow in the propaganda of the "good guys." The CCP will have their propaganda-fighting AIs, the Western Left will have theirs, the Western Right will have theirs, etc. (I think what happened with the internet is precedent for this. In theory, having all these facts available at all of our fingertips should have led to a massive improvement in collective epistemology and a massive improvement in truthfulness, accuracy, balance, etc. in the media. But in practice it didn't.) It's possible I'm being too cynical here of course!

technically my story involves assumptions like the ones you list in the bullet points, but the way you phrase them is... loaded? Designed to make them seem implausible?

I don't think it's designed to make them seem implausible? Maybe the first one? Idk, I could say that your story is designed to make them seem plausible (e.g. by not explicitly mentioning them as assumptions).

I think it's fair to say it's "loaded", in the sense that I am trying to push towards questioning those assumptions, but I don't think I'm doing anything epistemically unvirtuous.

This is already happening in 2021 and previous, in my story it happens more.

This does not seem obvious to me (but I also don't pay much attention to this sort of stuff so I could be missing evidence that makes it very obvious).

The CCP will have their propaganda-fighting AIs, the Western Left will have theirs, the Western Right will have theirs, etc.

That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments.

I don't really see "number of facts" as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now.

(I just tried to find the best argument that GMOs aren't going to cause long-term harms, and found nothing. We do at least have several arguments that COVID vaccines won't cause long-term harms. I armchair-conclude that a thing has to get to the scale of COVID vaccine hesitancy before people bother trying to address the arguments from the other side.)

Perhaps I shouldn't have mentioned any of this. I also don't think you are doing anything epistemically unvirtuous. I think we are just bouncing off each other for some reason, despite seemingly being in broad agreement about things. I regret wasting your time.

That seems correct. But plausibly the best way for these AIs to fight propaganda is to respond with truthful counterarguments.
I don't really see "number of facts" as the relevant thing for epistemology. In my anecdotal experience, people disagree on values and standards of evidence, not on facts. AIs that can respond to anti-vaxxers in their own language seem way, way more impactful than what we have now.

The first bit seems in tension with the second bit, no? At any rate, I also don't see number of facts as the relevant thing for epistemology. I totally agree with your take here.

The first bit seems in tension with the second bit, no?

"Truthful counterarguments" is probably not the best phrase; I meant something more like "epistemically virtuous counterarguments". Like, responding to "what if there are long-term harms from COVID vaccines" with "that's possible but not very likely, and it is much worse to get COVID, so getting the vaccine is overall safer" rather than "there is no evidence of long-term harms".

This was a good post. I'd bookmark it, but unfortunately that functionality doesn't exist yet.* (Though if you have any open source bookmark plugins to recommend, that'd be helpful.) I'm mostly responding to say this though:

Designing Recommender Systems to Depolarize

While it wasn't otherwise mentioned in the abstract of the paper (above), this was stated once:

This paper examines algorithmic depolarization interventions with the goal of conflict transformation: not suppressing or eliminating conflict but moving towards more constructive conflict.

I thought this was worth calling out, although I am still in the process of reading that 10/14 page paper. (There are 4 pages of references.)


And some other commentary while I'm here:

It's common for people to be worried about recommender systems being addictive

I imagine the recommender system is only as good as what it has to work with, content wise - and that's before getting into 'what does the recommender system have to go off of', and 'what does it do with what it has'.


Whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine.

This part wasn't elaborated on. To put it a different way:

It's common for people to be worried about recommender systems being addictive or promoting filter bubbles etc, but as far as I can tell, they don't have very good arguments for these worries.

Do the people 'who know what's going on' (presumably) have better arguments? Do you?


*I also have a suspicion it's not being used. I.e., past a certain number of bookmarks like 10, it's not actually feasible to use the LW interface to access them.

Do the people 'who know what's going on' (presumably) have better arguments?

Possibly, but if so, I haven't seen them.

My current belief is "who knows if there's a major problem with recommender systems or not". I'm not willing to defer to them, i.e. say "there probably is a problem based on the fact that the people who've studied them think there's a problem", because as far as I can tell all of those people got interested in recommender systems because of the bad arguments and so it feels a bit suspicious / selection-effect-y that they still think there are problems. I would engage with arguments they provide and come to my own conclusions (whereas I probably would not engage with arguments from other sources).

Do you?

No. I just have anecdotal experience + armchair speculation, which I don't expect to be much better at uncovering the truth than the arguments I'm critiquing.

The incentive of social media companies to invest billions into training competitive RL agents that make their users spend as much time as possible on their platforms seems like an obvious reason to be concerned. Especially when such RL agents plausibly already select a substantial fraction of the content that people in developed countries consume.

I don't trust this sort of armchair reasoning. I think this is sufficient reason to raise the hypothesis to attention, but not enough to conclude that it is likely a real concern. And the data I have seen does not seem kind to the hypothesis (though there may be better data out there that does support the hypothesis).

I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea / point, but there doesn't exist any such thing. Often my reaction is "if only there was time to write an LW post that I can then link to in the future". So far I've just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I'm now going to experiment with making subcomments here simply collecting the ideas; perhaps other people will write posts about them at some point, if they're even understandable.

From the Truthful AI paper:

If all information pointed towards a statement being true when it was made, then it would not be fair to penalise the AI system for making it. Similarly, if contemporary AI technology isn’t sophisticated enough to recognise some statements as potential falsehoods, it may be unfair to penalise AI systems that make those statements.

I wish we would stop talking about what is "fair" to expect of AI systems in AI alignment*. We don't care what is "fair" or "unfair" to expect of the AI system, we simply care about what the AI system actually does. The word "fair" comes along with a lot of connotations, often ones which actively work against our goal.

At least twice I have made an argument to an AI safety researcher in which I posed a story where an AI system fails, and I have gotten the response "but that isn't fair to the AI system" (because it didn't have access to the necessary information to make the right decision), as though this somehow prevents the story from happening in reality.

(This sort of thing happens with mesa optimization -- if you have two objectives that are indistinguishable on the training data, it's "unfair" to expect the AI system to choose the right one, given that they are indistinguishable given the available information. This doesn't change the fact that such an AI system might cause an existential catastrophe.)

In both cases I mentioned that what we care about are actual outcomes, and that you can tell such stories where in actual reality the AI kills everyone regardless of whether you think it is fair or not, and this was convincing. It's not that the people I was talking to didn't understand the point, it's that some mental heuristic of "be fair to the AI system" fired and temporarily led them astray.

Going back to the Truthful AI paper, I happen to agree with their conclusion, but the way I would phrase it would be something like:

If all information pointed towards a statement being true when it was made, then it would appear that the AI system was displaying the behavior we would see from the desired algorithm, and so a positive reward would be more appropriate than a negative reward, despite the fact that the AI system produced a false statement. Similarly, if the AI system cannot recognize the statement as a potential falsehood, providing a negative reward may just add noise to the gradient rather than making the system more truthful.

* Exception: Seems reasonable to talk about fairness when considering whether AI systems are moral patients, and if so, how we should treat them.

What won't we be able to do by (say) the end of 2025? (See also this recent post.) Well, one easy way to generate such answers would be to consider tasks that require embodiment in the real world, or tasks that humans would find challenging to do. (For example, “solve the halting problem”, “produce a new policy proposal that has at least a 95% chance of being enacted into law”, “build a household robot that can replace any human household staff”.) This is cheating, though; the real challenge is in naming something where there’s an adjacent thing that _does_ seem likely (i.e. it’s near the boundary separating “likely” from “unlikely”).

One decent answer is that I don’t expect we’ll have AI systems that could write new posts _on rationality_ that I like more than the typical LessWrong post with > 30 karma. However, I do expect that we could build an AI system that could write _some_ new post (on any topic) that I like more than the typical LessWrong post with > 30 karma. This is because (1) 30 karma is not that high a filter and includes lots of posts I feel pretty meh about, (2) there are lots of topics I know nothing about, on which it would be relatively easy to write a post I like, and (3) AI systems easily have access to this knowledge by being trained on the Internet. (It is another matter whether we actually build an AI system that can do this.) Note that there is still a decently large difference between these two tasks -- the content would have to be quite a bit more novel in the former case (which is why I don’t expect it to be solved by 2025).

Note that I still think it’s pretty hard to predict what will and won’t happen, so even for this example I’d probably assign, idk, a 10% chance that it actually does work out (if we assume some organization tries hard to make it work)?

Nice! I really appreciate that you are thinking about this and making predictions. I want to do the same myself.

I think I'd put something more like 50% on "Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post." That's just a wild guess, very unstable.

Another potential prediction generation methodology: Name something that you think won't happen, but you think I think will.

Rohin will at some point before 2030 read an AI-written blog post on rationality that he likes more than the typical LW >30 karma post.

This seems more feasible, because you can cherrypick a single good example. I wouldn't be shocked if someone on LW spent a lot of time reading AI-written blog posts on rationality and posted the best one, and I liked that more than a typical >30 karma post. My default guess is that no one tries to do this, so I'd still give it < 50% (maybe 30%?), but conditional on someone trying I think probably 80% seems right.
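As a rough consistency check on the two numbers above (about 30% overall, about 80% conditional on someone trying), here is a minimal sketch of the probability that someone tries at all that they jointly imply. It assumes, purely for illustration, that the chance of liking such a post is negligible if nobody makes a deliberate attempt; that assumption is not stated in the comment.

```python
# Rough numbers taken from the comment above.
p_like_overall = 0.30      # "maybe 30%?"
p_like_given_try = 0.80    # "conditional on someone trying ... 80%"

# Illustrative assumption (not from the comment): P(like | nobody tries) ~ 0,
# so P(like) ~= P(someone tries) * P(like | someone tries).
implied_p_try = p_like_overall / p_like_given_try

print(f"Implied P(someone tries) is about {implied_p_try:.3f}")
```

Under that assumption the implied chance that someone tries is a bit under 40%, which fits the stated default guess that most likely no one does.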

Name something that you think won't happen, but you think I think will.

I spent a bit of time on this but I think I don't have a detailed enough model of you to really generate good ideas here :/

Otoh, if I were expecting TAI / AGI in 15 years, then by 2030 I'd expect to see things like:

  • An AI system that can create a working website with the desired functionality "from scratch" (e.g. a simple Twitter-like website, an application that tracks D&D stats and dice rolls for you, a simple Tetris game with an account system, etc.). The system allows even non-programmers to create these kinds of websites (so cannot depend on having a human programmer step in to e.g. fix compiler errors).
  • At least one large, major research area in which human researcher productivity has been boosted 100x relative to today's levels thanks to AI. (In calculating the productivity we ignore the cost of running the AI system.) Humans can still be in the loop here, but the large majority of the work must be done by AIs.
  • An AI system gets 20,000 LW karma in a year, when limited to writing one article per day and responses to any comments it gets from humans.
  • Productivity tools like todo lists, memory systems, time trackers, calendars, etc are made effectively obsolete (or at least the user interfaces are made obsolete); the vast majority of people who used to use these tools have replaced them with an Alexa / Siri style assistant.

Currently, I don't expect to see any of these by 2030.

Ah right, good point, I forgot about cherry-picking. I guess we could make it be something like "And the blog post wasn't cherry-picked; the same system could be asked to make 2 additional posts on rationality and you'd like both of them also." I'm not sure what credence I'd give to this but it would probably be a lot higher than 10%.

Website prediction: Nice, I think that's like 50% likely by 2030.

Major research area: What counts as a major research area? Suppose I go calculate that AlphaFold 2 has already sped up the field of protein structure prediction by 100x (don't need to do actual experiments anymore!), would that count? If you hadn't heard of AlphaFold yet, would you say it counted? Perhaps you could give examples of the smallest and easiest-to-automate research areas that you think have only a 10% chance of being automated by 2030.

20,000 LW karma: Holy shit that's a lot of karma for one year. I feel like it's possible that would happen before it's too late (narrow AI good at writing but not good at talking to people and/or not agenty) but unlikely. Insofar as I think it'll happen before 2030 it doesn't serve as a good forecast because it'll be too late by that point IMO.

Productivity tool UIs obsolete thanks to assistants: This is a good one too. I think that's 50% likely by 2030.

I'm not super certain about any of these things of course, these are just my wild guesses for now.

20,000 LW karma: Holy shit that's a lot of karma for one year.

I was thinking 365 posts * ~50 karma per post gets you most of the way there (18,250 karma), and you pick up some additional karma from comments along the way.  50 karma posts are good but don't have to be hugely insightful; you can also get a lot of juice by playing to the topics that tend to get lots of upvotes. Unlike humans the bot wouldn't be limited by writing speed (hence my restriction of one post per day). AI systems should be really, really good at writing, given how easy it is to train on text. And a post is a small, self-contained thing, that takes not very long to create (i.e. it has short horizons), and there are lots of examples to learn from. So overall this seems like a thing that should happen well before TAI / AGI.
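The arithmetic behind that estimate, written out as a small sketch. The 50-karma figure is the rough per-post average from the comment, and the variable names are only illustrative.

```python
posts_per_year = 365       # one post per day, per the stated restriction
karma_per_post = 50        # rough average assumed in the comment
target_karma = 20_000

karma_from_posts = posts_per_year * karma_per_post
karma_needed_from_comments = target_karma - karma_from_posts

print(karma_from_posts)             # 18250
print(karma_needed_from_comments)   # 1750
```

So posts alone cover a little over 90% of the target, and comments would need to supply roughly 1,750 karma over the year.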

I think I want to give up on the research area example, seems pretty hard to operationalize. (But fwiw according to the picture in my head, I don't think I'd count AlphaFold.)

OK, fair enough. But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting? I think this would happen to many humans if they could work at super-speed.

That said, I don't think this is that likely I guess... probably AI will be unable to do even three such posts, or it'll be able to generate arbitrary numbers of them. The human range is small. Maybe. Idk.

But what if it writes, like, 20 posts in the first 20 days which are that good, but then afterwards it hits diminishing returns because the rationality-related points it makes are no longer particularly novel and exciting?

I'd be pretty surprised if that happened. GPT-3 already knows way more facts than I do, and can mimic far more writing styles than I can. It seems like by the time it can write any good posts (without cherrypicking), it should quickly be able to write good posts on a variety of topics in a variety of different styles, which should let it scale well past 20 posts.

(In contrast, a specific person tends to write on 1-2 topics, in a single style, and not optimizing that hard for karma, and many still write tens of high-scoring posts.)

The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).

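Let's consider a model where there are clusters $\{c_i\}$, where each cluster contains trajectories whose features are identical (which also implies rewards are identical). Let $c(\tau)$ denote the cluster that $\tau$ belongs to, and write $R_\theta(c)$ for the reward shared by every trajectory in cluster $c$. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, that is, the human chooses a cluster noisily based on the reward, and then uniformly at random chooses a trajectory from within that cluster.

(Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these "clusters"; I'm introducing them as a simpler situation where we can understand what's going on formally.)

In this model, a "sparse region of demonstration-space" is a cluster $c$ with small cardinality $|c|$, whereas a dense one has large $|c|$.

Let's first do some preprocessing. We can rewrite the Boltzmann model as follows:

$$p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))} = \frac{|c(\tau)| \exp(R_\theta(c(\tau)))}{\sum_{c'} |c'| \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$

This allows us to write both models as first selecting a cluster, and then choosing randomly within the cluster:

$$p(\tau \mid \theta) = \frac{p(c(\tau)) \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$

Where for LESS $p(c)$ is uniform, i.e. $p(c) \propto 1$, whereas for Boltzmann $p(c) \propto |c|$, i.e. a denser cluster is more likely to be sampled.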

So now let us return to the original claim that the Boltzmann model overlearns in sparse areas. We'll assume that LESS is the "correct" way to update (which is what the paper is claiming); in this case the claim reduces to saying that the Boltzmann model updates the posterior over rewards in the right direction but with too high a magnitude.

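The intuitive argument for this is that the Boltzmann model assigns a lower likelihood to sparse clusters, since its "prior" over sparse clusters is much smaller, and so when it actually observes this low-likelihood event, it must update more strongly. However, this argument doesn't work -- it only claims that $p_{\text{Boltzmann}}(\tau \mid \theta) < p_{\text{LESS}}(\tau \mid \theta)$, but in order to do a Bayesian update you need to consider likelihood ratios. To see this more formally, let's look at the reward learning update:

$$\begin{aligned}
p(\theta \mid \tau) &= \frac{p(\tau \mid \theta)\, p(\theta)}{\sum_{\theta'} p(\tau \mid \theta')\, p(\theta')} \\
&= \frac{\frac{p(c(\tau)) \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}\, p(\theta)}{\sum_{\theta'} \frac{p(c(\tau)) \exp(R_{\theta'}(c(\tau)))}{\sum_{c'} p(c') \exp(R_{\theta'}(c'))} \cdot \frac{1}{|c(\tau)|}\, p(\theta')} \\
&= \frac{\frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))}\, p(\theta)}{\sum_{\theta'} \frac{\exp(R_{\theta'}(c(\tau)))}{\sum_{c'} p(c') \exp(R_{\theta'}(c'))}\, p(\theta')}.
\end{aligned}$$

In the last step, any linear terms in $p(\tau \mid \theta)$ that didn't depend on $\theta$ cancelled out. In particular, the prior over the selected class cancelled out (though the prior did remain in the normalizer / denominator, where it can still affect things). But the simple argument of "the prior is lower, therefore it updates more strongly" doesn't seem to be reflected here.

Also, as you might expect, once we make the shift to thinking of selecting a cluster and then selecting a trajectory randomly, it no longer matters which trajectory you choose -- the only relevant information is the cluster chosen (you can see this in the update above, where the only thing you do with the trajectory is to see which cluster it is in). So from now on I'll just talk about selecting clusters, and updating on them. I'll also write $ER_\theta(c) = \exp(R_\theta(c))$ for conciseness.

$$p(\theta \mid c) = \frac{\frac{ER_\theta(c)}{\sum_{c'} p(c')\, ER_\theta(c')}\, p(\theta)}{\sum_{\theta'} \frac{ER_{\theta'}(c)}{\sum_{c'} p(c')\, ER_{\theta'}(c')}\, p(\theta')}.$$

This is a horrifying mess of an equation. Let's switch to odds:

$$\frac{p(\theta_1 \mid c)}{p(\theta_2 \mid c)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{ER_{\theta_1}(c)}{ER_{\theta_2}(c)} \cdot \frac{\sum_{c'} p(c')\, ER_{\theta_2}(c')}{\sum_{c'} p(c')\, ER_{\theta_1}(c')}$$

The first two terms are the same across Boltzmann and LESS, since those only differ in their choice of $p(c)$. So let's consider just that last term. Denoting the vector of priors on all classes as $\vec{p}$, and similarly the vector of exponentiated rewards as $\vec{ER}_\theta$, the last term becomes $\frac{\vec{p} \cdot \vec{ER}_{\theta_2}}{\vec{p} \cdot \vec{ER}_{\theta_1}} = \frac{|\vec{ER}_{\theta_2}|}{|\vec{ER}_{\theta_1}|} \cdot \frac{\cos(\alpha_2)}{\cos(\alpha_1)}$, where $\alpha_i$ is the angle between $\vec{p}$ and $\vec{ER}_{\theta_i}$. Again, the first term doesn't differ between Boltzmann and LESS, so the only thing that differs between the two is the ratio $\frac{\cos(\alpha_2)}{\cos(\alpha_1)}$.

What happens when the chosen class $c$ is sparse? Without loss of generality, let's say that $ER_{\theta_1}(c) > ER_{\theta_2}(c)$; that is, $\theta_1$ is a better fit for the demonstration, and so we will update towards it. Since $c$ is sparse, $p(c)$ is smaller for Boltzmann than for LESS -- which probably means that $\vec{p}$ is better aligned with $\vec{ER}_{\theta_2}$, which also has a low value of $ER_{\theta_2}(c)$ by assumption. (However, this is by no means guaranteed.) In this case, the ratio above would be higher for Boltzmann than for LESS, and so it would more strongly update towards $\theta_1$, supporting the claim that Boltzmann would overlearn rather than underlearn when getting a demo from the sparse region.

(Note it does make sense to analyze the effect on the $\theta$ that we update towards, because in reward learning we care primarily about the $\theta$ that we end up having higher probability on.)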
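A minimal numerical sketch of the cluster-level update, assuming just two reward hypotheses and hand-picked cluster sizes and rewards (the function name `posterior_odds` and all numbers are hypothetical, chosen only for illustration). It computes the posterior odds between the two hypotheses after a demonstration lands in a sparse cluster, once with a uniform cluster prior (LESS) and once with a prior proportional to cluster size (Boltzmann).

```python
import math

def posterior_odds(chosen, sizes, rewards_1, rewards_2, model, prior_odds=1.0):
    """Odds p(theta_1 | c) / p(theta_2 | c) after a demonstration lands in cluster `chosen`."""
    if model == "less":
        cluster_prior = [1.0 for _ in sizes]        # p(c) uniform
    elif model == "boltzmann":
        cluster_prior = [float(n) for n in sizes]   # p(c) proportional to |c|
    else:
        raise ValueError(f"unknown model: {model}")
    # Normalizers sum_c p(c) * exp(R_theta(c)) for each reward hypothesis.
    z1 = sum(p * math.exp(r) for p, r in zip(cluster_prior, rewards_1))
    z2 = sum(p * math.exp(r) for p, r in zip(cluster_prior, rewards_2))
    likelihood_ratio = math.exp(rewards_1[chosen] - rewards_2[chosen])
    return prior_odds * likelihood_ratio * (z2 / z1)

# Example A: the demo is in sparse cluster 0, which theta_1 prefers;
# theta_2 prefers the dense cluster 1.
sizes_a = [1, 100]
less_a = posterior_odds(0, sizes_a, [1.0, 0.0], [0.0, 1.0], "less")
boltz_a = posterior_odds(0, sizes_a, [1.0, 0.0], [0.0, 1.0], "boltzmann")

# Example B: theta_1 also likes the dense cluster 1, while theta_2 prefers
# a different sparse cluster 2. The demo is still in sparse cluster 0.
sizes_b = [1, 100, 1]
less_b = posterior_odds(0, sizes_b, [1.0, 1.0, 0.0], [0.0, 0.0, 1.0], "less")
boltz_b = posterior_odds(0, sizes_b, [1.0, 1.0, 0.0], [0.0, 0.0, 1.0], "boltzmann")

print("A:", less_a, boltz_a)
print("B:", less_b, boltz_b)
```

In example A the Boltzmann odds come out larger than the LESS odds (a stronger update towards the better-fitting hypothesis), while in example B they come out smaller. That matches the caveat that overlearning in sparse regions is likely but not guaranteed.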

Consider the latest AUP equation, where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the subscripts to on and .

Consider some starting state , some starting action , and consider the optimal trajectory under that starts with that, which we'll denote as . Define to be the one-step inaction states. Assume that . Since all other actions are optimal for , we have , so the max in the equation above goes away, and the total obtained is:

Since we're considering the optimal trajectory, we have

Substituting this back in, we get that the total for the optimal trajectory is

which... uh... diverges to negative infinity, as long as . (Technically I've assumed that is nonzero, which is an assumption that there is always an action that is better than .)

So, you must prefer the always- trajectory to this trajectory. This means that no matter what the task is (well, as long as it has a state-based reward and doesn't fall into a trap where is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird -- surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.

----

Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?

Let's consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that is guaranteed to be a noop: for any state , we have .

Now, for any trajectory with defined as before, we have , so

As a check, in the case where is optimal, we have

Plugging this into the original equation recovers the divergence to negative infinity that we saw before.

But let's assume that we just do a constant scaling to avoid this divergence:

Then for an arbitrary trajectory (assuming that the chosen actions are no worse than ), we get

The total reward across the trajectory is then

The and are constants and so don't matter for selecting policies, so I'm going to throw them out:

So in deterministic environments with state-based rewards where is a true noop (even the environment doesn't evolve), AUP with constant scaling is equivalent to adding a penalty for some constant ; that is, we're effectively penalizing the agent from reaching good states, in direct proportion to how good they are (according to ). Again, this seems much more like satisficing or quantilization than impact / power measurement.

I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.

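Define the reachability $R(s_1, s_2) = \mathbb{E}_{\tau \sim \pi}\left[\gamma^{n}\right]$, where $\pi$ is the optimal policy for getting from $s_1$ to $s_2$, and $n = |\tau|$ is the length of the trajectory. This is the notion of reachability both in the original paper and the new one.

Then, for the new paper when using a baseline, the future task value $V^*_{\text{future}}(s, s')$ is:

$$\mathbb{E}_{g,\, \tau,\, \tau'}\left[\gamma^{\max(|\tau|, |\tau'|)}\right]$$

where $s'$ is the baseline state and $g$ is the future goal, with $\tau$ and $\tau'$ the optimal trajectories to $g$ from $s$ and $s'$ respectively.

In a deterministic environment, this can be rewritten as:

$$\begin{aligned}
V^*_{\text{future}}(s, s') &= \mathbb{E}_g\left[\gamma^{\max(n, n')}\right] \\
&= \mathbb{E}_g\left[\min(R(s, g), R(s', g))\right] \\
&= \mathbb{E}_g\left[R(s', g) - \max(R(s', g) - R(s, g), 0)\right] \\
&= \mathbb{E}_g\left[R(s', g)\right] - \mathbb{E}_g\left[\max(R(s', g) - R(s, g), 0)\right] \\
&= \mathbb{E}_g\left[R(s', g)\right] - d_{RR}(s, s')
\end{aligned}$$

Here, $n = |\tau|$ and $n' = |\tau'|$, $d_{RR}$ is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.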
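As a sanity check on the deterministic rewrite, here is a small sketch that verifies numerically that the average of $\gamma^{\max(n, n')}$ over goals equals the baseline's average reachability minus the relative reachability penalty. The step counts are made up, and I'm assuming relative reachability is the average over states of $\max(R(s', g) - R(s, g), 0)$, with goals uniform over states.

```python
def reachability(n_steps, gamma):
    """Discounted reachability gamma**n for a goal n steps away (deterministic case)."""
    return gamma ** n_steps

gamma = 0.9
# Hypothetical numbers of steps needed to reach each of four goal states.
steps_from_current = [0, 3, 5, 2]    # from the current state s
steps_from_baseline = [1, 1, 6, 2]   # from the baseline state s'
num_goals = len(steps_from_current)

# Future task value: reward arrives once both the agent and the baseline reach the goal.
future_value = sum(
    gamma ** max(n, n_base)
    for n, n_base in zip(steps_from_current, steps_from_baseline)
) / num_goals

# First term: average reachability from the baseline state only.
baseline_term = sum(reachability(n_base, gamma) for n_base in steps_from_baseline) / num_goals

# Second term: relative reachability, penalizing only goals that became harder to reach.
d_rr = sum(
    max(reachability(n_base, gamma) - reachability(n, gamma), 0.0)
    for n, n_base in zip(steps_from_current, steps_from_baseline)
) / num_goals

print(future_value, baseline_term - d_rr)
```

The two printed numbers agree up to floating-point error, because $\gamma^{\max(n, n')} = \min(\gamma^n, \gamma^{n'})$ for $0 < \gamma < 1$.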

Note that the first term only depends on the number of timesteps, since it only depends on the baseline state s'. So for a fixed time step, the first term is a constant.

The optimal value function in the new paper is (page 3, and using my notation of instead of their ):

.

This is the regular Bellman equation, but with the following augmented reward (here is the baseline state at time t):

Terminal states:

Non-terminal states:

For comparison, the original relative reachability reward is:

The first and third terms in are very similar to the two terms in . The second term in only depends on the baseline.

All of these rewards so far are for finite-horizon MDPs (at least, that's what it sounds like from the paper, and if not, they could be anyway). Let's convert them to infinite-horizon MDPs (which will make things simpler, though that's not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of (to account for the fact that the agent gets that reward infinitely often, rather than just once as in the original MDP). Also define for convenience. Then, we have:

Non-terminal states:

What used to be terminal states that are now self-loop states:

Note that all of the transformations I've done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We're ready for analysis. There are exactly two differences between relative reachability and future state rewards:

First, the future state rewards have an extra term, .

This term depends only on the baseline . For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn't matter.

For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals that involve sushi.

Second, in non-terminal states, relative reachability weights the penalty by instead of . Really since and thus is an arbitrary hyperparameter, the actual big deal is that in relative reachability, the weight on the penalty switches from in non-terminal states to the smaller in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down faster. (This is also clear from the original paper: since it's a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)

Summary: The actual effects of the new paper's framing are that it (1) removes the "extra" incentive to finish the task quickly that relative reachability provided, and (2) adds an extra reward term that does nothing for starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.

(That said, it starts from a very different place than the original RR paper, so it's interesting that they somewhat converge here.)
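The finite-to-infinite-horizon conversion used above (add a self-loop to each terminal state and scale its reward by $1 - \gamma$) can be checked with a short sketch. The reward value and discount here are arbitrary, and the infinite sum is truncated at a large horizon.

```python
def discounted_sum_of_self_loop(reward_per_step, gamma, horizon):
    """Discounted return from sitting in a self-loop state for `horizon` steps."""
    return sum(gamma ** k * reward_per_step for k in range(horizon))

gamma = 0.95
terminal_reward = 7.0  # hypothetical reward received once in the finite-horizon MDP

# In the converted MDP the agent collects (1 - gamma) * reward on every step of the self-loop.
looped_return = discounted_sum_of_self_loop((1 - gamma) * terminal_reward, gamma, horizon=2000)

print(terminal_reward, looped_return)
```

The geometric series $\sum_k \gamma^k (1 - \gamma) r$ sums to $r$, so the self-loop state is worth the same as the original terminal state, which is why the optimal policy is preserved.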

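The LCA paper (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:

$$L(\theta_T) - L(\theta_0) = \sum_{t=0}^{T-1} \left( L(\theta_{t+1}) - L(\theta_t) \right)$$

And then to decompose training loss across specific parameters:

$$L(\theta_{t+1}) - L(\theta_t) = \int_{\vec{\theta}_t}^{\vec{\theta}_{t+1}} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$

I've added vector arrows to emphasize that $\theta$ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We'll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:

$L(\theta_{t+1}) - L(\theta_t) = (\theta_{t+1} - \theta_t) \cdot \text{Average}_t^{t+1}(\nabla L(\theta))$.

(This is pretty standard, but I've included a derivation at the end.)

Since this is a dot product, it decomposes into a sum over the individual parameters:

$$L(\theta_{t+1}) - L(\theta_t) = \sum_i \left(\theta^{(i)}_{t+1} - \theta^{(i)}_t\right) \text{Average}_t^{t+1}(\nabla L(\theta))^{(i)}$$

So, for an individual parameter, and an individual training step, we can define the contribution to the change in loss as

$$A^{(i)}_t = \left(\theta^{(i)}_{t+1} - \theta^{(i)}_t\right) \text{Average}_t^{t+1}(\nabla L(\theta))^{(i)}$$

So based on this, I'm going to define my own version of LCA, called $LCA_{\text{Naive}}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). $LCA_{\text{Naive}}$ uses the approximation $\text{Average}_t^{t+1}(\nabla L(\theta)) \approx G_t$, giving $A^{(i)}_{t,\text{Naive}} = \left(\theta^{(i)}_{t+1} - \theta^{(i)}_t\right) G^{(i)}_t$. But the SGD update is given by $\theta^{(i)}_{t+1} = \theta^{(i)}_t - \alpha G^{(i)}_t$ (where $\alpha$ is the learning rate), which implies that $A^{(i)}_{t,\text{Naive}} = \left(-\alpha G^{(i)}_t\right) G^{(i)}_t = -\alpha \left(G^{(i)}_t\right)^2$, which is never positive (and strictly negative whenever $G^{(i)}_t \neq 0$), i.e. it predicts that every parameter always learns in every iteration. This isn't surprising -- we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!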
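To make the sign argument concrete, here is a tiny sketch of the naive per-parameter contribution under plain SGD. The parameter values, gradient, and learning rate are made up, and `naive_lca` is just my name for "parameter change times the gradient used for the update", not anything from the paper's code.

```python
def sgd_step(theta, grad, lr):
    """Plain SGD: move each parameter against its gradient component."""
    return [p - lr * g for p, g in zip(theta, grad)]

def naive_lca(theta_old, theta_new, grad):
    """Per-parameter contribution, approximating the average gradient by the update gradient."""
    return [(new - old) * g for old, new, g in zip(theta_old, theta_new, grad)]

theta = [0.5, -1.2, 3.0]
grad = [0.3, -2.0, 0.0]   # hypothetical minibatch gradient G_t
lr = 0.1

theta_next = sgd_step(theta, grad, lr)
contributions = naive_lca(theta, theta_next, grad)

print(contributions)
```

Every entry equals $-\alpha (G^{(i)}_t)^2$ up to floating-point error, so none is positive, regardless of the sign of the gradient component.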

Yet, the experiments in the paper sometimes show positive LCAs. What's up with that? There are a few differences between $LCA_{\text{Naive}}$ and the actual method used in the paper:

1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.

2. $LCA_{\text{Naive}}$ approximates the average gradient with the training gradient, which is only calculated on a minibatch of data. LCA uses the loss on the full training dataset.

3. $LCA_{\text{Naive}}$ uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (which gets worse the larger your learning rate / step size is). LCA proper uses multiple estimates between $\theta_t$ and $\theta_{t+1}$ to reduce the approximation error.

I think those are the only differences (though it's always hard to tell if there's some unmentioned detail that creates another difference), which means that whenever the paper says "these parameters had positive LCA", that effect can be attributed to some combination of the above 3 factors.

----

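Derivation of turning the path integral into a dot product with an average:

$$L(\theta_{t+1}) - L(\theta_t) = \int_{\vec{\theta}_t}^{\vec{\theta}_{t+1}} \vec{\nabla} L(\vec{\theta}) \cdot d\vec{\theta} = \int_0^1 \vec{\nabla} L(\vec{\theta}(s)) \cdot \frac{d\vec{\theta}(s)}{ds}\, ds$$

where $\vec{\theta}(s) = \vec{\theta}_t + s\,(\vec{\theta}_{t+1} - \vec{\theta}_t)$ is the linear path, so that $\frac{d\vec{\theta}(s)}{ds} = \vec{\theta}_{t+1} - \vec{\theta}_t$ is constant and can be pulled out of the integral:

$$L(\theta_{t+1}) - L(\theta_t) = (\vec{\theta}_{t+1} - \vec{\theta}_t) \cdot \int_0^1 \vec{\nabla} L(\vec{\theta}(s))\, ds = (\vec{\theta}_{t+1} - \vec{\theta}_t) \cdot \text{Average}_t^{t+1}(\nabla L(\theta))$$

, where the average is defined as $\text{Average}_t^{t+1}(\nabla L(\theta)) = \int_0^1 \vec{\nabla} L(\vec{\theta}(s))\, ds$.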
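The identity can also be checked numerically. This is a sketch on a toy two-parameter loss that I made up (not a neural net), approximating the average gradient along the linear path with a midpoint rule.

```python
import math

def loss(theta):
    x, y = theta
    return x ** 2 * y + math.sin(y)

def grad(theta):
    x, y = theta
    return [2 * x * y, x ** 2 + math.cos(y)]

def average_gradient(theta_a, theta_b, num_samples=1000):
    """Midpoint-rule estimate of the gradient averaged over the straight line from theta_a to theta_b."""
    total = [0.0] * len(theta_a)
    for k in range(num_samples):
        s = (k + 0.5) / num_samples
        point = [a + s * (b - a) for a, b in zip(theta_a, theta_b)]
        total = [t + g for t, g in zip(total, grad(point))]
    return [t / num_samples for t in total]

theta_t = [1.0, 0.5]
theta_next = [0.7, 0.9]   # hypothetical parameters after one update

delta = [b - a for a, b in zip(theta_t, theta_next)]
avg_grad = average_gradient(theta_t, theta_next)
per_parameter = [d * g for d, g in zip(delta, avg_grad)]   # one contribution per parameter
loss_change = loss(theta_next) - loss(theta_t)

print(sum(per_parameter), loss_change)
```

The per-parameter contributions sum to the true change in loss up to the integration error, which is the property that makes the decomposition exact when the average gradient is computed well.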

In my double descent newsletter, I said:

This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...]

This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training "fixes" the weights to memorize noise in a different way that generalizes better. While I can't rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn't "come into effect" after the interpolation threshold.)

One response you could have is to think that this could apply even at training time, because typical loss functions like cross-entropy loss and squared error loss very strongly penalize confident mistakes, and so initially the optimization is concerned with getting everything right, only later can it be concerned with regularization.

I don't buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very high derivative, and so initially in training most of the gradient will be reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are $N$ classes, the worst case loss is when the classes are all equally likely, in which case the average loss per data point is $-\ln \frac{1}{N} = \ln N \approx 2.3$ when $N = 10$ (as for CIFAR-10, which is what their experiments were done on), which is not a good loss value but it does seem like regularization should already start having an effect. This is a really stupid and simple classifier to learn, and we'd expect that the neural net does at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
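For concreteness, a short sketch of the loss achieved by the "predict the class frequencies" classifier. The skewed frequency vector is made up for comparison; only the uniform 10-class case corresponds to the CIFAR-10 number in the text.

```python
import math

def frequency_predictor_loss(class_frequencies):
    """Average cross-entropy (in nats) when always predicting the true class frequencies."""
    return -sum(f * math.log(f) for f in class_frequencies if f > 0)

uniform_10 = [0.1] * 10
skewed_10 = [0.55] + [0.05] * 9   # hypothetical imbalanced dataset

uniform_loss = frequency_predictor_loss(uniform_10)
skewed_loss = frequency_predictor_loss(skewed_10)

print(uniform_loss, math.log(10), skewed_loss)
```

The uniform case gives $\ln 10$, and any imbalance in the class frequencies only lowers the loss of this trivial classifier.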

There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy "overwhelms" the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, this can't be true. When training on just L2 regularization, the gradient descent update is:

$$w \leftarrow w - \alpha \lambda w = (1 - \alpha \lambda)\, w = c\, w$$

for some constant $c$ (here $\alpha$ is the learning rate and $\lambda$ is the regularization strength).

For MLPs and CNNs with relu activations, if you multiply all the weights by a constant, the logits also get multiplied by a constant, no matter what the input is. This means that the train/test error cannot be affected by L2 regularization alone, and so you can't see a double descent on test error in this setting. (This doesn't eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can't happen in the "first train to zero error with cross-entropy and then regularize" setting.)
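A minimal sketch of the scaling property, assuming a bias-free relu MLP (with bias terms the logits are not exactly rescaled when every parameter is multiplied by the same constant). The weights and input are arbitrary; with two weight layers, multiplying all weights by a positive $c$ multiplies the logits by $c^2$, so the predicted class is unchanged.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(matrix, v):
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]

def mlp_logits(weights, x):
    """Bias-free MLP: relu after every layer except the last."""
    h = x
    for layer in weights[:-1]:
        h = relu(matvec(layer, h))
    return matvec(weights[-1], h)

def scale_weights(weights, c):
    return [[[c * w for w in row] for row in layer] for layer in weights]

def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])

# Hypothetical two-layer network and input.
weights = [
    [[0.2, -0.5, 1.0], [-0.3, 0.8, 0.1], [0.7, 0.1, -0.4]],
    [[1.0, -1.0, 0.5], [0.3, 0.2, -0.7]],
]
x = [1.0, -2.0, 0.5]
c = 0.5   # what repeated L2-only updates do: shrink every weight by a common positive factor

logits = mlp_logits(weights, x)
scaled_logits = mlp_logits(scale_weights(weights, c), x)

print(logits, scaled_logits)
print(argmax(logits), argmax(scaled_logits))
```

Since the argmax is preserved for every input, classification error on both train and test sets is untouched by the rescaling, even though the cross-entropy loss does change.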

The paper tests with CNNs, but doesn't mention what activation they use. Still, I'd find it very surprising if double descent only happened for a particular activation function.