All of DanielFilan's Comments + Replies

Did this ever get written up? I'm still interested in it.

2 · Abram Demski · 3h
Ah, not yet, no.
1 · Lawrence Chan · 2d
I don't think they're hiring, but added.

Any reversible effect might be reversed. The question asks about the final effects of the mind

This talk of "reversible" and "final" effects of a mind strikes me as suspicious: for one, in a block / timeless universe, there's no such thing as "reversible" effects, and for another, in the end, it may wash out in an entropic mess! But it does suggest a rephrasing of "a first-order approximation of the (direction of the) effects, understood both spatially and temporally".

1 · Tsvi Benson-Tilsen · 14d
I don't really like the block-universe thing in this context. Here "reversible" refers to a time-course that doesn't particularly have to be physical causality; it's whatever course of sequential determination is relevant. E.g., don't cut yourself off from acausal trades. I think "reversible" definitely needs more explication, but until proven otherwise I think it should be taken on faith that the obvious intuition has something behind it.

Is the idea that the set of "states" is the codomain of gamma?

2 · Sami Petersen · 15d
No, the codomain of gamma is the set of (distributions over) consequences. Hammond's notation is inspired by the Savage framework in which states and consequences are distinct. Savage thinks of a consequence as the result of behaviour or action in some state, though this isn't so intuitively applicable in the case of decision trees. I included it for completeness but I don't use the gamma function explicitly anywhere.

 assigns the set of states that remain possible once a node is reached.

What's bold S here?

1 · Sami Petersen · 15d
It's the set of elementary states. So S ∈ P(bold S) is an event (a subset of elementary states, S ⊆ bold S). E.g., we could have bold S be all the possible worlds; S = {S1, ..., Sn} be the possible worlds in which featherless bipeds evolved; and S7 ∈ S be our actual world.

We argue that a sustained commitment to transparency (e.g. to auditors) would make the RLHF research environment more robust from a safety standpoint.

Do you think this is more true of RLHF than other safety techniques or frameworks? At first blush, I would have thought "no", and the reasoning you provide in this post doesn't seem to distinguish RLHF from other things.

3 · Stephen Casper · 2mo
No, I don't think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.

I think I probably didn't quite word that question right, and that's what's explaining the confusion - I meant something like "Once you've created the AAR, what alignment problems are left to be solved? Please answer in terms of the gap between the AAR and superintelligence."

Note that this paper in section 2.6 claims to demonstrate that some 175 billion parameter RLHF'ed LLM (presumably Claude or Claude 2?) does not use steganography in its CoT. It does this by:

  1. Get a bunch of answers from the model using CoT prompting.
  2. Take prefixes of those CoTs, ask another language model to paraphrase them, have the original model regenerate the rest of the CoT from the paraphrased prefix, and see how accurate the model is now.
  3. Check whether the model loses accuracy on the paraphrased CoTs.

The basic result is that paraphrasing parts of the CoT doesn't appear to reduce accuracy.
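For concreteness, here's a minimal sketch of that evaluation loop. The `model_generate` and `paraphrase` helpers, and the "Answer:" grading convention, are hypothetical stand-ins for whatever LLM API and grading scheme the paper actually used; this is just to make the procedure concrete, not a reproduction of their setup.

```python
# Sketch of the paraphrase-intervention test for CoT steganography.
# `model_generate` and `paraphrase` are hypothetical wrappers around LLM APIs;
# grading assumes the model ends its reasoning with "Answer: <x>".

def extract_answer(text: str) -> str:
    return text.rsplit("Answer:", 1)[-1].strip()

def cot_accuracy(model_generate, paraphrase, problems, cut_fraction=0.5):
    """Compare accuracy of original CoTs vs. CoTs whose prefix was paraphrased."""
    original_correct = paraphrased_correct = 0
    for problem in problems:
        # 1. Get a chain-of-thought answer.
        cot = model_generate(problem["prompt"] + "\nLet's think step by step.")
        original_correct += extract_answer(cot) == problem["answer"]

        # 2. Paraphrase a prefix of the CoT, then regenerate the rest from it.
        cut = int(len(cot) * cut_fraction)
        new_prefix = paraphrase(cot[:cut])
        regenerated = model_generate(problem["prompt"] + "\n" + new_prefix)
        paraphrased_correct += extract_answer(regenerated) == problem["answer"]

    # 3. A large accuracy drop after paraphrasing would suggest the exact
    #    wording of the prefix was carrying hidden information.
    n = len(problems)
    return original_correct / n, paraphrased_correct / n
```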

Toryn Q. Klassen, Parand Alizadeh Alamdari, and Sheila A. McIlraith wrote a paper on the multi-agent AUP thing, framing it as a study of epistemic side effects.

The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are "internally represented" isn't necessary for those results. You're right that the post uses the phrase "internal represent... (read more)

5 · Victoria Krakovna · 3mo
Thanks Daniel, this is a great summary. I agree that internal representation of the reward function is not load-bearing for the claim. The weak form of representation that you mentioned is what I was trying to point at. I will rephrase the sentence to clarify this, e.g. something like "We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts". 

The spark of strategicness, if such a thing is possible, recruits the surrounding mental elements. Those surrounding mental elements, by hypothesis, make goals achievable. That means the wildfire can recruit these surrounding elements toward the wildfire's ultimate ends... Also by hypothesis, the surrounding mental elements don't themselves push strongly for goals. Seemingly, that implies that they do not resist the wildfire, since resisting would constitute a strong push.

I think this is probably a fallacy of composition (maybe in the reverse direction ... (read more)

3 · Tsvi Benson-Tilsen · 3mo
Good point, though I think it's a non-fallacious enthymeme. Like, we're talking about a car that moves around under its own power, but somehow doesn't have parts that receive, store, transform, and release energy and could be removed? Could be. The mind could be an obscure mess where nothing is factored, so that a cancerous newcomer with read-write access can't get any work out of the mind other than through the top-level interface. I think that explicitness (https://www.lesswrong.com/posts/KuKaQEu7JjBNzcoj5/explicitness) is a very strong general tendency (cline) in minds, but if that's not true then my first reason for believing the enthymeme's hidden premise is wrong.

An attempt at rephrasing a shard theory critique of utility function reasoning, while restricting myself to things I basically agree with:

Yes, there are representation theorems that say coherent behaviour is optimizing some utility function. And yes, for the sake of discussion let's say this extends to reward functions in the setting of sequential decision-making (even tho I don't remember seeing a theorem for that). But: just because there's a mapping, doesn't mean that we can pull back a uniform measure on utility/reward functions to get a reasonable mea... (read more)

Not having read other responses, my attempt to answer in my own words: what goes wrong is that there are tons of possible cognitive influences that could be reinforced by rewards for making people smile. E.g. "make things of XYZ type think things are going OK", "try to promote physical configurations like such-and-such", "trying to stimulate the reinforcer I observe in my environment". Most of these decision-influences, when extrapolated to coherent behaviour where those decision-influences drive the course of the behaviour, lead to resource-gathering and ... (read more)

I think you're making a mistake: policies can be reward-optimal even if there's no obvious box labelled "reward" whose outputs they're optimal with respect to. Similarly, the formalism of "reward" can be useful even if this box doesn't exist, or even if the policy isn't behaving the way you would expect if you identified that box with the reward function. To be fair, the post sort of makes this mistake by talking about "internal representations", but I think everything goes thru if you strike out that talk.

The main thing I want to talk about

I... (read more)
2 · Alex Turner · 3mo
In addition to my other comment, I'll further quote Behavioural statistics for a maze-solving agent:
3 · Alex Turner · 4mo
Can you say more? Maybe give an example of what this looks like in the maze-solving regime?

This is a fair question, because I left a lot to the reader. I'll clarify now. I was not claiming that you can't, after the fact, rationalize observed behavior using the extremely flexible reward-maximization framework. I was responding to the specific claim of assuming internal representation of a 'training-compatible' reward function. In evaluating this claim, we shouldn't just see whether this claim is technically compatible with empirical results, but we should instead reason probabilistically. How strongly does this claim predict observed data, relative to other models of policy formation?

In the maze setting, the cheese was always in the top-right 5x5 corner. The reward was sparse and only used to update the network when the mouse hit the cheese. The "training compatible goal set" is unconstrained on the test set. An example element might agree with the training reward on the training distribution, and then outside of the training distribution, assign 1 reward iff the mouse is on the bottom-left square. The vast majority of such unconstrained functions will not involve pursuing cheese reliably across levels, and most of these reward functions will not be optimized by going to the top-right part of the maze. So this "training-compatible" hypothesis barely assigns any probability to the observed generalization of the network.

However, other hypotheses -- like "the policy develops motivations related to obvious correlates of its historical reinforcement signals"[1] -- predict things like "the policy tends to go to the top-right 5x5, and searches for cheese more strongly once there." I registered such a prediction before seeing any of the generalization behavior. This hypothesis assigns high probability to the observed results. So this paper's assumption is simply losing out in a predictive sense, and that's what I was critiquing. One can nearly always rationalize
2 · Alex Turner · 4mo
I'm responding to this post, so why should I strike that out?  The post is talking about internal representations.

I guess this doesn't fit with the use in the Truthful AI paper that you quote. Also in that case I have an objection that only punishing for negligence may incentivize an AI to lie in cases where it knows the truth but thinks the human thinks the AI doesn't/can't know the truth, compared to a "strict liability" regime.

I wonder if this use of "fair" is tracking (or attempting to track) something like "this problem only exists in an unrealistically restricted action space for your AI and humans - in worlds where it can ask questions, and we can make reasonable preparation to provide obviously relevant info, this won't be a problem".

Possibly, but in at least one of the two cases I was thinking of when writing this comment (and maybe in both), I made the argument in the parent comment and the person agreed and retracted their point. (I think in both cases I was talking about deceptive alignment via goal misgeneralization.)


I think you might be interpreting the break after the sentence "Their results are further evidence for feature linearity and internal activation robustness in these models." as the end of the related work section? I'm not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.

Yes, I was--good catch. Both earlier and now, the unusual formatting and nonstandard related-works section caused confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related works and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."

Is this a big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, an... (read more)

Alignment is an unusual field because the base of fans and supporters is much larger than the number of researchers

Isn't this entirely usual? Like, I'd assume that there are more readers of popular physics books than working physicists. Similarly for nature documentary viewers vs biologists.

5 · Lawrence Chan · 2mo
I think the deciding difference is that the number of fans and supporters who want to be actively involved and who think the problem is the most important in the world is much larger than the number of researchers; while popular physics book readers and nature documentary viewers are plentiful, I doubt most of them feel a compelling need to become involved!
3 · Neel Nanda · 5mo
Maybe in contrast to other fields of ML? (Though that's definitely stopped being true for eg LLMs)

In this same example, these interpretations are then validated by using a different interpretability tool – test set exemplars. This begs the question of why we shouldn’t just use test set exemplars instead.

Doesn't Olah et al (2017) answer this in the "Why visualize by optimization" section, where they show a bunch of cases where neurons fire on similar test set exemplars, but visualization by optimization appears to reveal that the neurons are actually 'looking for' specific aspects of those images?

3 · Stephen Casper · 6mo
I buy this value -- FV can augment exemplars. And I have never heard anyone ever say that FV is just better than exemplars. Instead, I have heard the point that FV should be used alongside exemplars. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less intuition would have made for a much stronger approach than what happened.

I guess this proves the superiority of the mechanistic interpretability technique "note that it is mechanistically possible for your model to say that things are gorillas" :P

Re: the gorilla example, seems worth noting that the solution that was actually deployed ended up being refusing to classify anything as a gorilla, at least as of 2018 (perhaps things have changed since then).


BTW: the way I found that first link was by searching the title on google scholar, finding the paper, and clicking "All 5 versions" below (it's right next to "Cited by 7" and "Related articles"). That brought me to a bunch of versions, one of which was a seemingly-ungated PDF. This will probably frequently work, because AI researchers usually make their papers publicly available (at least in pre-print form).

1 · DaemonicSigil · 6mo
Thanks for the link! Looks like they do put optimization effort into choosing the subspace, but it's still interesting that the training process can be factored into 2 pieces like that.

To tie up this thread: I started writing a more substantive response to a section but it took a while and was difficult and I then got invited to dinner, so probably won't get around to actually writing it.

I don't want to get super hung up on this because it's not about anything Yudkowsky has said but:

Consider the whole transformed line of reasoning:

avian flight comes from a lot of factors; you can't just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.

IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was:

human intelligence/alignment comes from a lot of factors; you can't just ape one of the factors an

... (read more)

This is a valid point, and that's not what I'm critiquing. I'm critiquing how he confidently dismisses ANNs

I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn't strike me as a damning prediction.

Matters would be different if he said in the quotes you cite "you only get these human-like properties by very exactly mimicking the human brain", but he doesn't.

Didn't he? He at least confidently rules out a very large class of modern approaches.

Co... (read more)

This comment doesn't really engage much with your post - there's a lot there and I thought I'd pick one point to get a somewhat substantive disagreement. But I ended up finding this question and thought that I should answer it.


But have you ever, even once in your life, thought anything remotely like "I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex's predictive accuracy."?

I think I've been in situations where I've been disoriented by a bunch of random stuff happening and wished that less of it was happening so that I could get a better handle on stuff. An example I vividly recall was being in a history class in high school and being very bothered by the large number of conversations happening around me.


I don't really get your comment. Here are some things I don't get:

  • In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.
  • Large ANNs don't appear to me to be intelligent because of their similarity to human brains - they appear to me to be intelligent because they're able to be t
... (read more)

Edited to modify confidences about interpretations of EY's writing / claims.

In "Failure By Analogy" and "Surface Analogies and Deep Causes", the point being made is "X is similar in aspects A to thing Y, and X has property P" does not establish "Y has property P". The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.

This is a valid point, and that's not what I'm critiquing in that portion of the comment. I'm critiquing how -- on my read -- he confidently dismisses ... (read more)

I no longer endorse this claim about what the orthogonality thesis says.

But given that good, automated mechanistic hypothesis generation seems to be the only hope for scalable MI, it may be time for TAISIC to work on this in earnest. Because of this, I would argue that automating the generation of mechanistic hypotheses is the only type of MI work TAISIC should prioritize at this point in time.

"Automating" seems like a slightly too high bar here, given how useful human thoughts are for things. IMO, a better frame is that we have various techniques for combining human labour and algorithmic computation to generate hypothes... (read more)

I also want to add that I really like the use of the prediction UI in this post.

Ah, you make this point here:

However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters.

Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. [Bolding by DanielFila

... (read more)
1 · AdamGleave · 6mo
Right: if the agent has learned an inner objective of "do things similar to what humans do in the world at the moment I am currently acting", then it'd definitely be incentivized to do that. It's not rewarded by the outer objective for e.g. behavioral cloning on a fixed dataset, as installing a bunch of cameras would be punished by that loss (not something humans do), and changing human behavior wouldn't help as BC would still be on the dataset of pre-manipulation demos. That might be little comfort if you're worried about inner optimization, but most of the other failures described happen even in the outer alignment case.

That said, if you had a different imitation learning setup that was something like doing RL on a reward of "do the same thing one of our human labelers chooses given the same state", then the outer objective would reward the behavior you describe. It'd be a hard exploration problem for the agent to learn to exploit the reward in that way, but it quite probably could do so if situationally aware.

Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization.

Maybe I'm missing something obvious here, but for advanced AI systems under monitoring that happen to be misaligned, won't checking for the presence of mesa-optimization be relevant to the main system's goal, in cases where the results of those checks matter for whether to re-train or deploy the AI?


Note that the model does not have black or white stones as a concept, and instead only thinks of the stones as “own’s stones” and “opponent’s stones”, so we can do this without loss of generality.

 

I'm confused how this can be true - surely the model needs to know which player is black and which player is white to know how to incorporate komi, right?

4 · polytope · 7mo
There's (a pair of) binary channels that indicate whether the acting player is receiving komi or paying it. (You can also think of this as a "player is black" versus "player is white" indicator, but interpreting it as komi indicators is equivalent and is the natural way you would extend Leela Zero to operate on different komi without having to make any changes to the architecture or input encoding.) In fact, you can set the channels to fractional values strictly between 0 and 1 to see what the model thinks of a board state given reduced komi or no-komi conditions.

Leela Zero is not trained on any value other than the 0 or 1 endpoints corresponding to komi +7.5 or komi -7.5 for the acting player, so there is no guarantee that the behavior for fractional values is reasonable, but I recall people found that many of Leela Zero's models do interpolate their output for the fractional values in a not totally unreasonable way! If I recall right, it tended to be the smaller models that behaved well, whereas some of the later and larger models behaved totally nonsensically for fractional values.

If I'm not mistaken about that, then as a total guess, perhaps that's because later and larger models have more degrees of freedom with which to fit/overfit arbitrarily, giving rise to non-interpolating behavior in between, and/or have more extreme differences in activations at the endpoints that constrain the middle less and give it more room to wiggle and do bad non-monotone things.
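As a schematic of the input-plane idea described above (not Leela Zero's actual code), here's a toy encoder in which the komi-indicator planes are binary at training time but can be filled with fractional values when probing the network:

```python
import numpy as np

# Schematic input encoding: board planes plus two komi-indicator planes.
# During training the komi planes are binary (1/0 for "receiving komi" vs.
# "paying komi"); at inference time you can fill them with fractional values
# to probe how the network interpolates. This is an illustration only.

def encode_position(own_stones, opp_stones, receiving=1.0, paying=0.0):
    board = np.stack([own_stones, opp_stones]).astype(np.float32)    # 2 x 19 x 19
    komi_planes = np.stack([
        np.full((19, 19), receiving, dtype=np.float32),
        np.full((19, 19), paying, dtype=np.float32),
    ])
    return np.concatenate([board, komi_planes])                      # 4 x 19 x 19

# Trained endpoints: (receiving, paying) = (1, 0) or (0, 1).
# A no-komi probe might set both planes to 0.5, outside the training range.
x = encode_position(np.zeros((19, 19)), np.zeros((19, 19)), receiving=0.5, paying=0.5)
print(x.shape)  # (4, 19, 19)
```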

You can now watch a short video of an excerpt from this episode (an axrpt?)!

Am I right that this algorithm is going to visit each "important" node once per path from that node to the output? If so, that could be pretty slow given a densely-connected interpretation, right?

3 · Lawrence Chan · 8mo
Yep, this is correct - in the worst case, you could have performance that is exponential in the size of the interpretation. (Redwood is fully aware of this problem and there have been several efforts to fix it.)
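To make the combinatorics concrete, here's a toy illustration (not Redwood's actual algorithm): in a fully-connected layered DAG, the number of input-to-output paths, and hence the number of per-path node visits, grows exponentially with depth.

```python
# Toy illustration of why visiting a node once per path blows up: count
# input-to-output paths in a fully-connected layered DAG with `width` nodes
# per hidden layer and `depth` hidden layers.

def num_paths(width: int, depth: int) -> int:
    paths_to = [1] * width                  # paths from the input to each node in layer 1
    for _ in range(depth - 1):
        total = sum(paths_to)               # every node in the next layer connects to all of them
        paths_to = [total] * width
    return sum(paths_to)                    # all remaining paths converge on the single output

for d in range(1, 6):
    print(d, num_paths(width=8, depth=d))   # 8, 64, 512, ... i.e. width ** depth
```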

Empirically, human toddlers are able to recognize apples by sight after seeing maybe one to three examples. (Source: people with kids.)

Wait but they see a ton of images that they aren't told contain apples, right? Surely that should count. (Probably not 2^big_number bits tho)

3 · johnswentworth · 8mo
Yes! There's two ways that can be relevant.

First, a ton of bits presumably come from unsupervised learning of the general structure of the world. That part also carries over to natural abstractions/minimal latents: the big pile of random variables from which we're extracting a minimal latent is meant to represent things like all those images the toddler sees over the course of their early life.

Second, sparsity: most of the images/subimages which hit my eyes do not contain apples. Indeed, most images/subimages which hit my eyes do not contain instances of most abstract object types. That fact could either be hard-coded in the toddler's prior, or learned insofar as it's already learning all these natural latents in an unsupervised way and can notice the sparsity. So, when a parent says "apple" while there's an apple in front of the toddler, sparsity dramatically narrows down the space of things they might be referring to.

As I understand it, the EA forum sometimes idiosyncratically calls this philosophy [rule consequentialism] "integrity for consequentialists", though I prefer the more standard term.

AFAICT in the canonical post on this topic, the author does not mean "pick rules that have good consequences when I follow them" or "pick rules that have good consequences when everyone follows them", but rather "pick actions such that if people knew I was going to pick those actions, that would have good consequences" (with some unspecified tweaks to cover places where that ... (read more)

1 · Andrew Critch · 9mo
Ah, thanks for the correction!  I've removed that statement about "integrity for consequentialists" now.

In reality though, I think people often just believe stuff because people nearby them believe that stuff

IMO, a bigger factor is probably people thinking about topics that people nearby them think about, and having the primary factors that influence their thoughts be the ones people nearby focus on.

2 · Andrew Critch · 9mo
I agree this is a big factor, and might be the main pathway through which people end up believing what people believe they believe. If I had to guess, I'd guess you're right. E.g., if there's evidence E in favor of H and evidence E' against H, and the group is really into thinking about and talking about E as a topic, then the group will probably end up believing H too much.

I think it would be great if you or someone wrote a post about this (or whatever you meant by your comment) and pointed to some examples. I think the LessWrong community is somewhat plagued by attentional bias leading to collective epistemic blind spots. (Not necessarily more than other communities; just different blind spots.)

One reason that I doubt this story is that "try new things in case they're good" is itself the sort of thing that should be reinforced during training on a complicated environment, and would push towards some sort of obfuscated manipulation of humans (similar to how if you read about enough social hacks you'll probably be a bit scammy even tho you like people and don't want to scam them). In general, this motivation will push RL agents towards reward-optimal behaviour on the distribution of states they know how to reach and handle.

3 · Alex Turner · 1y
IDK if this is causally true or just evidentially true. I also further don't know why it would be mechanistically relevant to the heuristic you posit. Rather, I think that agents might end up with this heuristic at first, but over time it would get refined into "try new things which [among other criteria] aren't obviously going to cause bad value drift away from current values."

One reason I expect the refinement in humans is that noticing your values drifted in a bad way is probably a negative reinforcement event, and so enough exploration-caused negative events might cause credit assignment to refine the heuristic into the shape I listed. This would convergently influence agents to not be reward-optimal, even on known-reachable states. (I'm not super confident in this particular story porting over to AI, but think it's a plausible outcome.)

If that kind of heuristic is a major underpinning of what we call "curiosity" in humans, then that would explain why I am, in general, not curious about exploring a life of crime, but am curious about math and art and other activities which won't cause bad value drift away from my current values.

Actually I'm being silly, you don't need ring signatures, just signatures that are associated with identities and also used for financial transfers.

Note that for this to work you need a strong disincentive against people sharing their private keys. One way to do this would be if the keys were also used for the purpose of holding cryptocurrency.

Here's one way you can do it: Suppose we're doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means you prove that you're using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you're using one of the known public keys but not which one you are, and (b) the world can tell that the key hash you used corresponds to the public key you 'committed' to when writing the proof.
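And a minimal sketch of the simpler scheme from the two comments above (one keypair per person, used both to sign posts and to hold funds), using the python-ecdsa and hashlib libraries. The address derivation here is purely illustrative, not any real cryptocurrency's scheme:

```python
import hashlib
from ecdsa import SigningKey, SECP256k1, BadSignatureError

# One keypair per person. The same key signs forum posts and "holds" funds,
# so sharing the private key means sharing control of your money.
sk = SigningKey.generate(curve=SECP256k1)    # private key (keep secret)
vk = sk.get_verifying_key()                  # public key (published identity)

# Illustrative "address": a hash of the public key. Real cryptocurrencies
# use more elaborate derivations; this just shows the coupling.
address = hashlib.sha256(vk.to_string()).hexdigest()

# Signing a post proves it came from the holder of that address.
post = b"I think this argument about mesa-optimizers is wrong because..."
signature = sk.sign(post)

# Anyone can verify the post against the published public key.
try:
    vk.verify(signature, post)
    print(f"Post verified as coming from address {address[:16]}...")
except BadSignatureError:
    print("Signature invalid")
```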


Relevant quote I just found in the paper "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents":

The primary measure of an agent’s performance is the score achieved during an episode, namely the undiscounted sum of rewards for that episode. While this performance measure is quite natural, it is important to realize that score, in and of itself, is not necessarily an indicator of AI progress. In some games, agents can maximize their score by “getting stuck” in a loop of “small” rewards, ignoring what human p

... (read more)

Here's a project idea that I wish someone would pick up (written as a shortform rather than as a post because that's much easier for me):

  • It would be nice to study competent misgeneralization empirically, to give examples and maybe help us develop theory around it.
  • Problem: how do you measure 'competence' without reference to a goal??
  • Prior work has used the 'agents vs devices' framework, where you have a distribution over all reward functions, some likelihood distribution over what 'real agents' would do given a certain reward function, and do Bayesian i
... (read more)
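To gesture at what the "agents vs devices" inference in the bullet above involves, here's a toy sketch of Bayesian inference over a small discrete set of candidate reward functions with Boltzmann-rational likelihoods (my gloss on the idea, not code from the prior work):

```python
import numpy as np

# Toy "agents vs devices"-style inference: posterior over candidate reward
# functions given observed actions, assuming Boltzmann-rational agents.

def boltzmann_likelihood(q_values, action, beta=2.0):
    """P(action | reward) for a Boltzmann-rational agent with the given Q-values."""
    logits = beta * np.asarray(q_values)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action]

# Q-values of 3 actions under each of 2 candidate reward functions (made up).
q_under_reward = {
    "reach_goal":   [1.0, 0.2, 0.1],
    "avoid_danger": [0.1, 0.9, 0.8],
}
prior = {"reach_goal": 0.5, "avoid_danger": 0.5}

observed_actions = [0, 0, 1]  # mostly action 0
posterior = dict(prior)
for a in observed_actions:
    for r in posterior:
        posterior[r] *= boltzmann_likelihood(q_under_reward[r], a)
z = sum(posterior.values())
posterior = {r: p / z for r, p in posterior.items()}
print(posterior)  # higher mass on "reach_goal"
```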


Here is an example story I wrote (that has been minorly edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. Just to be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views, just a plausible-to-TurnTrout story for how agents won't end up wanting to game human approval:

  • Agent gets trained on a reward
... (read more)