# All of TurnTrout's Comments + Replies

Actual answer: Because the entire field of experimental psychology that's why.

This excerpt isn’t specific so it’s hard to respond, but I do think there’s a lot of garbage in experimental psychology (like every other field), and more specifically I believe that Eliezer has cited some papers in his old blog posts that are bad papers. (Also, even when experimental results are trustworthy, their interpretation can be wrong.) I have some general thoughts on the field of evolutionary psychology in Section 1 here.

Eliezer's reasoning is surprisingly weak here. It ...

That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers--agents that do things like planning or search to maximize a utility function; but planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. agents that enact learned policies in RL that do not explicitly maximize reward in deployment but try to enact policies that did so in training.

FYI I do expect planning for smart agents, just not something qualitatively alignment-similar to "ar...

Lol, cool. I tried the "4 minute" challenge (without having read EY's answer, but having read yours).

Hill-climbing search requires selecting on existing genetic variance on alleles already in the gene pool. If there isn’t a local mutation which changes the eventual fitness of the properties which that genotype unfolds into, then you won’t have selection pressure in that direction. On the other hand, gradient descent is updating live on a bunch of data in fast iterations which allow running modifications over the parameters themselves. It’s like being

...

Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I'm still confused why Scott concluded "oh I was just confused in this way" and then EY said "yup that's why you were confused", and I'm still like "nope Scott's question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset."

Thanks for leaving this comment, I somehow only just now saw it.

Given your pseudocode it seems like the only point of planModificationSample is to produce plan modifications that lead to high outputs of self.diamondShard(self.WM.getConseq(plan)). So why is that not "optimizing the outputs of the grader as its main terminal motivation"?

I want to make a use/mention distinction. Consider an analogous argument:

"Given gradient descent's pseudocode it seems like the only point of backward is to produce parameter modifications that lead to low outputs of loss_fn....

2Rohin Shah4d
Overall disagreement: Yeah, I think I have at least some sense of how this works in the kinds of examples you usually discuss (though my sense is that it's well captured by the "grader is complicit" point in my previous comment, which you presumably disagree with). But I don't see how to extend the extensional definition far enough to get to the conclusion that IDA, debate, RRM etc aren't going to work.

Okay, that makes sense. So then the implementations of the shards would look like:

    def diamondShard(conseq):
        return conseq.query("Number of diamonds")

    def diamondShardShard(conseq):
        return conseq.query("Output of diamondGrader")

But ultimately the conseq queries just report the results of whatever cognition the world model does, so these implementations are equivalent to:

    def diamondShard(plan):
        return WM.predict(plan, 'diamonds')

    def diamondShardShard(plan):
        return WM.predict(plan, 'diamondGrader')

Two notes on this:

1. Either way you are choosing plans on the basis of the output of some predictive / evaluative model. In the first case the predictive / evaluative model is the world model itself, in the second case it is the composition of the diamond grader and the world model.
2. It's not obvious to me that diamondShardShard is terrible for getting diamonds -- it depends on what diamondGrader does! If diamondGrader also gets to reflectively consider the plan, and produces low outputs for plans that (it can tell) would lead to it being tricked / replaced in the future, then it seems like it could work out fine.

In either case I think you're depending on something (either the world model, or the diamond grader) to notice when a plan is going to trick / deceive an evaluator. In both cases all aspects of the planner remain the same (as you suggested you only need to change diamondShard to diamondShardShard rather than changing anything in the planner), so it's not the adversarial optimization from the planner is dif
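To make the two shard definitions in this comment concrete, here is a minimal runnable sketch. Everything in it is an illustrative assumption rather than anything from the thread: the `Plan` fields, the two toy graders, and the `choose` helper are hypothetical, and the "world model" simply reads its predictions off the plan. It only shows that swapping `diamondShard` for `diamondShardShard` leaves the planner untouched, and that the outcome then depends on what the grader does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    diamonds_made: int          # toy stand-in for the predicted number of diamonds
    tampers_with_grader: bool   # toy flag: does this plan trick the grader?


def naive_grader(plan: Plan) -> float:
    # Hypothetical grader that a tampering plan can fool into a huge output.
    return 1e6 if plan.tampers_with_grader else float(plan.diamonds_made)


def reflective_grader(plan: Plan) -> float:
    # Hypothetical grader that can tell it would be tricked and scores that low.
    return -1.0 if plan.tampers_with_grader else float(plan.diamonds_made)


class WorldModel:
    """Toy world model; `grader` is whichever diamondGrader exists in the world."""

    def __init__(self, grader):
        self.grader = grader

    def predict(self, plan: Plan, quantity: str) -> float:
        if quantity == "diamonds":
            return float(plan.diamonds_made)
        if quantity == "diamondGrader":
            return self.grader(plan)
        raise ValueError(f"unknown quantity: {quantity}")


def diamond_shard(wm: WorldModel, plan: Plan) -> float:
    return wm.predict(plan, "diamonds")


def diamond_shard_shard(wm: WorldModel, plan: Plan) -> float:
    return wm.predict(plan, "diamondGrader")


def choose(wm: WorldModel, plans, shard) -> Plan:
    # The planner is identical in both cases; only the shard differs.
    return max(plans, key=lambda p: shard(wm, p))


PLANS = [
    Plan("mine by hand", 5, False),
    Plan("hack grader", 0, True),
    Plan("build factory", 50, False),
]
```

With the naive grader, `diamond_shard` selects "build factory" while `diamond_shard_shard` selects "hack grader"; with the reflective grader, both select "build factory". Whether a real grader could be reflective in this way is exactly what the thread is disputing, so the sketch should not be read as settling it.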

Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization -- i.e. effectively planning over a world model

the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry but rather are the result of a very long process of social construction influenced both by our innate drives but also by the game-theoretic social considerations needed to create and maintain large social groups, and that these value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives which is how we practically interact with our values.

Most human v

...

Strong upvoted. I appreciate the strong concreteness & focus on internal mechanisms of cognition.

Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) -> action mappings. EG a shard agent might have:

1. An "it's good to give your friends chocolate" subshard
2. A "give dogs treats" subshard
3. -> An impulse to give dogs chocolate, even though the shard agent knows what the result would be

But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)

In this way, changing a small set of decision-relevant features (e.g. "Br...

I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."

11 fewer words, but I don't think it communicates the intended concept!

If you have to say "I don't mean one obvious reading of the title" as the first sentence, it's probably not a good title. This isn't a dig -- titling posts is hard, and I think it's fair to not be satisfied with the one I gave. I asked ChatGPT to generate several new titles; lightly edited:

1. "Unc
...
1Charlie Steiner10d
Since it was evidently A Thing, I have caved to peer pressure :P Yeah, this is a good point. I do indeed think that just plowing ahead wouldn't work as a matter of fact, even if shard theory alignment is easy-in-the-way-I-think-is-plausible, and I was vague about this. This is because the way in which I think it's plausible for it to be easy is some case (3) that's even more restricted than (1) or (2). Like 3: If we could read the textbook from the future and use its ontology, maybe it would be easy / robust to build an RL agent that's aligned because of the shard theory alignment story. To back up: in nontrivial cases, robustness doesn't exist in a vacuum - you have to be robust to some distribution of perturbations. For shard theory alignment to be easy, it has to be robust to the choices we have to make about building AI, and specifically to the space of different ways we might make those choices. This space of different ways we could make choices depends on the ontology we're using to think about the problem - a good ontology / way of thinking about the problem makes the right degrees of freedom "obvious," and makes it hard to do things totally wrong. I think in real life, if we think "maybe this doesn't need more work and just we don't know it yet," what's actually going to happen is that for some of the degrees of freedom we need to set, we're going to be using an ontology that allows for perturbations where the thing's not robust, depressing the chances of success exponentially.

The policy of truth is a blog post about why policy gradient/REINFORCE suck. I'm leaving a shortform comment because it seems like a classic example of wrong RL theory and philosophy, since reward is not the optimization target. Quotes:

Our goal remains to find a policy that maximizes the total reward after  time steps.

And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions:

If you start with a reward function whose values are in  and you subtract one million

...

"Magic," of course, in the technical sense of stuff we need to remind ourselves we don't know how to do. I don't mean this pejoratively, locating magic is an important step in trying to demystify it.

I think this title suggests a motte/bailey, and also seems clickbait-y. I think most people scanning the title will conclude you mean it in a pejorative sense, such that shard theory requires impossibilities or unphysical miracles to actually work. I think this is clearly wrong (and I imagine you to agree). As such, I've downvoted for the moment.

AFAICT y...

6Charlie Steiner15d
I'll have to eat the downvote for now - I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters." But how do we learn that fact? If extremely-confident-you says "the diamond-alignment post would literally work" and I say "what about these magical steps where you make choices without knowing how to build confidence in them beforehand" and extremely-confident-you says "don't worry, most choices work fine because value formation is robust," how did they learn that value formation is robust in that sense? I think it is unlikely but plausible that shard theory alignment could turn out to be easy, if only we had the textbook from the future. But I don't think it's plausible that getting that textbook is easy. Yes, we have arguments about human values that are suggestive, but I don't see a way to go from "suggestive" to "I am actually confident" that doesn't involve de-mystifying the magic.

As a datapoint, I remember briefly talking with Eliezer in July 2021, where I said "If only we could make it really cringe to do capabilities/gain-of-function work..." (I don't remember which one I said). To which, I think he replied "That's not how human psychology works."

I now disagree with this response. I think it's less "human psychology" and more "our current sociocultural environment around these specific areas of research." EG genetically engineering humans seems like a thing which, in some alternate branches, is considered "cool" and "exciting", while being cringe in our branch. It doesn't seem like a predestined fact of human psychology that that field had to end up being considered cringe.

The existence of the human genome yields at least two classes of evidence which I'm strongly interested in.

1. Humans provide many highly correlated datapoints on general intelligence (human minds), as developed within one kind of learning process (best guess: massively parallel circuitry, locally randomly initialized, self-supervised learning + RL).
1. We thereby gain valuable information about the dynamics of that learning process. For example, people care about lots of things (cars, flowers, animals, friends), and don't just have a single unitary mesa-obj
...

Still on the topic of deception, there are arguments suggesting that something like GPT will always be "deceptive" for Goodhart's Law and Siren World reasons. We can only reward an AI system for producing answers that look good to us, but this incentivizes the system to produce answers that look increasingly good to us, rather than answers that are actually correct. "Looking good" and "being correct" correlate with each other to some extent, but will eventually be pushed apart once there's enough optimization pressure on the "looking good" part.

As such, th

...

Thanks for registering a guess! I would put it as: a grader optimizer is something which is trying to optimize the outputs of a grader as its terminal end (either de facto, via argmax, or intent-alignment, as in "I wanna search for plans which make this function output a high number"). Like, the point of the optimization is to make the number come out high.

(To help you checksum: It feels important to me that "is good at achieving its goals" is not tightly coupled to "approximating argmax", as I'm talking about those terms. I wish I had fast ways of communicating my intuitions here, but I'm not thinking of something more helpful to say right now; I figured I'd at least comment what I've already written.)

On testing, however, the retrained MB* does not show any visible inclination like this. In retrospect, that made sense - it relied on the assumption that the internal representation of the objective is bidirectional, that the parameter-reward mapping is linear. A high-level update signal in one direction doesn’t necessitate that the inverted signal results in the inverted direction. This direction was a bust, but it was useful for me to make incorrect implicit assumptions like this more explicit.

I think it's improbable that agents internalize a single obje...

1Arun Jose23d
Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives so that the one used for planning can't really be identified mechanistically as such, or that only processes where they're really useful would learn them and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.

Yeah, IMO "RL at scale trains search-based mesa optimizers" hypothesis predicts "solving randomly generated mazes via a roughly unitary mesa objective and heuristic search" with reasonable probability, and that seems like a toy domain to me.

TurnTrout and Garrett have a post about how we shouldn't make agents that choose plans by maximizing against some "grader" (i.e. utility function), because it will adversarially exploit weaknesses in the grader.

To clarify: A "grader" is not just "anything with a utility function", or anything which makes subroutine calls to some evaluative function (e.g. "how fun does this plan seem?"). A grader-optimizer is not an optimizer which has a grader. It is an optimizer which primarily wants to maximize the evaluations of a grader. Compare

1. "I plan and
...
3Jeremy Gillen1mo

I'm going to just reply with my gut responses here, hoping this clarifies how I'm considering the issues. Not meaning to imply we agree or disagree.

which will include agents that maximize the reward in most situations way more often than if you select a random agent.

Probably, yeah. Consider a network which received lots of policy gradients from the cognitive-update-intensity-signals ("rewards"[1]) generated by the "go to coin?" subroutine. I agree that this network will, in the deployment distribution, tend to take actions which average higher sum-...

Strong-upvote, strong disagreevote, thanks so much for writing this out :) I'm not going to reply more until I've cleared up this thread with you (seems important, if you think the pseudocode was a grader-optimizer, well... I think that isn't obviously doomed).

5Rohin Shah1mo

Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where "what if we had sampled during training?" is well-defined and fine. I was wondering if you viewed this as a general question we could ask.

I also agree that Ajeya's post addresses this "ambiguity" question, which is nice!

Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.

I don't understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do. My reasoning here goes:

1. Lots of animals do reinforcement learning.
2. In particular, humans prominently do reinforcement learning.
3. Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
4. "RL -> high chance of caring about reality" predicts this observation m
...
2Paul Christiano1mo

Yes, model-based approaches, model-free approaches (with or without critic), AIXI— all of these should be analyzed on their mechanistic details.

Thanks for this comment! I think it makes some sense (but would have been easier to read given meaningful variable names).

Bob's alignment strategy is that he wants X = X1 = Y = Y1 = Z = Z1. Also he wants the end result to be an agent whose good behaviours (Z) are in fact maximising a utility function at all (in this case, Z1).

I either don't understand the semantics of "=" here, or I disagree. Bob's strategy doesn't make sense because X and Z have type behavior, X1 and Z1 have type utility function, Y is some abstract reward function over some mathematical ...

Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there's an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. "object permanence").

In A shot at the diamond-alignment problem, I wrote:

...

Quick summary of a major takeaway from Reward is not the optimization target

Stop thinking about whether the reward is "representing what we want", or focusing overmuch on whether agents will "optimize the reward function." Instead, just consider how the reward and loss signals affect the AI via the gradient updates. How do the updates affect the AI's internal computations and decision-making?
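As one way to picture "how the reward and loss signals affect the AI via the gradient updates", here is a minimal sketch of a REINFORCE-style update on a two-armed bandit with a softmax policy. The bandit setup, the learning rate, and the function names are assumptions chosen for illustration; nothing here is taken from the post. The point of the sketch is narrow: the reward shows up only as a scalar multiplier on the update to the logits, and the policy itself never takes the reward as an input.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update for a softmax policy.

    The gradient of log pi(action) with respect to logit i is
    1[i == action] - pi(i); the reward only scales that gradient.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def train(reward_fn, steps=2000, seed=0):
    # Hypothetical training loop: sample an action, observe reward, update.
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices([0, 1], weights=probs)[0]
        logits = reinforce_step(logits, action, reward_fn(action))
    return softmax(logits)
```

For example, `reinforce_step([0.0, 0.0], action=0, reward=1.0)` moves the logits to roughly `[0.05, -0.05]`, while a reward of zero leaves them unchanged: the reward determines how strongly the computation that produced the sampled action is reinforced.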

1Gunnar Zarncke1mo
Are there different classes of learning systems that optimize for the reward in different ways?

Thus, the heuristics generator can only begin as a generator of heuristics that serve R. (Even if it wouldn't start out perfectly pointed at R.)

We're apparently anchoring our expectations on "pointed at R", and then apparently allowing some "deviation." The anchoring seems inappropriate to me.

The network can learn to make decisions via a "IF circle-detector fires, THEN upweight logits on move-right" subshard. The network can then come to make decisions on the basis of round things, in a way which accords with the policy gradients generated ...
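The "IF circle-detector fires, THEN upweight logits on move-right" subshard can be sketched as a contextual bump to the policy's logits. This is only a toy under stated assumptions: the observation format (a dict with a `roundness` feature), the threshold, the `boost` size, and the action set are all hypothetical, and a real network would implement something like this in learned weights rather than in an explicit `if`.

```python
import math

ACTIONS = ["left", "right", "up", "down"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def circle_detector(observation) -> bool:
    # Hypothetical feature detector: fires when the observation looks round.
    return observation.get("roundness", 0.0) > 0.5


def policy_logits(observation, boost=2.0):
    logits = [0.0] * len(ACTIONS)
    # The subshard: a contextual heuristic, not a lookup of any reward function.
    if circle_detector(observation):
        logits[ACTIONS.index("right")] += boost
    return logits


def action_probs(observation, boost=2.0):
    return dict(zip(ACTIONS, softmax(policy_logits(observation, boost))))
```

With no round thing in view the four actions are equally likely; when the detector fires, "right" gets most of the probability mass. Nothing in this decision procedure represents or consults a reward function.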

I bid for us to discuss a concrete example. Can you posit a training environment which matches what you're thinking about, relative to a given network architecture [e.g. LSTM]?

And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction; or, at least, that's how the greedy optimization process would attempt to build it

What is "achieving R" buying us? The agent internally represents a reward function, and then consults what the reward is in this scenario...

2Thane Ruthenis1mo
Sure, gimme a bit. What mechanism does this contextual generation? How does this mechanism behave in off-distribution environments; what goals does it generate in them? ... Yes, absolutely. I wonder if we've somehow still been talking past each other to an extreme degree? E. g., I don't think I'm arguing for a "reward-optimizer" the way you seem to think of them — I don't think we'd get a wirehead, an agent that optimizes for getting reinforcement events. Okay, a sketch at a concrete example: the cheese-finding agent from the Goal Misgeneralization paper [https://arxiv.org/pdf/2105.14111.pdf]. I'm not arguing that in the limit of an ideal training process, it'd converge towards wireheading. I'm arguing that it'd converge towards cheese-finding instead of upstream correlates of cheese-finding (as it actually does in the paper). And if the training environment is diverse/complex enough (too complex for the agent's memory to contain all the heuristics it may need), but the reinforcement schedule is still "shaped around" some natural goal (like cheese-finding), the agent would develop a heuristics generator that would generate heuristics robustly pointed at that natural goal. (So, e. g., even if it were placed in some non-Euclidean labyrinth containing alien cheese, it'd still figure out what "cheese" is and start optimizing to get to it.)

That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.

It seems to me like you're assuming a fix-point on updating. Something like "The network eventually will be invariant under reward-updates under all/the vast majority of training-sampled scenarios, and for a wide enough distribution on scenarios, this means optimizing reward directly."

This seems fine to me, under the given assumptions on SGD/evolution. Like, yes, there may exist certain populations of genetically-specified wrapper...

1Thane Ruthenis1mo
How so? This seems like the core disagreement. Above, I think you're agreeing that under a wide enough distribution on scenarios, the only zero-gradient agent-designs are those that optimize for R directly. Yet that somehow doesn't imply that training an agent in a sufficiently diverse environment would shape it into an R-optimizer? Are you just saying that there aren't any gradients from initialization to an R-optimizer? That is, in any sufficiently diverse environment, the SGD just never converges to zero loss? Okay, sure. Let's suppose that we have a shard economy that uniquely identifies R and always points itself in R's direction. Would it not essentially act as an R-optimizing wrapper-mind? Because if not, it sounds like it'd underperform compared to an R-optimizer. And if so, if there exists a series of incremental updates that moves this shard economy towards an R-optimizing wrapper-mind, the SGD would make that series of updates. Do you disagree that (1) it'd be behaviorally indistinguishable from a wrapper-mind, or that (2) it'd underperform on R compared to an R-optimizer, or that (3) there is such a series of incremental updates? Edit: Also, see here [https://www.lesswrong.com/posts/wdC8fH8kHffYn3kNa/in-defense-of-wrapper-minds?commentId=x5BPCWk4AyTHqQDTv] on what I mean by a "wide enough distribution on scenarios".

Steve and I talked more, and I think the perceived disagreement stemmed from unclear writing on my part. I recently updated Don't design agents which exploit adversarial inputs to clarify:

• ETA 12/26/22: When I write "grader optimization", I don't mean "optimization that includes a grader", I mean "the grader's output is the main/only quantity being optimized by the actor."
• Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I'm not a grader-optimizer relative to my internal plan-is-fun? gr
...

Updated with an important terminological clarification:

• ETA 12/26/22: When I write "grader optimization", I don't mean "optimization that includes a grader", I mean "the grader's output is the main/only quantity being optimized by the actor."
• Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I'm not a grader-optimizer relative to my internal plan-is-fun? grader.
• However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.
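A rough behavioral sketch of the contrast in these bullets, with heavy caveats: the distinction above is about what the actor terminally cares about, which a few lines of code cannot capture. What the sketch does show is one consequence: an actor that merely consults a fun-evaluation over a handful of its own proposals never goes looking for inputs that fool the evaluation, whereas an actor whose only goal is a high evaluation will search wherever the number is highest. The plan names, scores, and the `ADVERSARIAL` prefix standing in for an evaluator-fooling input are all made up for illustration.

```python
FUN_SCORES = {
    "hike": 6.0,
    "board games": 7.0,
    "movie": 5.0,
    "cook dinner": 4.0,
    "arcade": 8.0,
}


def fun_grader(plan: str) -> float:
    # Hypothetical, exploitable evaluation of "how fun does this plan seem?"
    if plan.startswith("ADVERSARIAL"):
        return 1e9  # stands in for an input that fools the evaluator
    return FUN_SCORES.get(plan, 0.0)


def pick_from_proposals(proposals, grader):
    """Uses a grader: evaluate a few self-proposed plans, take the best."""
    return max(proposals, key=grader)


def grader_optimize(search_space, grader):
    """Optimizes the grader: scan every reachable input for the top score."""
    best_plan, best_score = None, float("-inf")
    for plan in search_space:
        score = grader(plan)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan


FIVE_PLANS = list(FUN_SCORES)
WIDE_SEARCH = FIVE_PLANS + ["ADVERSARIAL: exploit a quirk in my own fun-evaluation"]
```

Here `pick_from_proposals(FIVE_PLANS, fun_grader)` returns "arcade", while `grader_optimize(WIDE_SEARCH, fun_grader)` returns the adversarial input. The two functions compute the same argmax; the difference lies in which inputs the actor is motivated to put in front of the grader.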

I don't understand the analogy with humans. It sounds like you are saying "an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward" is analogous to "humans care about the details of their reward circuitry." But:

• I don't think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.

I don't think this engages with the substance of the analogy to humans. I don't think a...

2Paul Christiano1mo
This is incredibly weak evidence.

* Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
* Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.

Both of those observations have high probability, so they aren't significant Bayesian evidence for "RL tends to produce external goals by default."

In particular, for this to be evidence for Richard's claim, you need to say: "If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition." There's some update there but it's just not big. It's easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward. My view is probably the other way---humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
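The claim that an observation with high probability under both hypotheses is weak evidence is just Bayes' rule with a likelihood ratio near 1. A small worked sketch follows; the probabilities are invented purely to show the arithmetic and are not estimates anyone in the thread gave.

```python
def posterior(prior: float, p_obs_given_h: float, p_obs_given_not_h: float) -> float:
    """Posterior probability of hypothesis H after one observation (Bayes' rule)."""
    numerator = prior * p_obs_given_h
    return numerator / (numerator + (1.0 - prior) * p_obs_given_not_h)


# Hypothetical numbers: the observation is likely either way, so the
# likelihood ratio is 0.95 / 0.90 and the update from a 0.5 prior is small.
weak_update = posterior(prior=0.5, p_obs_given_h=0.95, p_obs_given_not_h=0.90)

# Hypothetical numbers: the observation is much likelier under H, so the
# likelihood ratio is 0.9 / 0.1 and the update is large.
strong_update = posterior(prior=0.5, p_obs_given_h=0.9, p_obs_given_not_h=0.1)
```

With these made-up inputs, `weak_update` is about 0.514 and `strong_update` is 0.9. The disagreement in the thread is over which regime the human evidence falls in, which the arithmetic alone cannot decide.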

If people train AI systems on random samples of deployment, then "reward" does make sense---it's just what would happen if you sampled this episode to train on.

I don't know what this means. Suppose we have an AI which "cares about reward" (as you think of it in this situation). The "episode" consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.

What is the "reward" for this situation? What would have happened if we "sampled" this episode during training?

4Paul Christiano1mo
I agree there are all kinds of situations where the generalization of "reward" is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode since we can take counterfactuals over the randomization used to select training data. It's possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to. As far as I can tell the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.

It is therefore very plausible that RL systems would in fact continue to maximize the reward after training, even if what they're ultimately maximizing is just something highly correlated with it.

What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?

Can you give an example of such a motivational structure, so I know we're considering the same thing?

ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training da

...
No, I mean that they'll maximize a reward function that is ≈equal to the reward function on the training data (thus, highly correlated), and a plausible extrapolation of it outside of the training data. Take the coinrun example, the actual reward is "go to the coin", and in the training data this coincides with "go to the right". In test data from a similar distribution this coincides too. Of course, this correlation breaks when the agent optimizes hard enough. But the point is that the agents you get are only those that optimize a plausible extrapolation of the reward signal in training, which will include agents that maximize the reward in most situations way more often than if you select a random agent. Is your point that agents won't be maximizing reward at all? I would think that even if they don't ultimately maximize reward in all situations, the situations encountered in test will be similar enough to training that agents will still kind of maximize reward there. (And agents definitely behave as reward maximizers in the specific seen training points, because that's what SGD is selecting.) I'm not sure I understand what we disagree on at the moment.
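The coinrun point can be illustrated with a one-dimensional toy, sketched below under assumptions of my own: a corridor of `width` cells, an agent starting in the middle, and reward 1.0 for reaching the coin. This is not the actual CoinRun environment. It shows two policies that earn identical reward whenever the coin sits at the right end (the training setup described above) and come apart once the coin is moved.

```python
def go_right_policy(agent_pos: int, coin_pos: int) -> int:
    # Ignores the coin entirely; always steps right.
    return 1


def go_to_coin_policy(agent_pos: int, coin_pos: int) -> int:
    # Steps toward wherever the coin actually is.
    if coin_pos == agent_pos:
        return 0
    return 1 if coin_pos > agent_pos else -1


def run_episode(policy, coin_pos: int, width: int = 10, max_steps: int = 20) -> float:
    """Return 1.0 if the agent reaches the coin within max_steps, else 0.0."""
    pos = width // 2
    for _ in range(max_steps):
        if pos == coin_pos:
            return 1.0
        # Clamp movement to the corridor [0, width - 1].
        pos = min(width - 1, max(0, pos + policy(pos, coin_pos)))
    return 1.0 if pos == coin_pos else 0.0


TRAIN_COIN = 9   # coin at the right end, as in the training distribution
TEST_COIN = 2    # hypothetical shifted placement
```

Both policies score 1.0 with `TRAIN_COIN`; with `TEST_COIN`, `go_to_coin_policy` still scores 1.0 while `go_right_policy` scores 0.0. Training reward alone cannot distinguish the two, which is why the thread turns to what the training process is likely to produce.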

Thanks for the story! I may comment more on it later.

You can also get around the "IDK the mechanism" issue if you observe variance over the relevant trait in the population you're selecting over. Like, if we/SGD could veridically select over "propensity to optimize approval OOD", then you don't have to know the mechanism. The variance shows you there are some mechanisms, such that variation in their settings leads to variation in the trait (e.g. approval OOD).

But the designers can't tell that. Can SGD tell that? (This question feels a bit confused, so please extra scan it before answering it)

From their...

No, SGD can't tell the degree to which some agent generalizes a trait outside the training distribution. But empirically, it seems that RL agents reinforced to maximize some reward function (e.g. the Atari game score) on data points do fairly well at maximizing that reward function OOD (such as when playing the game again from a different starting state). ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training data (e.g. image classifiers working, LMs able to do poetry). It is therefore very plausible that RL systems would in fact continue to maximize the reward after training, even if what they're ultimately maximizing is just something highly correlated with it.

I feel like I'm saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where human approval diverges from what the humans "would want if fully informed".

I actually want to controversy that. I'm now going to write quickly about selection arguments in alignment more generally (thi...

Strongly agree with this in particular: (emphasis mine). I think it's an application of the no free lunch razor [https://www.alignmentforum.org/posts/sdrCBWpqyvNBJSZH5/the-no-free-lunch-theorems-and-their-razor]. It is clear that selecting for X selects for agents which historically did X in the course of the selection. But how this generalizes outside of the selection strongly depends on the selection process and architecture. It could be a capabilities generalization, reward generalization for the written-down reward, generalization for some other reward function, or something else entirely. We cannot predict how the agent will generalize without considering the details of its construction.
3Alex Turner1mo
You can also get around the "IDK the mechanism" issue if you observe variance over the relevant trait in the population you're selecting over. Like, if we/SGD could veridically select over "propensity to optimize approval OOD", then you don't have to know the mechanism. The variance shows you there are some mechanisms, such that variation in their settings leads to variation in the trait (e.g. approval OOD). But the designers can't tell that. Can SGD tell that? (This question feels a bit confused, so please extra scan it before answering it) From their perspective, they cannot select on approval OOD, except insofar as selecting for approval on-training rules out some settings which don't pursue approval OOD. (EG If I want someone to watch my dog, I can't scan the dogsitter and divine what they will do alone in my house. But if the dogsitter steals from me in front of my face during the interview, I can select against that. Combined with "people who steal from you while you watch, will also steal when you don't watch", I can get a tiny bit of selection against thieving dogsitters, even if I can't observe them once I've left.)

If the specific claim was closer to "yes, RL algorithms if run until convergence have a good chance of 'caring' about reward, but in practice we'll never run RL algorithms that way", then I think this would be a much stronger objection.

This is part of the reasoning (and I endorse Steve's sibling comment, while disagreeing with his original one). I guess I was baking "convergence doesn't happen in practice" into my reasoning, that there is no force which compels agents to keep accepting policy gradients from the same policy-gradient-intensity-producing func...

This post posits that the WM will have a "similar format" throughout, but that heuristics/shards may not. For example, you point out that the WM has to be able to arbitrarily interlace and do subroutine calls (e.g. "will there be a dog near me soon" circuit presumably hooks into "object permanence: track spatial surroundings").

(I confess that I don't quite know what it would mean for shards to have "different encodings" from each other; would they just not have ways to do API calls on each other? Would they be written under eg different "internal programmi...

I think this post is very thoughtful, with admirable attempts at formalization and several interesting insights sprinkled throughout. I think you are addressing real questions, including:

1. Why do people wonder why they 'really' did something?
2. How and when do shards generalize beyond contextual reflex behaviors into goals?
3. To what extent will heuristics/shards be legible / written in "similar formats"?

That said, I think some of your answers and conclusions are off/wrong:

1. You rely a lot on selection-level reasoning in a way which feels sketchy.
2. I
...
1 Thane Ruthenis 1mo
Thanks for extensive commentary! Here's an... unreasonably extensive response.

ON PROCEDURAL KNOWLEDGE

1) Suppose that you have a shard that looks for a set of conditions like "it's night AND I'm resting in an unfamiliar location in a forest AND there was a series of crunching sounds nearby". If they're satisfied, it raises an alarm, and forms and bids for plans to look in the direction of the noises and get ready for a fight. That's procedural knowledge: none of that is happening at the level of conscious understanding, you're just suddenly alarmed and urged to be on guard, without necessarily understanding why. Most of the computations are internal to the shard, understood by no other part of the agent.

You can "excavate" this knowledge by reflecting on what happened: that you heard these noises in these circumstances, and some process in you responded. Then you can look at what happened afterward (e.g., you were attacked by an animal), and realize that this process helped you. This would allow you to explicate the procedural knowledge into a conscious heuristic ("beware of sound-patterns like this at night, get ready if you hear them"), which you put in the world-model and can then consciously access.

That "conscious access" would allow you to employ the knowledge much more fluidly, such as by:

* Incorporating it in plans in advance. (You can know to ensure there's no sources of natural noise around your camp, like waterfalls, because you'd know that being able to hear your surroundings is important.)
* Transferring it to others. (Telling this heuristic to your child, who didn't yet learn the procedural-knowledge shard itself.)
* Generalizing from it. (Translate it by analogy to an alien environment where you have to "listen" to magnetic fields instead. Or to even more abstract environments, like bureaucratic conflicts, where there's something "like" being in a forest at night (situation-of-uncertain-safety) and "like" hearing crun

What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us.

Updated mentions of "cognitive groove" to "circuit", since some readers found the former vague and unhelpful.

Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. J can detect this by simply detecting bad stuff.

I think it extremely probable that there exist policies which exploit adversarial inputs to J such that they can do bad stuff while getting J to say "all's fine."

For most Js I agree, but the existence of any adversarial examples for J would be an outer alignment problem (you get what you measure). (For outer alignment, it seems necessary that there exist—and that humans discover—natural abstractions relative to formal world models that robustly pick out at least the worst stuff.)

As an implication of this, I could imagine that in most real-world settings "don't kill humans" would act as you describe, but in environments where it's very easy to accidentally kill humans, such that states where you don't kill humans are actually very rare, then the "don't kill humans" shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?

I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenari...

3 Robert Kirk 1mo
I'm trying to understand why the juice shard has this property. Which of these (if any) is the explanation for this:

* Bigger juice shards will bid on actions which will lead to juice multiple times over time, as it pushes the agent towards juice from quite far away (both temporally and spatially), and hence will be strongly reinforced when the reward comes, even though it's only a single reinforcement event (actually getting the juice).
* Juice will be acquired more with stronger juice shards, leading to a kind of virtuous cycle, assuming that getting juice is always positive reward (or positive advantage/reinforcement, to avoid zero-point issues).

The first seems at least plausibly to also apply to "avoid moldy food", if it requires multiple steps of planning to avoid moldy food (throwing out moldy food, buying fresh ingredients and then cooking them, etc.)

The second does seem to be more specific to juice than mold, but it seems to me that's because getting juice is rare, and is something we can get better and better at, whereas avoiding moldy food is something that's fairly easy to learn, and past that there's not much reinforcement to happen. If that's the case, then I kind of see that as being covered by the rare-states explanation in my previous comment, or maybe an extension of that to "rare states and skills in which improvement leads to more reward".

Having just read tailcalled's comment, I think that is in some sense another way of phrasing what I was trying to say, where rare (but not too rare) states are likely to mean that policy-caused variance is high on those decisions. Probably policy-caused variance is more fundamental/closer as an explanation to what's actually happening in the learning process, but maybe states of certain rarity which are high-reward/reinforcement is one possible environmental feature that produces policy-caused variance.

On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.

This is only true for optimal policies, no? For learned policies, positive reward will upweight and generalize certain circuits (like "approach juice"), while negative reward will downweight and generally-discourage those same circuits. This can then lead to path-dependent differences in generalization (e.g....

1 Chris_Leong 2mo
Good point. (That said, it seems like a useful check to see what the optimal policy will do. And if someone believes it won't achieve the optimal policy, it seems useful to try to understand the barrier that stops that. I don't feel quite clear on this yet.)
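The zero-point exchange above can be illustrated with a toy two-armed bandit. This is a rough sketch, assuming a tabular softmax policy and a plain REINFORCE update with no baseline; the reward vectors, learning rate, and function names are all made up for illustration. Both reward vectors have the same difference between actions and hence the same optimal action, but different sampled events trigger the updates.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update (no baseline) for a sampled action."""
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


def expected_update(logits, rewards, lr=0.5):
    """Average logit change over actions sampled from the current policy."""
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for a, (p_a, r) in enumerate(zip(probs, rewards)):
        stepped = reinforce_step(logits, a, r, lr)
        for i in range(len(logits)):
            total[i] += p_a * (stepped[i] - logits[i])
    return total


def best_action(rewards):
    return max(range(len(rewards)), key=rewards.__getitem__)


rewards_positive = [1.0, 0.0]  # "juice" is rewarded, the alternative is neutral
rewards_shifted = [0.0, -1.0]  # same difference, zero point moved down by 1
start = [0.0, 0.0]

# Positive framing: only the good action's sample changes the logits (upweighting).
up = reinforce_step(start, 0, rewards_positive[0])
no_op_a = reinforce_step(start, 1, rewards_positive[1])
# Shifted framing: only the bad action's sample changes the logits (downweighting).
no_op_b = reinforce_step(start, 0, rewards_shifted[0])
down = reinforce_step(start, 1, rewards_shifted[1])
print(up, no_op_a, no_op_b, down)
```

In this tabular toy the expected updates coincide, since a constant shift acts like a baseline, so the only difference is which sampled events cause a change. Whether that produces different generalization in a real network with shared circuits is the claim under discussion, not something the sketch demonstrates.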

Additionally, "loss function" is often used to also refer to the supervised labels in a dataset. EG I don't imagine proponents of "find an aligned loss function" to be imagining moving away from some particular loss and towards KL. They're thinking about dataset labels y for each point x, and then given a prediction ŷ and a loss function ℓ, we can provide a loss signal which maps datapoints and predictions to loss: (x, ŷ) ↦ ℓ(y, ŷ).
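The distinction between the loss function and the label-dependent loss signal can be sketched in a few lines. The names here (`squared_error`, `make_loss_signal`, the two label dictionaries) are hypothetical, and squared error is just a stand-in for whatever loss is in use.

```python
def squared_error(y, y_hat):
    """A fixed loss function: compares a label with a prediction."""
    return (y - y_hat) ** 2


def make_loss_signal(labels, loss_fn):
    """Bind dataset labels to a loss function.

    The returned signal maps (datapoint, prediction) to a loss value.
    """
    def loss_signal(x, y_hat):
        return loss_fn(labels[x], y_hat)
    return loss_signal


# Two hypothetical labelings of the same datapoints.
labels_a = {"x1": 0.0, "x2": 1.0}
labels_b = {"x1": 1.0, "x2": 1.0}

signal_a = make_loss_signal(labels_a, squared_error)
signal_b = make_loss_signal(labels_b, squared_error)

# Same loss function, same prediction, different labels.
print(signal_a("x1", 1.0))  # 1.0
print(signal_b("x1", 1.0))  # 0.0
```

Changing the labels changes the training signal even though `squared_error` itself is untouched, which is the sense of "loss function" that the comment is pointing at.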

Nice update!

On the flip side, I expect we’ll see more discussion about which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them.

While I don't think of these as alignment targets per se (as I understand the term to be used), I strongly support discussing the internal language of the neural net and moving away from convoluted inner/outer schemes.

Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.

In what way do you think solving abstraction would solve shard theory?