# All of TurnTrout's Comments + Replies

In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I'd feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors.

In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that's the part which is a pointer problem.

Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?

the pointer problem part is roughly "specify what I mean well enough that I could use the specification to get an AI to do what I mean", assuming problems like "get AI to follow specification" can be solved.

2johnswentworth2d
In the context of the discussion with Richard, I was assuming the general model in which we want an inner optimizer's objective to match an outer optimization objective. We can of course drop that assumption (as you usually do), but then we still need to know what objective/values we want to imbue in the final system. And whatever final objective/values we're aiming for, we need it to actually match what we want along all the relevant dimensions, and not allow any degrees of freedom to Goodhart; that would be the corresponding problem for the sort of approach you think about. No, I am not assuming anything that specific. The pointers problem is not meant to be a problem with one particular class of approaches to constructing aligned AI; it is meant to be a foundational problem in saying what-we-want. Insofar as we haven't solved the pointers problem, we have not yet understood the type signature of our own values; not only do we not know what we want, we don't even understand the type signature of "wanting things".

As an example, here are three possible reactions to a no-ghost update:

Suppose that many (EDIT: a few) of your value shards take as input the ghost latent variable in your world model. You learn ghosts aren't real. Let's say this basically sets the ghost-related latent variable value to false in all shard-relevant contexts. Then it seems perfectly fine that most of my shards keep on bidding away and determining my actions (e.g. protect my family), since most of my value shards are not in fact functions of the ghost latent variable. While it's indeed possibl...
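A toy sketch of the no-ghost update may make the claim concrete. It assumes, purely for illustration, that a shard can be modeled as a function from world-model latents to a bid strength; the latent names, shard names, and bid sizes below are all hypothetical and not part of the original comment.

```python
# Toy sketch: value shards as functions of world-model latents.
# Every name and number here is a hypothetical illustration.
from typing import Callable, Dict

WorldModel = Dict[str, bool]
Shard = Callable[[WorldModel], float]  # returns a bid strength

shards: Dict[str, Shard] = {
    "protect_family": lambda wm: 1.0 if wm["family_in_danger"] else 0.0,
    "eat_when_hungry": lambda wm: 0.5 if wm["hungry"] else 0.0,
    "appease_ghosts": lambda wm: 0.8 if wm["ghosts_exist"] else 0.0,
}

def bids(wm: WorldModel) -> Dict[str, float]:
    """Collect each shard's bid under the given world model."""
    return {name: shard(wm) for name, shard in shards.items()}

# Same situation, before and after learning that ghosts aren't real.
before = bids({"family_in_danger": True, "hungry": True, "ghosts_exist": True})
after = bids({"family_in_danger": True, "hungry": True, "ghosts_exist": False})

# Only shards that actually read the ghost latent change their bids.
changed = sorted(name for name in shards if before[name] != after[name])
print(changed)
```

In this toy, flipping the ghost latent silences only the one shard that reads it; the others keep bidding exactly as before.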

2Abram Demski1d
With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here [https://www.lesswrong.com/posts/yLTpo828duFQqPJfy/builder-breaker-for-deconfusion#Example__Wireheading], wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach. Output-based evaluation (including supervised learning, and the most popular forms of unsupervised learning, and a lot of other stuff which treats models as black boxes implementing some input/output behavior or probability distribution or similar) can't distinguish between a model which is internalizing the desired concepts, vs a model which is instead modeling the actual feedback process. These two do different things, but not in a way that the feedback system can differentiate. In terms of shard theory, as I understand it, the point is that (absent arguments to the contrary, which is what we want to be able to construct), shards that implement feedback-modeling like this cannot be disincentivized by the feedback process, since they perform very well in those terms. Shards which do other things may or may not be disincentivized, but the feedback-modeling shards (if any are formed at any point) definitely won't, unless of course they're just not very good at their jobs. So the problem, then, is: how do we arrange training such that those shards have very little influence, in the end? How do we disincentivize that kind of reasoning at all? Plausibly, this should only be tackled as a knock-on effect of the real problem, actually giving good feedback which points in the right direction; however, it remains a powerful counterexample class which challenges many many proposals. (And therefore, trying to generate the analogue of the wireheading problem for a given proposal seems like a good sanity check.)
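The indistinguishability claim above can be sketched with a toy example. This is only an illustration under made-up assumptions: the "concept" is evenness, the feedback process is correct on small inputs and wrong elsewhere, and all function names are hypothetical.

```python
# Toy sketch: a concept-learner and a feedback-modeler look identical
# to output-based evaluation. All names here are hypothetical.

def true_label(x: int) -> bool:
    """The concept we hope to teach: is x even?"""
    return x % 2 == 0

def feedback(x: int) -> bool:
    """The actual feedback process: correct below 10, always True above."""
    return true_label(x) if x < 10 else True

def concept_model(x: int) -> bool:
    """A model that internalized the intended concept."""
    return true_label(x)

def feedback_model(x: int) -> bool:
    """A model that instead learned to predict the feedback process."""
    return feedback(x)

def training_score(model, inputs) -> float:
    """Fraction of inputs where the model agrees with the feedback."""
    return sum(model(x) == feedback(x) for x in inputs) / len(inputs)

train_inputs = list(range(10))     # where feedback happens to be correct
held_out = list(range(10, 20))     # where feedback and truth come apart

scores = (training_score(concept_model, train_inputs),
          training_score(feedback_model, train_inputs))
disagreements = [x for x in held_out
                 if concept_model(x) != feedback_model(x)]
print(scores, disagreements)
```

Both models get a perfect score on the training inputs, so the feedback cannot separate them; they only come apart on inputs where the feedback process itself diverges from the intended concept.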

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not ...

3Abram Demski1d
I think that it generally seems like a good idea to have solid theories of two different things: 1. What is the thing we are hoping to teach the AI? 2. What is the training story by which we mean to teach it? I read your above paragraph as maligning (1) in favor of (2). In order to reinforce the desired shards, it seems helpful to have some idea of what those look like. For example, if we avoid fancy philosophical frameworks, we might think a good way to avoid wireheading is to introduce negative examples where the AI manipulates circuitry to boost reinforcement signals, and positive examples where the AI doesn't do that when given the opportunity. After doing some philosophy where you try to positively specify what you're trying to train, it's easier to notice that this sort of training still leaves the human-manipulation failure mode open. After doing this kind of philosophy for a while, it's intuitive to form the more general prediction that if you haven't been able to write down a formal model of the kind of thing you're trying to teach, there are probably easy failure modes like this which your training hasn't attempted to rule out at all.
2Abram Demski1d

The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. In a sense, this means having aligned goals without having the same goals: your goal is to cooperate with "human goals", but you don't yet have a full description of what human goals are. Your value function might be much simpler than the human value function.

In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.

Insofar as I understa...

2Abram Demski1d
I said: You said: Thinking about this now, I think maybe it's a question of precautions, and what order you want to teach things in. Very similarly to the argument that you might want to make a system corrigible first, before ensuring that it has other good properties -- because if you make a mistake, later, a corrigible system will let you correct the mistake. Similarly, it seems like a sensible early goal could be 'get the system to understand that the sort of thing it is trying to do, in (value) learning, is to pick up human values'. Because once it has understood this point correctly, it is harder for things to go wrong later on, and the system may even be able to do much of the heavy lifting for you. Really, what makes me go to the meta-level like this is pessimism about the more direct approach. Directly trying to instill human values, rather than first training in a meta-level understanding of that task, doesn't seem like a very correctible approach. (I think much of this pessimism comes from mentally visualizing humans arguing about what object-level values to try to teach an AI. Even if the humans are able to agree, I do not feel especially optimistic about their choices, even if they're supposedly informed by neuroscience and not just moral philosophers.)
2Abram Demski1d
I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases -- or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent. This is because writing down the desired inductive biases as an explicit prior can help us to understand what's going on better. It's tempting to say that to understand how the brain learns, is to understand how it treats feedback as evidence, and updates on that evidence. Of course, there could certainly be other theoretical frames which are more productive. But at a deep level, if the learning works, the learning works because the feedback is evidence about the thing we want to learn, and the process which updates on that feedback embodies (something like) a good prior telling us how to update on that evidence. And if that framing is wrong somehow, it seems intuitive to me that the problem should be describable within that ontology, like how I think "utility function" is not a very good way to think about values because what is it a function of; we don't have a commitment to a specific low-level description of the universe which is appropriate for the input to a utility function. We can easily move beyond this by considering expected values as the "values/preferences" representation, without worrying about what underlying utility function generates those expected values. (I do not take the above to be a knockdown argument against "committing to the specific division between outer and inner alignment steers you wrong" -- I'm just saying things that seem true to me and plausibly relevant to the debate.)
2Abram Demski1d
I expect you'll say I'm missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is "figuring out the human prior", because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that; but I'm fine with that. I am not dogmatic on orthodox Bayesianism [https://www.lesswrong.com/posts/xJyY5QkQvNJpZLJRo/radical-probabilism-1]. I do not even like utility functions [https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions]. I am totally fine with saying "inductive biases" instead of "prior"; I think it indeed pins down what I meant in a more accurate way (by virtue of, in itself, being a more vague and imprecise concept than "prior").

I feel confused. I think this comment is overall good (though I don't think I understand a third of it), but doesn't seem to suggest the genome actually solved information inaccessibility in the form of reliably locating learned WM concepts in the human brain?

But, do you really fundamentally care that your kids have genomes?

Seems not relevant? I think we're running into an under-definition of IGF (and the fact that it doesn't actually have a utility function, even over local mutations on a fixed genotype). Does IGF have to involve genomes, or just information patterns as written in nucleotides or in binary? The "outer objective" of IGF suffers a classic identifiability issue common to many "outer objectives", where the ancestral "training signal" history is fully compatible with "IGF just for genomes" and also ...

I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...

And insofar as this impression is correct, this is a mistake. There is only one way alignment is.

If inner/outer is altogether a more faithful picture of those dynamics:

• relatively coherent singular mesa-objectives form in agents, albeit not necessarily always search-based
• more fragility of value and difficulty in getting the mesa object

FWIW I think the most important distinction in "alignment" is aligning with somebody's preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.

I have an upcoming post which might be highly relevant. Many proposals which black-box human judgment / model humans, aren't trying to get an AI which optimizes what people want. They're getting an AI to optimize evaluations of plans—the quotation of human desires, as quoted via those evaluations. And I think that's a subtle distinction which can prove quite fatal.

4Alex Flint3d
Right. Many seem to assume that there is a causal relationship good -> human desires -> human evaluations. They are hoping both that if we do well according to human evaluations then we will be satisfying human desires, and that if we satisfy human desires, we will create a good world. I think both of those assumptions are questionable. I like the analogy in which we consider an alternative world where AI researchers assumed, for whatever parochial reason, that it was actually human dreams that should guide AI behavior. In this world, they ask humans to write down their dreams, and try to devise AIs that would make the world like that. There are two assumptions here: (1) that making the world more like human dreams would be good, and (2) that humans can correctly report their dreams. In the case of dreams, both of these assumptions are suspect, right? But what exactly is the difference with human desires? Why do we assume that either they are a guide to what is good or can be reported accurately?

I think that's also a good thing to think about, but most of the meat is in how you actually reason about that and how it leads to superior or at least adequate+complementary predictions about the behavior of ML systems. I think to the extent this perspective is useful for alignment it also ought to be useful for reasoning about the behavior of existing systems like large language models

Sure. To clarify, superior to what? "GPT-3 reliably minimizes prediction error; it is inner-aligned to its training objective"?

4Paul Christiano3d
I'd describe the alternative perspective as: we try to think of GPT-3 as "knowing" some facts and having certain reasoning abilities. Then to predict how it behaves on a new input, we ask what the best next-token prediction is about the training distribution, given that knowledge and reasoning ability. Of course the view isn't "this is always what happens," it's a way of making a best guess. We could clarify how to set the error bars, or how to think more precisely about what "knowledge" and "reasoning abilities" mean. And our predictions depend on our prior over what knowledge and reasoning abilities models will have, which will be informed by a combination of estimates of algorithmic complexity of behaviors and bang-for-your-buck for different kinds of knowledge, but will ultimately depend on a lot of uncertain empirical facts about what kind of thing language models are able to learn. Overall I acknowledge you'd have to say a lot more to make this into something fully precise, and I'd guess the same will be true of a competing perspective. I think this is roughly how many people make predictions about GPT-3, and in my experience it generally works pretty well and many apparent errors can be explained by more careful consideration of the training distribution. If we had a contest where you tried to give people short advice strings to help them predict GPT-3's behavior, I think this kind of description would be an extremely strong entry. This procedure is far from perfect. So you could imagine something else doing a lot better (or providing significant additional value as a complement).

because I don't know what alignment means that I think it's helpful to have some hand-hold terms like "alignment"

Do you mean "outer/inner alignment"?

Supposing you mean that—I agree that it's good to say "and I'm confused about this part of the problem", while also perhaps saying "assuming I've formulated the problem correctly at all" and "as I understand it."

I don't really disagree with anything you've written, but, in general, I think we should allow some of our words to refer to "big confusing problems" that we don't yet know how to clarify, because we s

2Alex Flint3d
Wonderful! I don't have any complaints per se about outer/inner alignment, but I use it relatively rarely in my own thinking, and it has resolved relatively few of my confusions about alignment. FWIW I think the most important distinction in "alignment" is aligning with somebody's preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.

I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you're interested in working with me and Quintin.

I was talking with Abram Demski today about a promising-seeming research direction. (Following is my own recollection)

One of my (TurnTrout's) reasons for alignment optimism is that I think:

• We can examine early-training cognition and behavior to some extent, since the system is presumably not yet superintelligent and planning against us,
• (Although this amount of information depends on how much interpretability and agent-internals theory we do now)
• All else equal, early-training values (decision-influences) are the most important to influence, since they steer
2johnswentworth1d
One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents? [https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents]), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out. Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards which each imperfectly capture different aspects of human values, with different imperfections, then there are possibly changes which Goodhart all of the shards simultaneously. Indeed, I'd expect that to be a pretty strong default outcome.
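A toy sketch of the veto argument, under heavily simplified assumptions: a "change" is a dictionary of effects on three made-up dimensions, the true value is their sum, and each imperfect shard scores only part of the picture. The dimension names, shard names, and numbers are hypothetical and chosen only to exhibit the failure mode.

```python
# Toy sketch: imperfect shards with veto power can all approve a change
# that is bad by the true value. All names and numbers are hypothetical.
from typing import Callable, Dict, List

Change = Dict[str, float]
Scorer = Callable[[Change], float]

def true_value(c: Change) -> float:
    """What we actually care about: every dimension counts."""
    return c["health"] + c["freedom"] + c["happiness"]

# Each imperfect shard ignores one dimension entirely.
imperfect_shards: Dict[str, Scorer] = {
    "health_shard": lambda c: c["health"] + c["happiness"],
    "freedom_shard": lambda c: c["freedom"] + c["happiness"],
}

def passes_all_vetoes(c: Change, shards: Dict[str, Scorer]) -> bool:
    """A change goes through only if no shard scores it negatively."""
    return all(score(c) >= 0 for score in shards.values())

candidates: List[Change] = [
    {"health": 1.0, "freedom": 1.0, "happiness": 0.0},    # plainly good
    {"health": -1.0, "freedom": 0.0, "happiness": 0.5},   # health shard vetoes
    {"health": -1.0, "freedom": -1.0, "happiness": 1.5},  # fools both shards
]

goodharted = [c for c in candidates
              if passes_all_vetoes(c, imperfect_shards) and true_value(c) < 0]

# Adding one perfectly aligned shard with veto power blocks the bad change.
with_aligned = dict(imperfect_shards, aligned_shard=true_value)
still_goodharted = [c for c in candidates
                    if passes_all_vetoes(c, with_aligned) and true_value(c) < 0]
print(len(goodharted), len(still_goodharted))
```

The third candidate passes every imperfect shard's veto while lowering the true value; once a perfectly aligned shard also holds a veto, nothing harmful gets through, which is the asymmetry the comment describes.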

I think I have two main complaints still, on a skim.

First, I think the following is wrong:

These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it's not clear that they're easier than inner and outer alignment.

I think outer and inner alignment both go against known/suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more nat...

I think these are great counterpoints. Thanks for making them.

I still buy "the helicopter parent 'outer alignment' training regime is unwise for 'aligning' kids" and that deliberate parenting is better than chance. But possibly/probably not the primary factor. I haven't yet read much data here so my views feel relatively unconstrained, beyond my "common sense."

I think there's an additional consideration with AI, though: We control the reward circuitry. If lots of variance in kid-alignment is due to genetic variation in reward circuitry or learning hyperparameters or whatever, then we also control that with AI; that is also part of understanding AI inductive biases.

I don't know.

Speculatively, jealousy responses/worries could be downstream of imitation/culture (which "raises the hypothesis"/has self-supervised learning ingrain the completion, such that now the cached completion is a consequence which can be easily hit by credit assignment / upweighted into a real shard). Another source would be negative reward events on outcomes where you end up alone / their attentions stray. Which, itself, isn't from simple reward circuitry, but a generalization of other learned reward events which I expect are themselves downstream of simple reward circuitry. (Not that that reduces my confusion much)

Argument that you can't use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart as (or smarter than) the plan-generator.

Consider any situation where it's hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can't just sit by or deploy some simple traps in this situation.

4Rohin Shah11d
The main hope is to have the ELK solution be at least as smart as the plan-generator. See mundane solutions to exotic problems [https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7]:

Before reading, here are my reactions to the main claims:

• Not only does RL not, by default, produce policies which have reward maximization as their behavioral objective, but in fact I argue that it is not possible for RL policies to care about “reward” in an embedded setting. (agreed under certain definitions of "reward")
• I argue that this does not imply that wireheading in RL agents is impossible, because wireheading does not mean “the policy has reward as its objective”. It is still possible for an RL agent to wirehead, and in fact, is a hig

I'm often asked, Why "shard theory"? I suggested this name to Quintin when realizing that human values have the type signature of contextually activated decision-making influences. The obvious choice, then, was to call these things "shards of value"—drawing inspiration from Eliezer's Thou art godshatter, where he originally wrote "shard of desire."

(Contrary to several jokes, the choice was not just "because 'shard theory' sounds sick.")

This name has several advantages. Value-shards can have many subshards/facets which vary contextually (a real crysta...

I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket "no jumping jacks ever" rule, this trade is less costly to other shards and allows more efficient trades to occur.
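The jumping-jacks trade can be sketched numerically. This is a toy under invented assumptions: three situations, an "exercise" shard that gains from each jumping-jack session, and an "embarrassment" shard that loses only when a session happens in front of peers. The payoff sizes and rule names are hypothetical.

```python
# Toy sketch: a contextual penalty costs other shards less than a
# blanket rule. Contexts, payoffs, and names are hypothetical.
from typing import Callable, Dict, List

contexts: List[str] = ["alone", "alone", "in_front_of_peers"]

EXERCISE_GAIN = 1.0        # exercise shard's gain per session
EMBARRASSMENT_COST = 2.0   # embarrassment shard's loss per public session

def blanket_rule(context: str) -> bool:
    """'No jumping jacks ever': never do them."""
    return False

def contextual_rule(context: str) -> bool:
    """Avoid jumping jacks only in front of peers."""
    return context != "in_front_of_peers"

def shard_payoffs(rule: Callable[[str], bool]) -> Dict[str, float]:
    """Total payoff to each shard if the rule decides every context."""
    exercise = sum(EXERCISE_GAIN for c in contexts if rule(c))
    embarrassment = -sum(EMBARRASSMENT_COST for c in contexts
                         if rule(c) and c == "in_front_of_peers")
    return {"exercise": exercise, "embarrassment": embarrassment}

blanket = shard_payoffs(blanket_rule)
contextual = shard_payoffs(contextual_rule)
print(blanket, contextual)
```

Both rules fully protect the embarrassment shard, but only the contextual rule lets the exercise shard keep its gains in private, so it is the cheaper concession for the other shard to grant.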

A dangerous intuition pump here would be something like, "If you take a human who was trained really hard in childhood to have faith in God and show epistemic deference to the Bible, and inspecting the internal contents of their thought at age 20 showed that they still had great faith, if you kept amping up that human's intelligence their epistemology would at some point explode"

Yes, a value grounded in a factual error will get blown up by better epistemics, just as "be uncertain about the human's goals" will get blown up by your beliefs getting their entr...

You seem to mostly be imagining a third category:

3. If you optimize a model to be corrigible in one situation, how likely is it to still be corrigible in a new situation?

I don't care about question 3. It's been more than 4 years since I even seriously discussed the possibility of learning on a mechanism like that, and even at that point it was not a very serious discussion.

"Don't care" is quite strong. If you still hold this view -- why don't you care about 3? (Curious to hear from other people who basically don't care about 3, either.)

4Paul Christiano16d
Yeah, "don't care" is much too strong. This comment was just meant in the context of the current discussion. I could instead say: However, I agree that there are lots of approaches to building AI that rely on some kind of generalization of corrigibility, and that studying those is interesting and I do care about how that goes. In the context of this discussion I also would have said that I don't care about whether honesty generalizes. But that's also something I do care about even though it's not particularly relevant to this agenda (because the agenda is attempting to solve alignment under considerably more pessimistic assumptions).

I agree that similar environments are important, but I don't see why you think they explain most of the outcomes. What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"?

Like, what it feels like to understand human inductive biases isn't to think "Gee, I understand inductive biases!". It's more like: "I see that my son just scowled after agreeing to clean his room. This provides evidence about his internal motivational composition, even though I can't do interpre...

What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"?

Mimicking adult behavior even when the adult isn't paying any attention to the child (and children with different genes having slightly different sorts of mimicry). Automatically changing purity norms in response to disease and perceived disease risk. Having a different outlook on the world if you always had plenty of food growing up. Children of athletic parents often being athletic too, which changes how they...

Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking loud in the future. Hence, RLHF in a human context generates deception.

Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say "RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation." And then, of course, we should compare t...

Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards).

I could (and did) hope that I could sp...

I am a time-traveler who came back from the world where it (super duper predictably) turned out that a lot of early bright hopes didn't pan out and various things went WRONG and alignment was HARD and it was NOT SOLVED IN ONE SUMMER BY TEN SMART RESEARCHERS

I think these kinds of comments update readers' beliefs in a bad, invalid way. The bad event (AGI ruin) is argued for by... a request for me to condition on testimony of a survivor of that bad event. Yes, I know the whole thing is tongue-in-cheek. I know that EY is not literally claiming to be a time-tra...

"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:

Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profo

The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.

This isn't clearly true to me, at least for one possible interpretation: "If the AI knows a human is present, the AI's behavior ch...

4Evan Hubinger23d
Yeah, that's a good point—I agree that the thing I said was a bit too strong. I do think there's a sense in which the models you're describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you're describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.

I haven't fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!

(If anyone is interested in doing research on the evolution of prosociality vs antisocialness in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)

I encourage applicants to also read Quintin's Evolution is a bad analogy for AGI (which I wish more people had read, I think it's quite important). I think that evolution-based analogies ...

I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.

Like, you said:

shard theory resembles RLHF, and seems to share its flaws

So, if some alignment theory says "this approach (e.g. RLHF) is flawed and probably won't produce human-compatible values", and we notice "shard theory resembles RLHF", then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I'd update against the alignment theory / re...

2tailcalled23d
How about this one: Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, then that generates FUD among the users, which discourages you from speaking loud in the future. Hence, RLHF in a human context generates deception. Not necessarily a general deception shard, just it spawns some sort of shard that repeats similar things to what it did before, which presumably means more often erroneously making us believe it saved a person's life. Whether that's deception or approval-seeking or donating-to-charities-without-regard-for-effectiveness or something else. Your post points out that you can do all sorts of things in theory if you "have enough write access to fool credit assignment". But that's not sufficient to show that they can happen in practice. You gotta propose a system of write access and training to use this write access to do what you are proposing. Would you not agree that models are unaligned by default, unless there is something that aligns them?

it's just a huge gap from "Rewards have unpredictable effects on agent's cognition, not necessarily to cause them to want reward" to "we have a way to use RL to interpret and implement human wishes."

So, OP said

In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.

I read into this a connotation of "In general, there isn't a practically-findable way to use RL...". I'm now leaning towards my original interpretation being wrong -- that you meant som... (read more)

I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.

I think this tells us relatively little about the internal cognition, and so is a relatively non-actionable fact (which you probably agree with?). But I want to sort out my thoughts more, here, before laying down more of my intuitions on that.

Related clarifying question:

To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit[...]

The shareholde

similar to how if you read about enough social hacks you'll probably be a bit scammy even tho you like people and don't want to scam them

IDK if this is causally true or just evidentially true. I also further don't know why it would be mechanistically relevant to the heuristic you posit.

Rather, I think that agents might end up with this heuristic at first, but over time it would get refined into "try new things which [among other criteria] aren't obviously going to cause bad value drift away from current values." One reason I expect the refinement in ... (read more)

I regret that this post doesn't focus on practical advice derived from shard theory. Instead, I mostly focused on a really cool ideal-agency trick ("pretend really hard to wholly fool your own credit assignment"), which is cool but impracticable for real people (joining the menagerie currently inhabited by e.g. logical inductors, value handshakes, and open-source game theory).

I think that shard theory suggests a range of practical ways to improve your own value formation and rationality. For example, suppose I log in and see that my friend John compl... (read more)

As a general approach to avoiding value drift

One interpretation of this phrase is that we want AI to generally avoid value drift -- to get good values in the AI, and then leave it. (This probably isn't what you meant, but I'll leave a comment for other readers!) For AI and for humans, value drift need not be bad. In the human case, going to anger management can be humanely-good value drift. And human-aligned shards of a seed AI can deliberately steer into more situations where the AI gets rewarded while helping people, in order to reinforce the human-aligned coalitional weight.

I'm interested in why you doubt this? I can imagine various interpretations of the quote which I doubt, and some which are less doubtful-to-me.

2Charlie Steiner24d
The reason babies grow up into people that share our values has very little to do with our understanding of their inductive biases (i.e. most of the work is done by gene-environment interactions with parts of the environment that aren't good at predicting the child's moral development). The primary meaning of this comment is pointing out that a particular statement about children is wrong in a kind-of-funny way. I have this sort of humorous image of someone raising a child, saying "Phew, thank goodness I had a good understanding of my child's inductive biases, I never could have gotten them to have similar values to me just by passing on half of my genes to them and raising them in an environment similar to the one I grew up in."

Overall I think "simulators" names a useful concept. I also liked how you pointed out and deconfused type errors around "GPT-3 got this question wrong." Other thoughts:

I wish that you had more strongly ruled out "reward is the optimization target" as an interpretation of the following quotes:

RL’s archetype of an agent optimized to maximize free parameters (such as action-trajectories) relative to a reward function.

...

Simulators like GPT give us methods of instantiating intelligent processes, including goal-directed agents, with methods other than optimizi

what exactly is wrong with saying the reward function I described above captures what I really want?

Well, first of all, that reward function is not outer aligned to TTT, by the following definition:

“My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.”

There exist models which just wirehead or set the reward to +1 or show ... (read more)

I think that evolution is not the relevant optimizer for humans in this situation. Instead consider the within-lifetime learning that goes on in human brains. Humans are very probably reinforcement learning agents in a relevant sense; in some ways, humans are the best reinforcement learning agents we have ever seen.

I read Yudkowsky as positing some kind of conservation law. Something like, if the plans produced by your AI succeed at having specifically chosen far-reaching consequences if implemented, then the AI must have done reasoning about far-reaching consequences.

Why this seems true:

1. Any planning process which robustly succeeds must behave differently in the presence of different latent problems.
   1. If I'm going to the store and one of two routes may be closed down, and I want to always arrive at the store, my plan must somehow behave differently in the p

4Evan Hubinger1mo
Yep, I agree with that. That's orthogonal to myopia as I use the term, though.

In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.

I feel confused by this sentence. Reward is not the optimization target. Reward provides cognitive updates to the agent. ETA: So, shouldn't wisely-selected reward schedules produce good cognitive updates, which produces a mind which implements human wishes?

3Paul Christiano1mo
I don't think we know how to pick rewards that would implement human wishes. It's great if people want to propose wise strategies and then argue or demonstrate that they have that effect. On the other side: if you have an easily measurable reward function, then you can find a policy that gets a high reward by using RL (essentially gradient descent on "how much reward does the policy get.") I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.
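To make the "gradient descent on how much reward the policy gets" point concrete, here is a minimal sketch, assuming a toy two-armed bandit with fixed rewards and a softmax policy trained with REINFORCE. None of this setup comes from the thread: the function names, reward values, and hyperparameters are hypothetical and chosen only for illustration. The thing to notice is that reward enters only as a scalar that scales the parameter update; the policy itself contains no representation of reward.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a stateless bandit with fixed per-arm rewards.

    Hypothetical toy setup: the 'policy' is just one logit per arm.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = arm_rewards[action]
        # d/d(logit_i) of log pi(action) is 1[i == action] - probs[i].
        # Reward only scales this update; it is never an input to the policy.
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


# Arm 0 pays 0.0, arm 1 pays 1.0 (illustrative values).
final_probs = train_bandit([0.0, 1.0])
print(final_probs)
```

After training, the policy puts most of its probability on the higher-paying arm, so it "in fact gets a lot of reward on distribution." Whether anything inside a larger learned system is "trying to get reward" is a separate question that this toy example cannot settle.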

Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.

In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.

Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), b... (read more)

2Steve Byrnes1mo

I think a lot (but probably not all) of the standard objections don't make much sense to me anymore. Anyways, can you say more here, so I can make sure I'm following?

If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.

(A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)

5tailcalled1mo
So I guess if we want to be concrete, the most obvious place to start would be classical cases where RLHF has gone wrong. Like a gripper pretending to pick up an object by placing its hand in front of the camera, or a game-playing AI pretending to make progress by replaying the same part of the game over and over again. Though these are "easy" in the sense that they seem correctable by taking more context into consideration. One issue with giving concrete examples is that I think nobody has gotten RLHF to work in problems that are too "big" for humans to have all the context. So we don't really know how it would work in the regime where it seems irreparably dangerous. Like I could say "what if we give it the task of coming up with plans for an engineering project and it has learned to not make pollution that causes health problems obvious? Due to previously having suggested a design with obvious pollution and having that design punished", but who knows how RLHF will actually be used in engineering?

What do power differentials have to do with the kind of mechanistic training story posited by shard theory?

The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person's brain, such that the post-reinforcement person is incrementally "more prosocial." But the important part isn't "feedback signals from other people with ~equal power", it's the transduced reinforcement events which increase prosociality.

So let's figure out how to supply good reinforce... (read more)

4tailcalled1mo
I guess to me, shard theory resembles RLHF, and seems to share its flaws (unless this gets addressed in a future post or I missed it in one of the existing posts or something). So for instance learning values by reinforcement events seems likely to lead to deception. If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced. This doesn't become much of a problem in practice among humans (or well, it actually does seem to be a fairly significant problem, but not x-risk level significant), but the most logical reinforcement-based reason I can see why it doesn't become a bigger problem is that people cannot reliably deceive each other. (There may also be innate honesty instincts? But that runs into genome inaccessibility problems.) These seem like standard objections around here so I assume you've thought about them. I just don't notice those thoughts anywhere in the work.

I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?

I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard eco... (read more)

1Edouard Harris1mo
Got it. That makes sense, thanks!

3Kaj Sotala1mo
What's your take on "parts work" techniques like IDC, IFS, etc. seeming to bring up something like [https://www.lesswrong.com/posts/YXBpBCNC66daaofoY/my-current-take-on-internal-family-systems-parts] private (or at least not completely shared) world models? Do you consider the kinds of "parts" those access as being distinct from shards? I would find it plausible to assume by default that shards have something like differing world models since we know from cognitive psychology that e.g. different emotional states tend to activate similar memories (easier to remember negative things about your life when you're upset than if you are happy), and different emotional states tend to activate different shards. I suspect that something like the Shadlen & Shohamy take on decision-making [https://www.lesswrong.com/posts/7zQPYQB5EeaqLrhBh/subagents-neural-turing-machines-thought-selection-and] might be going on: Under that view, I think that shards would effectively have separate world models, since each physically separate memory network suggesting that an action is good or bad is effectively its own shard; and since a memory network is a miniature world model, there's a sense in which shards are nothing but separate world models. E.g. the memory of "licking the juice tasted sweet" is a miniature world model according to which licking the juice lets you taste something sweet, and is also a shard. (Or at least it forms an important component of a shard.) That miniature world model is separate from the shard/memory network/world model holding instances of times when adults taught the child to say "thank you" when given something; the latter shard only has a world model of situations where you're expected to say "thank you", and no world model of the consequences of licking juice.

What I think you do mean:

This is an excellent guess and correct (AFAICT). Thanks for supplying so much interpretive labor!

What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."

I'd say our position contrasts with "A substantial portion of human value formation is genetically pre-determined in a complicated way, such that values are more like adaptations and less like exaptations—more like contextually-activ... (read more)

Are you asking about the relevance of understanding human value formation? If so, see Humans provide an untapped wealth of evidence about alignment. We know of exactly one form of general intelligence which grows human-compatible values: humans. So, if you want to figure out how human-compatible values can form at all, start by understanding how they have formed empirically.

But perhaps you're asking something like "how does this perspective imply anything good for alignment?" And that's something we have deliberately avoided discussing for now. More in future posts.

3tailcalled1mo
I'm basically re-raising the point I asked about in your linked post; the alignability of sharded humans seems to be due to people living in a society that gives them feedback on their behavior that they have to follow. This allows cooperative shards to grow. It doesn't seem like it would generalize to more powerful beings.

This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is "hardcoded" in the genome.

I'd also imagine that mathematical skill is heritable. [Finds an article on Google Scholar] The abstract of https://doi.org/10.1037/a0015115 seems to agree. Yet due to information inaccessibility and lack of selection pressure ancestrally, I infer math ability probably isn't hardcoded.

There are a range of possible explanations which reconcile these two observations, like "better gen... (read more)

Research-guiding heuristic: "What would future-TurnTrout predictably want me to get done now?"

A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"

So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update t... (read more)