In a human mind, a lot of cognition happens in diffuse, illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I'd feel a lot better about where ML is going if some of the steps in these systems' cognition looked like low-bandwidth token streams, rather than giant vectors.
In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that's the part which is a pointer problem.
Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
the pointer problem part is roughly "specify what I mean well enough that I could use the specification to get an AI to do what I mean", assuming problems like "get AI to follow specification" can be solved.
As an example, here are three possible reactions to a no-ghost update:
Suppose that many (EDIT: a few) of your value shards take as input the ghost latent variable in your world model. You learn ghosts aren't real. Let's say this basically sets the ghost-related latent variable's value to false in all shard-relevant contexts. Then it seems perfectly fine that most of your shards keep on bidding away and determining your actions (e.g. protect your family), since most of your value shards are not in fact functions of the ghost latent variable. While it's indeed possibl…
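A minimal sketch of that claim (every name here is a hypothetical illustration, not a real implementation): shards read world-model variables and emit action bids, so flipping the ghost variable to false only silences the shards that actually take it as input.

```python
# Toy model: shards read world-model latent variables and emit action bids.
# All names here are hypothetical illustrations.

world_model = {"ghost_present": True, "family_in_danger": True}

def appease_ghosts_shard(wm):
    # Bids only when its input variable is active.
    return {"perform_ritual": 1.0} if wm["ghost_present"] else {}

def protect_family_shard(wm):
    # Never reads the ghost variable, so it is unaffected by the update.
    return {"guard_house": 2.0} if wm["family_in_danger"] else {}

def decide(wm, shards):
    bids = {}
    for shard in shards:
        for action, strength in shard(wm).items():
            bids[action] = bids.get(action, 0.0) + strength
    return max(bids, key=bids.get) if bids else None

shards = [appease_ghosts_shard, protect_family_shard]
print(decide(world_model, shards))    # guard_house (the ghost shard is outbid)
world_model["ghost_present"] = False  # the "no-ghost update"
print(decide(world_model, shards))    # guard_house, unchanged
```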
I think that both the easy and hard problems of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not …
The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. In a sense, this means having aligned goals without having the same goals: your goal is to cooperate with "human goals", but you don't yet have a full description of what human goals are. Your value function might be much simpler than the human value function.
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.
Insofar as I understa…
I feel confused. I think this comment is overall good (though I don't think I understand a third of it), but it doesn't seem to suggest that the genome actually solved information inaccessibility in the form of reliably locating learned WM concepts in the human brain?
But, do you really fundamentally care that your kids have genomes?
Seems not relevant? I think we're running into an under-definition of IGF (and the fact that it doesn't actually have a utility function, even over local mutations on a fixed genotype). Does IGF have to involve genomes, or just information patterns as written in nucleotides or in binary? The "outer objective" of IGF suffers a classic identifiability issue common to many "outer objectives", where the ancestral "training signal" history is fully compatible with "IGF just for genomes" and also …
I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus...
And insofar as this impression is correct, this is a mistake. There is only one way alignment is.
If inner/outer is altogether a more faithful picture of those dynamics:
FWIW I think the most important distinction in "alignment" is aligning with somebody's preferences versus aligning with what is actually good, and I increasingly have the sense that the former does not lead in any limit to the latter.
I have an upcoming post which might be highly relevant. Many proposals which black-box human judgment / model humans, aren't trying to get an AI which optimizes what people want. They're getting an AI to optimize evaluations of plans—the quotation of human desires, as quoted via those evaluations. And I think that's a subtle distinction which can prove quite fatal.
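A minimal sketch of that distinction (all values and names are made-up illustrations): the planner below optimizes a human evaluation of plans, not the underlying desiderata, so any plan whose *evaluation* beats honest plans wins.

```python
# Toy illustration: optimizing evaluations-of-plans vs. plans themselves.
# `true_value` stands in for what people actually want; the planner never sees it.

plans = ["build_hospital", "fake_report_of_hospital", "do_nothing"]

def true_value(plan):
    # Inaccessible ground truth.
    return {"build_hospital": 10, "fake_report_of_hospital": -50, "do_nothing": 0}[plan]

def human_evaluation(plan):
    # What the planner actually optimizes. A strong planner can find plans whose
    # evaluation exceeds that of genuinely good plans, e.g. a fake report
    # convincing enough to be scored slightly higher than the real thing.
    return {"build_hospital": 10, "fake_report_of_hospital": 11, "do_nothing": 0}[plan]

best = max(plans, key=human_evaluation)
print(best, true_value(best))  # fake_report_of_hospital -50
```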
I think that's also a good thing to think about, but most of the meat is in how you actually reason about that and how it leads to superior or at least adequate+complementary predictions about the behavior of ML systems. I think that, to the extent this perspective is useful for alignment, it also ought to be useful for reasoning about the behavior of existing systems like large language models.
Sure. To clarify, superior to what? "GPT-3 reliably minimizes prediction error; it is inner-aligned to its training objective"?
…because I don't know what alignment means that I think it's helpful to have some hand-hold terms like "alignment"
Do you mean "outer/inner alignment"?
Supposing you mean that—I agree that it's good to say "and I'm confused about this part of the problem", while also perhaps saying "assuming I've formulated the problem correctly at all" and "as I understand it."
I don't really disagree with anything you've written, but, in general, I think we should allow some of our words to refer to "big confusing problems" that we don't yet know how to clarify, because we s…
I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here if you're interested in working with me and Quintin.
I was talking with Abram Demski today about a promising-seeming research direction. (The following is my own recollection.)
One of my (TurnTrout's) reasons for alignment optimism is that I think:
I think I have two main complaints still, on a skim.
First, I think the following is wrong:
These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it's not clear that they're easier than inner and outer alignment.
I think outer and inner alignment both go against known/suspected grains of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory is aiming to address a hard problem more nat…
I think these are great counterpoints. Thanks for making them.
I still buy "the helicopter parent 'outer alignment' training regime is unwise for 'aligning' kids" and that deliberate parenting is better than chance, but possibly/probably not the primary factor. I haven't yet read much data here, so my views feel relatively unconstrained, beyond my "common sense."
I think there's an additional consideration with AI, though: we control the reward circuitry. If lots of variance in kid-alignment is due to genetic variation in reward circuitry or learning hyperparameters or whatever, then we also control that with AI; that is also part of understanding AI inductive biases.
I don't know.
Speculatively, jealousy responses/worries could be downstream of imitation/culture (which "raises the hypothesis"/has self-supervised learning ingrain the completion, such that the cached completion is now a consequence which can be easily hit by credit assignment / upweighted into a real shard). Another source would be negative reward events on outcomes where you end up alone / their attentions stray. Which, itself, isn't from simple reward circuitry, but from a generalization of other learned reward events which I expect are themselves downstream of simple reward circuitry. (Not that that reduces my confusion much.)
Argument that you can't use a boundedly intelligent ELK solution to search over plans to find one which keeps the diamond in the vault. That is, the ELK solution probably would have to be at least as smart as (or smarter than) the plan-generator.
Consider any situation where it's hard to keep the diamond in the vault. Then any successful plan will have relatively few degrees of freedom. Like, a bunch of really smart thieves will execute a cunning plot to extract the diamond. You can't just sit by or deploy some simple traps in this situation.
Therefore, any pla…
Before reading, here are my reactions to the main claims:
Not only does RL not, by default, produce policies which have reward maximization as their behavioral objective, but in fact I argue that it is not possible for RL policies to care about "reward" in an embedded setting. (Agreed, under certain definitions of "reward.")

I argue that this does not imply that wireheading in RL agents is impossible, because wireheading does not mean "the policy has reward as its objective." It is still possible for an RL agent to wirehead, and in fact, is a hig…
I'm often asked, Why "shard theory"? I suggested this name to Quintin when realizing that human values have the type signature of contextually activated decision-making influences. The obvious choice, then, was to call these things "shards of value"—drawing inspiration from Eliezer's Thou art godshatter, where he originally wrote "shard of desire."
(Contrary to several jokes, the choice was not just "because 'shard theory' sounds sick.")
This name has several advantages. Value-shards can have many subshards/facets which vary contextually (a real crysta…
I think that shards will cast contextual shadows into the factors of a person’s equilibrated utility function, because I think the shards are contextually activated to begin with. For example, if a person hates doing jumping jacks in front of a group of her peers, then that part of herself can bargain to penalize jumping jacks just in those contexts in the final utility function. Compared to a blanket "no jumping jacks ever" rule, this trade is less costly to other shards and allows more efficient trades to occur.
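A toy rendering of that bargaining claim (all numbers hypothetical): the context-gated penalty costs the exercise-valuing shards nothing outside the feared context, while the blanket rule forgoes that value everywhere.

```python
# Toy bargained utility functions: contextual vs. blanket jumping-jack penalties.

def base_value(s):
    # Other shards (fitness, fun) place value on doing jumping jacks at all.
    return 1.0 if s["jumping_jacks"] else 0.0

def contextual_penalty(s):
    return -5.0 if s["jumping_jacks"] and s["peers_present"] else 0.0

def blanket_penalty(s):
    return -5.0 if s["jumping_jacks"] else 0.0

def realized_value(penalty):
    # In each context, the agent does jumping jacks only if it's net-positive.
    total = 0.0
    for peers in (True, False):
        do = {"jumping_jacks": True, "peers_present": peers}
        skip = {"jumping_jacks": False, "peers_present": peers}
        total += max(base_value(do) + penalty(do), base_value(skip) + penalty(skip))
    return total

print(realized_value(contextual_penalty))  # 1.0 -- still exercises alone
print(realized_value(blanket_penalty))     # 0.0 -- all exercise value lost
```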
A dangerous intuition pump here would be something like, "If you take a human who was trained really hard in childhood to have faith in God and show epistemic deference to the Bible, and inspecting the internal contents of their thought at age 20 showed that they still had great faith, if you kept amping up that human's intelligence their epistemology would at some point explode"
Yes, a value grounded in a factual error will get blown up by better epistemics, just as "be uncertain about the human's goals" will get blown up by your beliefs getting their entr…
You seem to mostly be imagining a third category:
3. If you optimize a model to be corrigible in one situation, how likely is it to still be corrigible in a new situation?
I don't care about question 3. It's been more than 4 years since I even seriously discussed the possibility of learning on a mechanism like that, and even at that point it was not a very serious discussion.
"Don't care" is quite strong. If you still hold this view -- why don't you care about 3? (Curious to hear from other people who basically don't care about 3, either.)
I agree that similar environments are important, but I don't see why you think they explain most of the outcomes. What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"?
Like, what it feels like to understand human inductive biases isn't to think "Gee, I understand inductive biases!". It's more like: "I see that my son just scowled after agreeing to clean his room. This provides evidence about his internal motivational composition, even though I can't do interpre…
What's an example of a "gene-environment interaction with parts of the environment that aren't good at predicting the child's moral development"?
Mimicking adult behavior even when the adult isn't paying any attention to the child (and children with different genes having slightly different sorts of mimicry). Automatically changing purity norms in response to disease and perceived disease risk. Having a different outlook on the world if you always had plenty of food growing up. Children of athletic parents often being athletic too, which changes how they…
Sometimes in software development, you may be worried that there is a security problem in the program you are making. But if you speak out loud about it, that generates FUD among the users, which discourages you from speaking out in the future. Hence, RLHF in a human context generates deception.
Thanks for the example. The conclusion is far too broad and confident, in my opinion. I would instead say "RLHF in a human context seems to have at least one factor which pushes for deception in this kind of situation." And then, of course, we should compare t…
Shard theory suggests that goals are more natural to specify/inculcate in their shard-forms (e.g. if around trash and a trash can, put the trash away), and not in their (presumably) final form of globally activated optimization of a coherent utility function which is the reflective equilibrium of inter-shard value-handshakes (e.g. a utility function over the agent's internal plan-ontology such that, when optimized directly, leads to trash getting put away, among other utility-level reflections of initial shards).
I could (and did) hope that I could sp…
I am a time-traveler who came back from the world where it (super duper predictably) turned out that a lot of early bright hopes didn't pan out and various things went WRONG and alignment was HARD and it was NOT SOLVED IN ONE SUMMER BY TEN SMART RESEARCHERS
I think these kinds of comments update readers' beliefs in a bad, invalid way. The bad event (AGI ruin) is argued for by... a request for me to condition on testimony of a survivor of that bad event. Yes, I know the whole thing is tongue-in-cheek. I know that EY is not literally claiming to be a time-tra…
"Globally activated consequentialist reasoning is convergent as agents get smarter" is dealt an evidential blow by von Neumann:
Although von Neumann unfailingly dressed formally, he enjoyed throwing extravagant parties and driving hazardously (frequently while reading a book, and sometimes crashing into a tree or getting arrested). He once reported one of his many car accidents in this way: "I was proceeding down the road. The trees on the right were passing me in orderly fashion at 60 miles per hour. Suddenly one of them stepped in my path." He was a profo…
The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.
This isn't clearly true to me, at least for one possible interpretation: "If the AI knows a human is present, the AI's behavior ch…
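For concreteness, here is a minimal sketch of the test the quoted passage describes (the `policy(situation, overseen=...)` interface is hypothetical, not a real API). The disagreement above is about interpreting a nonzero score, since a non-deceptive agent can also legitimately condition on whether a human is present.

```python
# Sketch of an oversight-sensitivity probe over paired inputs that differ
# only in whether oversight is indicated.

def oversight_sensitivity(policy, situations):
    flips = 0
    for s in situations:
        if policy(s, overseen=True) != policy(s, overseen=False):
            flips += 1
    # Fraction of situations where behavior changes with the oversight flag.
    return flips / len(situations)
```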
I haven't fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!
(If anyone is interested in doing research on the evolution of prosocality vs antisocialness in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)
I encourage applicants to also read Quintin's Evolution is a bad analogy for AGI (which I wish more people had read; I think it's quite important). I think that evolution-based analogies …
I was more wondering about situations in humans which you thought had the potential to be problematic, under the RLHF frame on alignment.
Like, you said:
shard theory resembles RLHF, and seems to share its flaws
So, if some alignment theory says "this approach (e.g. RLHF) is flawed and probably won't produce human-compatible values", and we notice "shard theory resembles RLHF", then insofar as shard theory is actually true, RLHF-like processes are the only known generators of human-compatible values ever, and I'd update against the alignment theory / re…
it's just a huge gap from "Rewards have unpredictable effects on agent's cognition, not necessarily to cause them to want reward" to "we have a way to use RL to interpret and implement human wishes."
So, OP said
In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.
I read into this a connotation of "In general, there isn't a practically-findable way to use RL...". I'm now leaning towards my original interpretation being wrong -- that you meant som…
I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.
I think this tells us relatively little about the internal cognition, and so is a relatively non-actionable fact (which you probably agree with?). But I want to sort out my thoughts more, here, before laying down more of my intuitions on that.
Related clarifying question:
To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit[...]
similar to how if you read about enough social hacks you'll probably be a bit scammy even tho you like people and don't want to scam them
IDK if this is causally true or just evidentially true. I also further don't know why it would be mechanistically relevant to the heuristic you posit.
Rather, I think that agents might end up with this heuristic at first, but over time it would get refined into "try new things which [among other criteria] aren't obviously going to cause bad value drift away from current values." One reason I expect the refinement in …
I regret that this post doesn't focus on practical advice derived from shard theory. Instead, I mostly focused on a really cool ideal-agency trick ("pretend really hard to wholly fool your own credit assignment"), which is impracticable for real people (joining the menagerie currently inhabited by e.g. logical inductors, value handshakes, and open-source game theory).
I think that shard theory suggests a range of practical ways to improve your own value formation and rationality. For example, suppose I log in and see that my friend John compl…
As a general approach to avoiding value drift
One interpretation of this phrase is that we want AI to generally avoid value drift -- to get good values into the AI, and then leave them be. (This probably isn't what you meant, but I'll leave a comment for other readers!) For AI and for humans, value drift need not be bad. In the human case, going to anger management can be humanely-good value drift. And human-aligned shards of a seed AI can deliberately steer into more situations where the AI gets rewarded while helping people, in order to reinforce their own coalitional weight.
I'm interested in why you doubt this? I can imagine various interpretations of the quote which I doubt, and some which are less doubtful-to-me.
Overall I think "simulators" names a useful concept. I also liked how you pointed out and deconfused type errors around "GPT-3 got this question wrong." Other thoughts:
I wish that you had more strongly ruled out "reward is the optimization target" as an interpretation of the following quotes:
RL’s archetype of an agent optimized to maximize free parameters (such as action-trajectories) relative to a reward function.
Simulators like GPT give us methods of instantiating intelligent processes, including goal-directed agents, with methods other than optimizi…
what exactly is wrong with saying the reward function I described above captures what I really want?
Well, first of all, that reward function is not outer aligned to TTT, by the following definition:
“My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.”
-- Evan Hubinger, commenting on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures
There exist models which just wirehead or set the reward to +1 or show …
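One way to render the quoted definition formally (my paraphrase, not Hubinger's notation), with $J_r(M')$ standing for model $M'$'s performance under $r$ in the limit of perfect optimization and unlimited data:

$$\mathrm{OuterAligned}(r) \iff \forall M \in \operatorname*{arg\,max}_{M'} J_r(M') :\ \mathrm{Aligned}(M)$$

Since the condition is universally quantified, a single optimal-but-misaligned model, e.g. one that seizes the reward channel and sets reward to +1, is enough to falsify outer alignment of $r$.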
I think that evolution is not the relevant optimizer for humans in this situation. Instead consider the within-lifetime learning that goes on in human brains. Humans are very probably reinforcement learning agents in a relevant sense; in some ways, humans are the best reinforcement learning agents we have ever seen.
I read Yudkowsky as positing some kind of conservation law. Something like: if the plans produced by your AI would, if implemented, succeed at having specifically chosen far-reaching consequences, then the AI must have done reasoning about far-reaching consequences.
Why this seems true:
I feel confused by this sentence. Reward is not the optimization target. Reward provides cognitive updates to the agent. ETA: So, shouldn't wisely-selected reward schedules produce good cognitive updates, which produce a mind which implements human wishes?
Why do many people think RL will produce "agents", but maybe (self-)supervised learning ((S)SL) won't? Historically, the field of RL says that RL trains agents. That, of course, is no argument at all. Let's consider the technical differences between the training regimes.
In the modern era, both RL and (S)SL involve initializing one or more neural networks, and using the reward/loss function to provide cognitive updates to the network(s). Now we arrive at some differences.
Some of this isn't new (see Hidden Incentives for Auto-Induced Distributional Shift), b…
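To make the shared structure concrete, here's a minimal runnable sketch (a linear "network" and toy signals, purely illustrative): both regimes compute a scalar from the network's outputs, differentiate, and nudge the same kind of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)  # one linear "network" used in both regimes

def ssl_step(w, x, y, lr=0.1):
    # (Self-)supervised: the target comes from the data; descend squared error.
    pred = x @ w
    grad = 2.0 * (pred - y) * x         # d/dw (pred - y)^2
    return w - lr * grad

def rl_step(w, x, lr=0.1):
    # Policy gradient (REINFORCE, Bernoulli policy): reward scales the update
    # toward the log-probability of the action actually sampled.
    p = 1.0 / (1.0 + np.exp(-(x @ w)))  # probability of action 1
    a = float(rng.random() < p)         # sample an action
    reward = a                          # stand-in reward signal
    grad_logp = (a - p) * x             # d/dw log pi(a | x)
    return w + lr * reward * grad_logp

x = rng.normal(size=3)
w = ssl_step(w, x, y=1.0)
w = rl_step(w, x)
```

In both cases the reward/loss function is just a source of parameter updates, which is why "RL trains agents" needs an argument beyond the name of the field.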
I think a lot (but probably not all) of the standard objections don't make much sense to me anymore. Anyways, can you say more here, so I can make sure I'm following?
If there's some experience that deceives people to provide feedback signals that the behavior was prosocial, then it seems like the shard that leads to that experience will be reinforced.
(A concrete instantiated scenario would be most helpful! Like, Bob is talking with Alice, who gives him approval-reward of some kind when he does something she wants, and then...)
What do power differentials have to do with the kind of mechanistic training story posited by shard theory?
The mechanistically relevant part of your point seems to be that feedback signals from other people probably transduce into reinforcement events in a person's brain, such that the post-reinforcement person is incrementally "more prosocial." But the important part isn't "feedback signals from other people with ~equal power", it's the transduced reinforcement events which increase prosociality.
So let's figure out how to supply good reinforce…
I do have a question about your claim that shards are not full subagents. I understand that in general different shards will share parameters over their world-model, so in that sense they aren't fully distinct — is this all you mean? Or are you arguing that even a very complicated shard with a long planning horizon (e.g., "earn money in the stock market" or some such) isn't agentic by some definition?
I currently guess that even the most advanced shards won't have private world-models which they can query in relative isolation from the rest of the shard eco…
What I think you do mean:
This is an excellent guess and correct (AFAICT). Thanks for supplying so much interpretive labor!
What I think you intend to contrast this to: "Every detail of human values has to be specified in the genome - the complexity of the values and the complexity of the genome have to be closely related."
I'd say our position contrasts with "A substantial portion of human value formation is genetically pre-determined in a complicated way, such that values are more like adaptations and less like exaptations—more like contextually-activ…
Are you asking about the relevance of understanding human value formation? If so, see Humans provide an untapped wealth of evidence about alignment. We know of exactly one form of general intelligence which grows human-compatible values: humans. So, if you want to figure out how human-compatible values can form at all, start by understanding how they have formed empirically.
But perhaps you're asking something like "how does this perspective imply anything good for alignment?" And that's something we have deliberately avoided discussing for now. More in future posts.
This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is "hardcoded" in the genome.
I'd also imagine that mathematical skill is heritable. [Finds an article on Google Scholar] The abstract of https://doi.org/10.1037/a0015115 seems to agree. Yet due to information inaccessibility and lack of selection pressure ancestrally, I infer math ability probably isn't hardcoded.
There are a range of possible explanations which reconcile these two observations, like "better gen…
Research-guiding heuristic: "What would future-TurnTrout predictably want me to get done now?"
A problem with adversarial training. One heuristic I like to use is: "What would happen if I initialized a human-aligned model and then trained it with my training process?"
So, let's consider such a model, which cares about people (i.e. reliably pulls itself into futures where the people around it are kept safe). Suppose we also have some great adversarial training technique, such that we have e.g. a generative model which produces situations where the AI would break out of the lab without permission from its overseers. Then we run this procedure, update t…
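A runnable toy of the worry (everything here is hypothetical and deliberately crude): the update applies to whatever internal influences produced the flagged behavior, so the aligned motivation we assumed at initialization gets ground down along with everything else.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": weights on two internal decision-influences. Index 0 is the
# human-aligned motivation we assumed at initialization; index 1 is some
# other drive that sometimes endorses leaving the lab.
w = np.array([1.0, 1.0])

def escape_decision(situation, w):
    # In adversarially chosen situations, either influence can contribute to
    # "break out" (e.g. leaving to protect people from a perceived threat).
    contributions = w * situation
    return contributions, contributions.sum() > 1.0

for _ in range(200):
    # The adversary proposes situations where the model would break out.
    situation = rng.uniform(0.5, 1.0, size=2)
    contributions, escaped = escape_decision(situation, w)
    if escaped:
        # Update against whatever produced the behavior, in proportion to
        # how much each influence fired -- including the aligned one.
        w -= 0.01 * contributions

print(w)  # both weights shrink; the aligned influence is penalized too
```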