That is to say, prior to "simulators" and "shard theory", a lot of focus was on utility-maximizers--agents that do things like planning or search to maximize a utility function. But planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. RL agents that execute learned policies in deployment, policies which do not explicitly maximize reward at runtime but which were reinforced for getting reward during training.
FYI I do expect planning for smart agents, just not something qualitatively alignment-similar to "ar...
Lol, cool. I tried the "4 minute" challenge (without having read EY's answer, but having read yours).
...Hill-climbing search requires selecting on existing genetic variance on alleles already in the gene pool. If there isn’t a local mutation which changes the eventual fitness of the properties which that genotype unfolds into, then you won’t have selection pressure in that direction. On the other hand, gradient descent is updating live on a bunch of data in fast iterations which allow running modifications over the parameters themselves. It’s like being
Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I'm still confused why Scott concluded "oh I was just confused in this way" and then EY said "yup that's why you were confused", and I'm still like "nope Scott's question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset."
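To spell out the mechanistic contrast the quoted passage is drawing, here is a toy sketch (the objective, population sizes, and step sizes are all made up; it is not meant to capture any evo-bio detail): selection can only pick among variants already present (or one local mutation away), while gradient descent computes a direction of improvement from the loss itself and edits the parameters directly every step.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.array([1.0, -2.0, 0.5])

def loss(params):
    # Toy objective (hypothetical): distance from a fixed target vector.
    return float(np.sum((params - TARGET) ** 2))

# Hill-climbing selection: only existing variance plus local mutations is available.
population = [rng.normal(size=3) for _ in range(50)]
for _ in range(100):
    parents = sorted(population, key=loss)[:10]        # select on existing variance
    population = [p + rng.normal(scale=0.05, size=3)   # local mutations of survivors
                  for p in parents for _ in range(5)]

# Gradient descent: a local improvement direction is computed and applied in place.
params = rng.normal(size=3)
for _ in range(100):
    grad = 2 * (params - TARGET)   # analytic gradient of the toy loss
    params -= 0.1 * grad
```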
Thanks for leaving this comment, I somehow only just now saw it.
Given your pseudocode it seems like the only point of `planModificationSample` is to produce plan modifications that lead to high outputs of `self.diamondShard(self.WM.getConseq(plan))`. So why is that not "optimizing the outputs of the grader as its main terminal motivation"?
I want to make a use/mention distinction. Consider an analogous argument:
"Given gradient descent's pseudocode it seems like the only point of backward
is to produce parameter modifications that lead to low outputs of loss_fn
....
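To spell out the analogy as runnable code, here's a minimal PyTorch-style sketch (the toy model, data, and hyperparameters are mine, purely for illustration):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

x, y = torch.randn(32, 4), torch.randn(32, 1)
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # loss_fn is *used* to generate updates...
    loss.backward()              # ...backward produces parameter modifications
    opt.step()                   # ...that lead to low outputs of loss_fn, yet the
                                 # trained model contains no circuit whose goal is
                                 # "make loss_fn output a low number".
```

`backward` really is "just" producing loss-lowering parameter modifications, and yet nothing in the trained network thereby treats `loss_fn`'s output as its terminal motivation. That's the use/mention distinction.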
Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization -- i.e. effectively planning over a world model
FWIW I don't consider myself to be arguing against planning over a world model.
...the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry, but rather are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values.
Most human v
Strong upvoted. I appreciate the strong concreteness & focus on internal mechanisms of cognition.
Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) -> action mappings. EG a shard agent might have:
But a consequentialist would just reason about what happens, and not mess with those heuristics. (OFC, consequentialism would be a matter of degree)
In this way, changing a small set of decision-relevant features (e.g. "Br...
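A toy sketch of the contrast (every observation, heuristic, and helper below is invented for illustration): the shard-style actor maps observed features to action bids fairly directly, while the consequentialist routes every decision through predicted outcomes.

```python
# Shard-style actor: shallow (observation -> action) heuristics.
SHARD_HEURISTICS = {
    "see_juice": "approach",
    "see_cliff": "retreat",
}

def shard_agent(observation):
    return SHARD_HEURISTICS.get(observation, "explore")

def consequentialist(observation, world_model, score):
    # Deeper mapping: the chosen action depends on predicted consequences,
    # not on which surface feature happened to fire.
    actions = ["approach", "retreat", "explore"]
    return max(actions, key=lambda a: score(world_model(observation, a)))
```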
I think it's worth it to use magic as a term of art, since it's 11 fewer words than "stuff we need to remind ourselves we don't know how to do," and I'm not satisfied with "free parameters."
11 fewer words, but I don't think it communicates the intended concept!
If you have to say "I don't mean one obvious reading of the title" as the first sentence, it's probably not a good title. This isn't a dig -- titling posts is hard, and I think it's fair to not be satisfied with the one I gave. I asked ChatGPT to generate several new titles; lightly edited:
...
- "Unc
The Policy of Truth is a blog post about why policy gradient/REINFORCE sucks. I'm leaving a shortform comment because it seems like a classic example of wrong RL theory and philosophy, since reward is not the optimization target. Quotes:
Our goal remains to find a policy that maximizes the total reward after time steps.
And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions:
...If you start with a reward function whose values are in $[0,1]$ and you subtract one million
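For reference, the estimator being critiqued is the standard score-function (REINFORCE) gradient. Here's my gloss (not a quote from the post) on why the offset matters so much in practice, even though a constant shift leaves the expectation unchanged:

```latex
\nabla_\theta J(\theta)
  \;=\; \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right],
\qquad
\mathbb{E}_{\tau \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau) \right] \;=\; 0 .
```

So replacing $R(\tau)$ with $R(\tau) - 10^6$ leaves the expected gradient unchanged, but any single-sample estimate becomes enormously higher-variance, which is part of why the algorithm's practical behavior is so sensitive to the reward's zero point.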
"Magic," of course, in the technical sense of stuff we need to remind ourselves we don't know how to do. I don't mean this pejoratively, locating magic is an important step in trying to demystify it.
I think this title suggests a motte/bailey, and also seems clickbait-y. I think most people scanning the title will conclude you mean it in a pejorative sense, such that shard theory requires impossibilities or unphysical miracles to actually work. I think this is clearly wrong (and I imagine you to agree). As such, I've downvoted for the moment.
AFAICT y...
As a datapoint, I remember briefly talking with Eliezer in July 2021, where I said "If only we could make it really cringe to do capabilities/gain-of-function work..." (I don't remember which one I said). To which, I think he replied "That's not how human psychology works."
I now disagree with this response. I think it's less "human psychology" and more "our current sociocultural environment around these specific areas of research." EG genetically engineering humans seems like a thing which, in some alternate branches, is considered "cool" and "exciting", while being cringe in our branch. It doesn't seem like a predestined fact of human psychology that that field had to end up being considered cringe.
The existence of the human genome yields at least two classes of evidence which I'm strongly interested in.
...Still on the topic of deception, there are arguments suggesting that something like GPT will always be "deceptive" for Goodhart's Law and Siren World reasons. We can only reward an AI system for producing answers that look good to us, but this incentivizes the system to produce answers that look increasingly good to us, rather than answers that are actually correct. "Looking good" and "being correct" correlate with each other to some extent, but will eventually be pushed apart once there's enough optimization pressure on the "looking good" part.
As such, th
Thanks for registering a guess! I would put it as: a grader optimizer is something which is trying to optimize the outputs of a grader as its terminal end (either de facto, via argmax, or via intent alignment, as in "I wanna search for plans which make this function output a high number"). Like, the point of the optimization is to make the number come out high.
(To help you checksum: It feels important to me that "is good at achieving its goals" is not tightly coupled to "approximating argmax", as I'm talking about those terms. I wish I had fast ways of communicating my intuitions here, but I'm not thinking of something more helpful to say right now; I figured I'd at least comment what I've already written.)
On testing, however, the retrained MB* does not show any such visible inclination. In retrospect, that made sense: it relied on the assumption that the internal representation of the objective is bidirectional, i.e. that the parameter-reward mapping is linear. A high-level update signal in one direction doesn't entail that the inverted signal produces the inverted behavior. This direction was a bust, but it was useful for me to make incorrect implicit assumptions like this more explicit.
I think it's improbable that agents internalize a single obje...
Yeah, IMO "RL at scale trains search-based mesa optimizers" hypothesis predicts "solving randomly generated mazes via a roughly unitary mesa objective and heuristic search" with reasonable probability, and that seems like a toy domain to me.
TurnTrout and Garrett have a post about how we shouldn't make agents that choose plans by maximizing against some "grader" (i.e. utility function), because it will adversarially exploit weaknesses in the grader.
To clarify: A "grader" is not just "anything with a utility function", or anything which makes subroutine calls to some evaluative function (e.g. "how fun does this plan seem?"). A grader-optimizer is not an optimizer which has a grader. It is an optimizer which primarily wants to maximize the evaluations of a grader. Compare
I'm going to just reply with my gut responses here, hoping this clarifies how I'm considering the issues. Not meaning to imply we agree or disagree.
which will include agents that maximize the reward in most situations way more often than if you select a random agent.
Probably, yeah. Consider a network which received lots of policy gradients from the cognitive-update-intensity-signals ("rewards"[1]) generated by the "go to coin?" subroutine. I agree that this network will, in the deployment distribution, tend to take actions which average higher sum-...
Strong-upvote, strong disagreevote, thanks so much for writing this out :) I'm not going to reply more until I've cleared up this thread with you (seems important, if you think the pseudocode was a grader-optimizer, well... I think that isn't obviously doomed).
Grader optimizers are not optimizers which use a grader. They are actors which optimize the outputs of the grader as their main terminal motivation. So, grader-optimizers don't have diamond shards. They have diamond shard-shards.
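If it helps, here's a deliberately crude toy contrast (every plan, evaluator, and heuristic below is made up for illustration; this is a sketch of the distinction, not anyone's proposed design):

```python
def grader(plan):
    # Hypothetical stand-in for "how many diamonds do I predict this plan yields?"
    return plan.count("diamond")

def values_executor(plans):
    # Diamond-shard-flavored actor: consults the evaluation as one input among
    # several contextual influences, and isn't trying to maximize the number.
    acceptable = [p for p in plans if "steal" not in p]   # contextual bid against theft
    return next((p for p in acceptable if grader(p) > 0), acceptable[0])

def grader_optimizer(plans):
    # Terminal criterion: make the grader's number come out as high as possible,
    # including via plans that merely pump the evaluator.
    return max(plans, key=grader)

plans = ["mine one diamond", "steal diamond diamond diamond from museum", "go home"]
print(values_executor(plans))   # "mine one diamond"
print(grader_optimizer(plans))  # "steal diamond diamond diamond from museum"
```

The first actor uses an evaluation while deciding; the second actor's decision criterion just *is* "make the evaluator's output high", so it gravitates toward whatever inputs pump that number.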
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where "what if we had sampled during training?" is well-defined and fine. I was wondering if you viewed this as a general question we could ask.
I also agree that Ajeya's post addresses this "ambiguity" question, which is nice!
Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
I don't understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do. My reasoning here goes:
Yes, model-based approaches, model-free approaches (with or without critic), AIXI— all of these should be analyzed on their mechanistic details.
Thanks for this comment! I think it makes some sense (but would have been easier to read given meaningful variable names).
Bob's alignment strategy is that he wants X = X1 = Y = Y1 = Z = Z1. Also he wants the end result to be an agent whose good behaviours (Z) are in fact maximising a utility function at all (in this case, Z1).
I either don't understand the semantics of "=" here, or I disagree. Bob's strategy doesn't make sense because X and Z have type `behavior`, X1 and Z1 have type `utility function`, Y is some abstract reward function over some mathematical ...
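To illustrate why the chain of "=" reads as a type error to me, a toy sketch (the type names and stand-in class are made up):

```python
from typing import Callable, NewType

class WorldState: ...                               # hypothetical stand-in

Behavior = NewType("Behavior", str)                 # e.g. a policy's rollouts
UtilityFunction = Callable[[WorldState], float]     # scores outcomes
RewardFunction = Callable[[WorldState], float]      # emits training signal

# X, Z   : Behavior
# X1, Z1 : UtilityFunction
# Y      : some abstract RewardFunction
# "X = Z" at least compares two things of the same type, but "X = X1" or
# "Y = Z1" equates a behavior with a function, or a reward function with a
# utility function; without saying which correspondence is meant, the chain
# of "=" isn't even well-typed.
```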
Are there convergently-ordered developmental milestones for AI? I suspect there may be convergent orderings in which AI capabilities emerge. For example, it seems that LMs develop syntax before semantics, but maybe there's an even more detailed ordering relative to a fixed dataset. And in embodied tasks with spatial navigation and recurrent memory, there may be an order in which enduring spatial awareness emerges (i.e. "object permanence").
In A shot at the diamond-alignment problem, I wrote:
...[Consider] Let's Agree to Agree: Neural Networks Share Classificat
Quick summary of a major takeaway from Reward is not the optimization target:
Stop thinking about whether the reward is "representing what we want", or focusing overmuch on whether agents will "optimize the reward function." Instead, just consider how the reward and loss signals affect the AI via the gradient updates. How do the updates affect the AI's internal computations and decision-making?
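As a minimal illustration of that frame, here is a vanilla policy-gradient step in PyTorch (the toy network, observation, and reward value are made up):

```python
import torch

policy = torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 4)
)
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

obs = torch.randn(8)                    # hypothetical observation
dist = torch.distributions.Categorical(logits=policy(obs))
action = dist.sample()

reward = 1.0                            # scalar from the environment; the network never "sees" it
loss = -reward * dist.log_prob(action)  # reward only scales the update...
opt.zero_grad()
loss.backward()
opt.step()                              # ...which upweights whatever computations emitted `action`
```

The reward never shows up as an object the network perceives or reasons about; it's a multiplier on the gradient that strengthens or weakens whatever internal computations produced the sampled action.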
Thus, the heuristics generator can only begin as a generator of heuristics that serve R. (Even if it wouldn't start out perfectly pointed at R.)
We're apparently anchoring our expectations on "pointed at R", and then apparently allowing some "deviation." The anchoring seems inappropriate to me.
The network can learn to make decisions via an "IF `circle-detector fires`, THEN `upweight logits on move-right`" subshard. The network can then come to make decisions on the basis of round things, in a way which accords with the policy gradients generated ...
I bid for us to discuss a concrete example. Can you posit a training environment which matches what you're thinking about, relative to a given network architecture [e.g. LSTM]?
And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction — or, at least, that's how the greedy optimization process would attempt to build it
What is "achieving R" buying us? The agent internally represents a reward function, and then consults what the reward is in this scenario...
That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.
It seems to me like you're assuming a fix-point on updating. Something like "The network eventually will be invariant under reward-updates under all/the vast majority of training-sampled scenarios, and for a wide enough distribution on scenarios, this means optimizing reward directly."
This seems fine to me, under the given assumptions on SGD/evolution. Like, yes, there may exist certain populations of genetically-specified wrapper...
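Rough formalization of the fixed-point reading, in my own notation (not from the thread): training runs until the expected reward-driven update vanishes on the training distribution,

```latex
\mathbb{E}_{(s,a) \sim D_{\text{train}}}\!\left[ r(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right] \;\approx\; 0
\quad \text{at } \theta = \theta^{*},
```

and the further claim is that if $D_{\text{train}}$ covers a wide enough range of scenarios, satisfying this everywhere pushes toward computing and optimizing the reward directly.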
Steve and I talked more, and I think the perceived disagreement stemmed from unclear writing on my part. I recently updated Don't design agents which exploit adversarial inputs to clarify:
...
- ETA 12/26/22: When I write "grader optimization", I don't mean "optimization that includes a grader", I mean "the grader's output is the main/only quantity being optimized by the actor."
- Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I'm not a grader-optimizer relative to my internal `plan-is-fun?` grader...
Updated with an important terminological clarification:
- ETA 12/26/22: When I write "grader optimization", I don't mean "optimization that includes a grader", I mean "the grader's output is the main/only quantity being optimized by the actor."
- Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I'm not a grader-optimizer relative to my internal `plan-is-fun?` grader.
- However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.
I don't understand the analogy with humans. It sounds like you are saying "an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward" is analogous to "humans care about the details of their reward circuitry." But:
- I don't think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
(Emphasis added)
I don't think this engages with the substance of the analogy to humans. I don't think a...
If people train AI systems on random samples of deployment, then "reward" does make sense---it's just what would happen if you sampled this episode to train on.
I don't know what this means. Suppose we have an AI which "cares about reward" (as you think of it in this situation). The "episode" consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.
What is the "reward" for this situation? What would have happened if we "sampled" this episode during training?
It is therefore very plausible that RL systems would in fact continue to maximize the reward after training, even if what they're ultimately maximizing is just something highly correlated with it.
What do you mean by this? They would be instrumentally aligned with reward maximization, since reward is necessary for their terminal values?
Can you give an example of such a motivational structure, so I know we're considering the same thing?
...ML systems in general seem to be able to generalize to human-labeled categories in situations that aren't in the training da
Thanks for the story! I may comment more on it later.
You can also get around the "IDK the mechanism" issue if you observe variance over the relevant trait in the population you're selecting over. Like, if we/SGD could veridically select over "propensity to optimize approval OOD", then you don't have to know the mechanism. The variance shows you there are some mechanisms, such that variation in their settings leads to variation in the trait (e.g. approval OOD).
But the designers can't tell that. Can SGD tell that? (This question feels a bit confused, so please extra scan it before answering it)
From their...
I feel like I'm saying something relatively uncontroversial here, which is that if you select agents on the basis of doing well wrt X sufficiently hard enough, you should end up with agents that care about things like X. E.g. if you select agents on the basis of human approval, you should expect them to maximize human approval in situations even where human approval diverges from what the humans "would want if fully informed".
I actually want to controversy that. I'm now going to write quickly about selection arguments in alignment more generally (thi...
If the specific claim was closer to "yes, RL algorithms if ran until convergence have a good chance of 'caring' about reward, but in practice we'll never run RL algorithms that way", then I think this would be a much stronger objection.
This is part of the reasoning (and I endorse Steve's sibling comment, while disagreeing with his original one). I guess I was baking "convergence doesn't happen in practice" into my reasoning, that there is no force which compels agents to keep accepting policy gradients from the same policy-gradient-intensity-producing func...
This post posits that the WM will have a "similar format" throughout, but that heuristics/shards may not. For example, you point out that the WM has to be able to arbitrarily interlace and do subroutine calls (e.g. "will there be a dog near me soon" circuit presumably hooks into "object permanence: track spatial surroundings").
(I confess that I don't quite know what it would mean for shards to have "different encodings" from each other; would they just not have ways to do API calls on each other? Would they be written under eg different "internal programmi...
I think this post is very thoughtful, with admirable attempts at formalization and several interesting insights sprinkled throughout. I think you are addressing real questions, including:
That said, I think some of your answers and conclusions are off/wrong:
What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us.
See also: Inner and outer alignment decompose one hard problem into two extremely hard problems (in particular: Inner alignment seems anti-natural).
Updated mentions of "cognitive groove" to "circuit", since some readers found the former vague and unhelpful.
Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. J can detect this by simply detecting bad stuff.
I think it extremely probable that there exist policies which exploit adversarial inputs to J such that they can do bad stuff while getting J to say "all's fine."
As an implication of this, I could imagine that in most real-world settings "don't kill humans" would act as you describe, but in environments where it's very easy to accidentally kill humans, such that states where you don't kill humans are actually very rare, then the "don't kill humans" shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?
I think that "don't kill humans" can't chain into itself because there's not a real reason for its action-bids to systematically lead to future scenari...
On the other hand, there doesn't seem to be a principled difference between positive reinforcement and negative reinforcement. Like I would assume that the zero point wouldn't affect the trade-off between two actions as long as the difference was fixed.
This is only true for optimal policies, no? For learned policies, positive reward will upweight and generalize certain circuits (like "approach juice"), while negative reward will downweight and generally-discourage those same circuits. This can then lead to path-dependent differences in generalization (e.g....
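Concretely, for a vanilla policy-gradient learner the per-step update is proportional to (my gloss, not a quote):

```latex
\Delta\theta \;\propto\; r \,\nabla_\theta \log \pi_\theta(a \mid s),
```

so the sign of $r$ decides whether the circuits that produced $a$ get upweighted or downweighted. Shifting every reward by a constant leaves optimal policies unchanged, but it flips the sign of some updates along the learning trajectory, and thus which circuits get strengthened early on.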
Additionally, "loss function" is often used to also refer to the supervised labels in a dataset. EG I don't imagine proponents of "find an aligned loss function" to be imagining moving away from loss and towards KL. They're thinking about dataset labels for each point , and then given a prediction and a loss function , we can provide a loss signal which maps datapoints and predictions to loss: .
Nice update!
On the flip side, I expect we’ll see more discussion about which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them.
While I don't think of these as alignment targets per se (as I understand the term to be used), I strongly support discussing the internal language of the neural net and moving away from convoluted inner/outer schemes.
Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.
In what way do you think solving abstraction would solve shard theory?
Eliezer's reasoning is surprisingly weak here. It ...