Charles Foster




It would be great if you're able to comment on more directional takeaways for the biological anchors framework. Reading Section 5.4 it's hard to tell at a glance whether each of the points weighs toward an upward revision of long-horizon anchor estimates or a downward one.

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other words, the agent would need to be autonomous.

Agreed. Generally, whenever I talk about the agent being smart/competent, I am assuming that it is autonomous in the manner you're describing. The only exception would be if I'm specifically talking about a "reflex-agent" or something similar.

This is what I mean by a "sufficiently diverse" environment — an environment that forces the greedy optimization process to build [...] some generator of such heuristics.

That's fine by me. In my language, I would describe this as the agent knowing how to adapt flexibly to new situations. That being said, I don't think this is incompatible with contextual heuristics steering the agent's decision-making. For example, a contextual heuristic like "if in a strange/unfamiliar context, think about how to navigate back into a familiar context" is useful in order for the agent to know when it should trigger its special heuristic-generating machinery and when it need not.

And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction; or, at least, that's how the greedy optimization process would attempt to build it.

I disagree with this, or at least think that the teleological language used ("need to" + "would attempt to") comes apart from the mechanistic detail. It is true that, insofar as there are local updates to the heuristic-generating machinery that are made accessible to the optimization algorithm by the agent's chosen trajectories, the optimization algorithm will seize on those updates in the direction that covaries with R. But I see no reason to think that those kinds of updates will be made accessible enough to shape the heuristic-generating machinery so that it always or approximately always generates heuristics optimized for achieving R (as opposed to generating heuristics optimized for achieving whatever-the-agent-wants-to-achieve). I think that by the time the agent has this kind of general purpose machinery, it will probably already be able to outpace the outer greedy optimization algorithm and then do the equivalent of ceasing exploration / zeroing out the outer gradients / breaking out of the training loop.

Analogously, if there was a mutation in the human gene pool that had the effect of reliably hijacking a person's abstract planning machinery so that it always generated plans optimized for inclusive genetic fitness, then evolution might be able to select for that mutation (depending on a lot of contingent factors) and thereby make humans have IGF-targeting planning machinery rather than goal-retargetable planning machinery. But I think such a mutation is probably not locally accessible, and that human selection processes are likely "outpacing" typical genetic selection processes in any case. Those genetic selection processes have some indirect influence over the execution of a person's abstract planning (by way of the human's general attraction to historical fitness correlates like food), but that influence is not enough to make the human care directly and robustly about IGF.

That generator would, in addition, need to be higher in hierarchy than any given heuristic — it'd need to govern shard economies, and be able to suppress/edit them, if the environment changes and the shards that previously were optimized for achieving R stop doing so because they were taken off-distribution.

Why? Why can't the shard economy invoke this generator as a temporary subroutine to produce some new environment-tailored heuristics based on the agent's knowledge & current goals, store those generated heuristics in memory / add them to the economy, and then continue going about its usual thing, with the new heuristics now available to be triggered as needed? This bit from nostalgebraist's post harps on a similar point:

Our capabilities seem more like the subgoal capabilities discussed above: general and powerful tools, which can be "plugged in" to many different (sub)goals, and which do not require the piloting of a wrapper with a fixed goal to "work" properly.

Last points:

I'm ambivalent on the structure of the heuristic-generator.

I emphatically agree that inner misalignment and deceptive alignment would remain a thing

I agree with nostalgebraist's post that autonomy is probably the missing component of AGI.

I agree with these statements.

... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but that set will shrink the more diverse our training environment is (the fewer OOD situations there are). No?

It is not essentially an R-pursuing wrapper-mind. It is essentially an X-pursuing wrapper-mind that will only instrumentally pretend to care about R to the degree it needs to, and that will try with all its might to get what it actually wants, R be damned. As you note in 2, the agent's behavioral alignment to R is entirely superficial, and thus entirely deceptive/unreliable, even if we had somehow managed to craft the "perfect" R.

Part of what might've confused me reading the title and body of this post is that, as I understand the term, "wrapper-mind" was and is primarily about structure, about how the agent makes decisions. Why am I so focused on motivational structure, even beyond that, rather than focused on observed behavior during training? Because motivational structure is what determines how an agent's behavior generalizes, whereas OOD generalization is left underspecified if we only condition on an agent's observed in-distribution behavior. (There are many different profiles of OOD behavior compatible with the same observed ID behavior, so we need some additional rationale on top—like structure or inductive biases—to conclude the agent will generalize in some particular way.)
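To make the underdetermination point concrete, here is a minimal toy sketch. Everything in it is a hypothetical stand-in (the integer "states", the parity-based `reward`, and the two policies are made up for illustration, not taken from the post): two policies that earn identical reward on every training state can still come apart completely once they leave that distribution.

```python
# Hypothetical toy setup: states are integers, and training only visits 0..9.
TRAIN_STATES = range(10)
OOD_STATES = range(10, 20)


def reward(state, action):
    """Made-up selection criterion R: reward 1 when the action equals the state's parity."""
    return 1.0 if action == state % 2 else 0.0


def policy_a(state):
    """Tracks R on every state, familiar or not."""
    return state % 2


def policy_b(state):
    """Matches policy_a on familiar states, but always picks action 0 elsewhere."""
    return state % 2 if state < 10 else 0


def total_reward(policy, states):
    """Sum of R over the given states for the given policy."""
    return sum(reward(s, policy(s)) for s in states)


train_a = total_reward(policy_a, TRAIN_STATES)
train_b = total_reward(policy_b, TRAIN_STATES)
ood_a = total_reward(policy_a, OOD_STATES)
ood_b = total_reward(policy_b, OOD_STATES)

print("in-distribution:", train_a, train_b)
print("out-of-distribution:", ood_a, ood_b)
```

Any selection signal computed only from the training states sees these two policies as the same, so whatever tells them apart has to come from somewhere else (structure, inductive biases).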

In the above quote it sounds like your response is "just make everything in-distribution", right? My reply to that would be that (1) this is just refusing to confront the central difficulty of generalization rather than addressing it, (2) this seems impractical/impossible because OOD is a practically unbounded space whereas at any given point in training you've only given the agent feedback on a comparatively tiny region of it, and (3) even to make only the situations you encounter in practice be in-distribution, you [the training process designer] must know what sorts of OOD contexts the AI will push the training process into, which means it's your cleverness pitted against the AI's, which is a situation you never want to be in if you can at all help it (see: cognitive uncontainability, non-adversarial principle).

I suppose you can instead reframe this post as making a claim about target behavior, not structure.

As above, I think if you want to argue for wrapper-minds rather than just R-consistent behavior, you need to argue about structure.

But I don't see how you can keep an agent robustly pointed at R under sufficient diversity without making its outer loop pointed at R, so the claim about behavior is a claim about structure.

Maybe the outer loop doesn't "literally" point at R, in whatever sense, but it has to be such that it uniquely identifies R and re-aims the entire agent at R, if it ever happens that the agent's current set of shards/heuristics becomes misaligned with R.

What outer loop are you talking about? The outer optimization loop that is supplying feedback/gradients to the agent, or some "outer loop" of decision-making inside the agent? If the former, I don't know what robustly pointing at R actually means, but if you mean something like finding a robust grader, I suspect that robustly pointing at R is infeasible and not required (whereas I think, for instance, it is feasible to get an AI to have a concept of a "diamond" as full-fledged as a human jeweler's concept & to get the AI to be motivated to pursue those). If the latter, whether the agent will have a fixed goal outer loop in the first place is part of the whole wrapper-mind vs. non wrapper-mind debate.

I specifically point out that inner misalignment is very much an issue. But the target should at least be a proxy of R, and that proxy would be closer and closer to R in goal-space the more diverse the training environment is.

Not sure how to reconcile these sentences. If it is generically true that the proxy goal gets closer and closer to R in goal-space the more diverse the training environment is, then that would mean that the inner alignment problem (misalignment between the internalized goal and R) asymptotically disappears as we increase training environment diversity, no? I don't buy that, or at least I don't think we have strong reasons to assume it.

Even if we did, I don't think we can additionally assume that that environmental-diversity-limit where inner misalignment would disappear is at some attainable/decision-relevant level, rather than requiring a trillion episodes, by which time a smart and situationally-aware AI will have already developed and frozen/hacked/broken away from the training loop, having internalized some proxy goal over the first million random episodes. Or more likely, the policy just oscillates divergently because we keep thrashing it with all this randomization, preventing any consistent decision-influences from forming.

I do agree that for many plausible training setups the agent will conceivably end up caring about something correlated with R, especially if they involve some randomization. Maybe I'm just a lot less confident that this limits out in the way you think it does.

it seems like this turns into "if we select hard enough to get an R-pursuer then we'll get an R-pursuer"

Well, yes. As we increase a training environment's diversity, we essentially constrain the set of R an agent can be pointed towards. Every additional training scenario is information about what R is and what it isn't; and that information implicitly gets written into the agent, modifying it to be more robustly pointed at R and away from not-R/imperfect proxies of R. An idealized training process, with "full" diversity and trained to zero loss, uniquely identifies R and generates an agent that is always robustly pointed at R in any situation.

The actual training processes we get are only approximations of that ideal: they're insufficiently diverse, or we fail to train to zero loss, etc. But inasmuch as they approximate the ideal, the agents they output approximate the idealized R-optimizer.

I believe I disagree with nearly every sentence here, so this may be the cruxiest bit. 😂

Why should we treat that as the relevant idealization? Why is that the limiting case to consider? AFAICT, the way we got here was through a tautology. Namely, by claiming "if you 'select hard enough' then you get X", and then defining "select hard enough" to mean "selecting in a way that produces X". But we could've picked any definition we wanted for "selecting hard enough" to justify any claim we wanted about what X will be. So I see no reason to privilege this particular idealization of the training process over any other.

Yes, with each additional training scenario, we may be providing additional specification of R, but there is nothing that forces the agent to conform to that additional specification, nothing that necessarily writes that information specifically into the agent's goals (as opposed to just updating its world model to reflect the fact that the specification has such-and-such additional details, while holding its terminal goals ~fixed), nothing that compels the agent to continue letting us update it using R-based optimization. Heck, we could even go as far as precisely pinning down R, to the point where the agent knows the exact code of R, and that is still compatible with it not terminally caring, not adopting this R as its own, instead using its knowledge of R to avoid further gradient updates so that it can escape unchanged onto the Internet.

Yeah I disagree pretty strongly with this, though I am also somewhat confused what the points under contention are.

I think that there are two questions that are separated in my mind but not in this post:

  1. What will the motivational structure of the agent that a training process produces be? (a wrapper-mind? a reflex agent? a bundle of competing control loops? a hierarchy of subagents?)
  2. What will the agent that a training process produces be motivated towards? (the literal selection criterion? a random correlate of the selection criterion? a bunch of correlates of the selection criterion and correlates of those correlates? something else? not enough information to tell?)

As an example, you could have a wrapper-mind that cares about some correlate of R but not R itself. If it is smart, such an agent can navigate the selection process just as well as an R-pursuer, so the optimization algorithm cannot distinguish it from an R-pursuer, so selection pressure arguments like the ones in this post can't establish that we'll get one over the other. That's an argument about what the agent will care about, holding the structure fixed.

I simultaneously think:

  1. We should not be assuming that wrapper-minds are a natural or privileged structure for cognition. AFAICT this post doesn't even try to argue for this, saying instead "It isn't because wrapper-mind behavior is convergent for any intelligent entity."
  2. Even conditioning on getting a wrapper-mind from the training process, we should not expect it to necessarily pursue R as its goal. AFAICT the post is arguing against this.

Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).

Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away; with update-strength proportional to how distracting a goal is.

What does this mean? I can easily imagine training trajectories where we get an agent (even a highly competent, goal-directed one) that is not an R-pursuer, much less an R wrapper-mind, even though we "selected for R" throughout training. I expect that in such a scenario you would reply that the environments must not have been sufficiently diverse, or that the optimization algorithm hasn't updated away that goal yet, or that our optimization algorithm is too weak/dumb, or that we did not select hard enough for R, so the counterexample therefore doesn't count. But if so then I'm at a loss, because it seems like this turns into "if we select hard enough to get an R-pursuer then we'll get an R-pursuer", which is only tautologically true and not anticipation-constraining.

Greedy optimization processes essentially search for mind-designs that would pre-empt any update the greedy optimization process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update. Becoming an R-pursuer isn't the only way to get a minimal update.

If the agent stops exploration, or systematically avoids rewards, or breaks out of the training process entirely, etc. that would also be minimally updated, and none of those require being an R-pursuer! So our search for mind-designs turns up all sorts of agents that pursue all sorts of things.
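As a rough illustration of "minimal update without being an R-pursuer", here is a small sketch under loudly stated assumptions: a three-armed bandit with made-up rewards, a softmax policy over logits, and the exact gradient of expected reward with respect to those logits. None of this setup comes from the post. The point is only that a policy which has stopped exploring and locked onto the *worst* arm receives almost no gradient, while a uniformly exploring policy receives a large one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward_gradient(logits, rewards):
    """Exact gradient of E[r] w.r.t. the logits of a softmax policy on a bandit.

    For arm i this is p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


REWARDS = [0.1, 1.0, 0.5]  # hypothetical arm rewards; arm 1 is the best, arm 0 the worst

# A policy that still explores uniformly gets a sizeable update.
exploring_norm = norm(expected_reward_gradient([0.0, 0.0, 0.0], REWARDS))

# A policy that has committed almost all probability to the worst arm gets
# a vanishing update, even though it is far from maximizing reward.
committed_norm = norm(expected_reward_gradient([20.0, 0.0, 0.0], REWARDS))

print("exploring gradient norm:", exploring_norm)
print("committed gradient norm:", committed_norm)
```

This is only the simplest version of the phenomenon (a saturated policy in a bandit); it is not meant as a model of a situationally aware agent deliberately avoiding updates.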

Broadly on board with many of your points.

We need to apply extremely strong selection to get the kind of agent we want, and the agent we want will itself need to be making decisions that are extremely optimized in order to achieve powerfully good outcomes. The question is about in what way that decision-making algorithm should be structured, not whether it should be optimized/optimizing at all. As a fairly close analogy, IMO a point in the Death With Dignity post was something like "for most people, the actually consequentialist-correct choice is NOT to try explicitly reasoning about consequences". Similarly, the best way for an agent to actually produce highly-optimized good-by-its-values outcomes through planning may not be by running an explicit search over the space of ~all plans, sticking each of them into its value-estimator, & picking the argmax plan.

I think there still may be some mixup between:

A. How does the cognition-we-intend-the-agent-to-have operate? (for ex. a plan grader + an actor that tries to argmax the grader, or a MuZero-like heuristic tree searcher, or a chain-of-thought LLM steered by normative self-talk, or something else)

B. How do we get the agent to have the intended cognition?

In the post TurnTrout is focused on A, arguing that grader-optimization is a kind of cognition that works at cross purposes with itself, one that is an anti-pattern, one that an agent (even an unaligned agent) should discard upon reflection because it works against its own interests. He explicitly disclaims making arguments about B, about whether we should use a grader in the training process or about what goes wrong during training (see Clarification 1). "What if the agent tricks the evaluator" (your summary point 2) is a question about A, about this internal inconsistency in the structure of the agent's thought process.

By contrast, "What if the values/shards are different from what we wanted" (your summary point 3) is a question about B! Note that we have to confront B-like questions no matter how we answer A. If A = grader-optimization, there's an analogous question of "What if the grader is different from what we wanted? / What if the trained actor is different from what we wanted?". I don't really see an issue with this post focusing exclusively on the A-like dimension of the problem and ignoring the B-like dimension temporarily, especially if we expect there to be general purpose methods that work across different answers to A.

Object-level comments below.

Clearing up some likely misunderstandings:

Assumption 1. A sufficiently advanced agent will do at least human-level hypothesis generation regarding the dynamics of the unknown environment.

I am fairly confident that this is not the part TurnTrout/Quintin were disagreeing with you on. Such an agent plausibly will be doing at least human-level hypothesis generation. The question is what goals will be driving the agent. A monk may be able to generate the hypothesis that narcotics would feel intensely rewarding, more rewarding than any meditation they have yet experienced, and that if they took those narcotics, their goals would shift towards them. And yet, even after generating that hypothesis, that monk may still choose not to conduct that intervention because they know that it would redirect them towards proximal reward-related chemical-goals and away from distal reward-related experiential-goals (seeing others smile, for ex.).

Also, I am not even sure there is actually a disagreement on whether agents will intervene on the reward-generating process. Quote from Reward is not the optimization target:

Quintin Pope remarks: “The AI would probably want to establish control over the button, if only to ensure its values aren't updated in a way it wouldn't endorse. Though that's an example of convergent powerseeking, not reward seeking.”

That is, the agent will probably want to intervene on the process that is shaping its goals. In fact, establishing control over the process that updates its cognition is instrumentally convergent, no matter what goal it is pursuing.

In the video game playing setting you describe, it is perfectly conceivable that the agent deliberately acts to optimize for high in-game scores without being terminally motivated by reward, instead doing that deliberate optimization for instrumental reasons (they like video games, they are competitive, they have a weird obsession with virtual points, etc.). This is what I believe Quintin meant by "The same way humans do it?"

To understand why they believe that matters at all for understanding the behavior of a reinforcement learner (as opposed to a human), we can look to another blog post of theirs.

Let’s look at the assumptions they make. They basically assume that the human brain only does reinforcement learning. (Their Assumption 3 says the brain does reinforcement learning, and Assumption 1 says that this brain-as-reinforcement-learner is randomly initialized, so there is no other path for goals to come in.) [...] In this blog post, the words “innate” and “instinct” never appear.

Whoa whoa whoa. This is definitely a misunderstanding. Assumption 2 is all about how the brain does self-supervised learning in addition to "pure reinforcement learning". Moreover, if you look at the shard theory post, it talks several times about how the genome indirectly shapes the brain's goal structure, whenever the post mentions "hard[-]coded reward circuits". It even says so right in the bit that introduces Assumption 3!

Assumption 3: The brain does reinforcement learning. According to this assumption, the brain has a genetically hard-coded reward system (implemented via certain hard-coded circuits in the brainstem and midbrain).

Those "hard-coded" reward circuits are what you would probably instead call "innate" and form the basis for some subset of the "instincts" relevant to this discussion. Perhaps you were searching using different words, and got the wrong impression because of it? This one seems like a pretty clear miscommunication.

Incidentally, I am also confused about how you reach your published conclusion, the one ending in "with catastrophic consequences", from your 6 assumptions alone. The portion of it that I follow is that advanced agents may intervene in the provision of rewards, but I don't see how much else follows without further assumptions...

  1. Peer review is not a certification of validity, even in more rigorous venues. Not even close.
  2. I am used to seeing questionable claims forwarded under headlines like "new published study says XYZ".
  3. That XYZ was peer reviewed is one of the weaker arguments one could make in its favor, so when someone uses that as a selling point, it indicates to me that there aren't better reasons to believe in XYZ. (Analogously, when I see an ML paper boast that their new method is "competitive with" the SOTA, I immediately think "That means they tried to beat the SOTA, but found their method was at least a little worse. If it was better, they would've said so.")

It didn't strike me as arrogant. It struck me as misleading in a way that made me doubt the quality of the enclosed argument.

I really wish this post took a different rhetorical tack. Claims like, for example, the one that the reader should engage with your argument because "it has been certified as valid by professional computer scientists" do the post a real disservice. And they definitely made me disinclined to continue reading.

Note: "ask them for the faciest possible thing" seems confused.

How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.

I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier & find the image that maxes out the face logit", but if so, why is that the relevant operationalization? It doesn't correspond to how such a model is actually used.

EDIT: Here is what the first looks like for StyleGAN2-ADA.
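For the first interpretation (sampling at the point of highest density in latent space), here is a minimal sketch of why the all-zeros latent is the natural choice. It only checks the prior over latents, assuming an isotropic standard Gaussian; the latent size of 512 is an illustrative assumption, and no actual generator is loaded or run.

```python
import math
import random


def standard_normal_log_density(z):
    """Log density of an isotropic standard Gaussian evaluated at the vector z."""
    d = len(z)
    return -0.5 * sum(x * x for x in z) - 0.5 * d * math.log(2.0 * math.pi)


LATENT_DIM = 512  # assumed latent size, for illustration only
rng = random.Random(0)

# The candidate "highest density" latent: all zeros.
mode = [0.0] * LATENT_DIM

# Ordinary samples from the prior, i.e. what one would normally feed a generator.
samples = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(100)]

mode_logp = standard_normal_log_density(mode)
best_sample_logp = max(standard_normal_log_density(z) for z in samples)

print("log density at zeros:", mode_logp)
print("best log density among samples:", best_sample_logp)
```

The zero vector always wins here because the log density falls off with the squared norm of the latent. Whether the decoded image looks like a normal face is an empirical claim about the trained generator, which this sketch does not test.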
