Abram Demski




I think that's not true. The point where you deal with wireheading probably isn't what you reward so much as when you reward. If the agent doesn't even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing. 

I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that it's closer in practice to "all the hypotheses are around at the beginning" -- it doesn't matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don't change that much by introducing it at different stages in training.

Plausibly this is true of some training setups and not others; EG, more true for LLMs and less true for RL.
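To make the order-of-evidence claim a little more concrete, here is a toy sketch. It is not an NTK calculation: it assumes a two-parameter linear model, squared loss, and two made-up data points (all names and numbers are hypothetical). It only checks how much the final weights differ when the same two examples are presented in opposite orders. In this setting the gap is second-order in the learning rate, so it shrinks about 100x when the step size shrinks 10x.

```python
# Toy check: for small steps on a linear model with squared loss,
# the order in which two examples are presented barely matters.
# All data below is hypothetical, chosen only for illustration.

def sgd_step(w, x, y, lr):
    """One SGD step on the squared error 0.5 * (w.x - y)^2."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

def order_gap(lr):
    """Distance between the weights reached via orders (A, B) and (B, A)."""
    w0 = [0.0, 0.0]
    a = ([1.0, 2.0], 1.0)     # hypothetical data point A: (inputs, target)
    b = ([2.0, -0.5], -1.0)   # hypothetical data point B: (inputs, target)
    w_ab = sgd_step(sgd_step(w0, *a, lr), *b, lr)
    w_ba = sgd_step(sgd_step(w0, *b, lr), *a, lr)
    return sum((p - q) ** 2 for p, q in zip(w_ab, w_ba)) ** 0.5

gap_large_step = order_gap(0.1)
gap_small_step = order_gap(0.01)
print(gap_large_step, gap_small_step, gap_large_step / gap_small_step)
```

Whether anything like this carries over to deep RL training is exactly the open question; the sketch only shows what "order matters little when the loss is locally close to linear" looks like in the simplest case.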

Let's set aside the question of whether it's true, though, and consider the point you're making.

This isn't a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models).

So I understand one of your major points to be: thinking about training as the chisel which shapes the policy doesn't necessitate thinking in terms of incentives (ie gradients pushing in particular directions). The ultimate influence of a gradient isn't necessarily the thing it immediately pushes for/against.

I tentatively disagree based on the point I made earlier; plausibly the influence of a gradient step is almost exclusively its immediate influence.

But I don't disagree in principle with the line of investigation. Plausibly it is pretty important to understand this kind of evidence-ordering dependence. Plausibly, failure modes in value learning can be avoided by locking in specific things early, before the system is "sophisticated enough" to be doing training-process-simulation. 

I'm having some difficulty imagining powerful conceptual tools along those lines, as opposed to some relatively simple stuff that's not that useful. 

And one reason is that I don't think that RL agents are managing motivationally-relevant hypotheses about "predicting reinforcements." Possibly that's a major disagreement point? (I know you noted its fuzziness, so maybe you're already sympathetic to responses like the one I just gave?)

I'm confused about what you mean here. My best interpretation is that you don't think current RL systems are modeling the causal process whereby they get reward. On my understanding, this does not closely relate to the question of whether our understanding of training should focus on the first-order effects of gradient updates or should also admit higher-order, longer-term effects.

Maybe on your understanding, the actual reason why current RL systems don't wirehead too much, is because of training order effects? I would be surprised to come around on this point. I don't see it.

I expect this argument to not hold, 

Seems like the most significant remaining disagreement (perhaps).

1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the "distance covered" to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)

So I am interpreting this argument as: even if LTH implies that a nascent/potential hypothesis is training-process-modeling (in an NTK & LTH sense), you expect the gradient to go against it (favoring non-training-modeling hypotheses) because non-training-process-modelers are simpler.

This is a crux for me; if we had a simplicity metric that we had good reason to believe filtered out training-process-modeling, I would see the deceptive-inner-optimizer concern as basically solved (modulo the solution being compatible with other things we want).[1]

  • I think Solomonoff-style program simplicity probably doesn't do it; the simplest program fitting with a bunch of data from our universe quite plausibly models our universe. 
  • I think circuit-simplicity doesn't do it; simple circuits which perform complex tasks are still more like algorithms than lookup tables, ie, still try to model the world in a pretty deep way. 
  • I think Vanessa has some interesting ideas on how infra-Bayesian physicalism might help deal with inner optimizers, but on my limited understanding, I think not by ruling out training-process-modeling.

In other words, it seems to me like a tough argument to make, which on my understanding, no one has been able to make so far, despite trying; but, not an obviously wrong direction.

2. You're always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size." 

I don't really see your argument here? How does (identifiability issues → (argument is wrong ∨ training-process-optimization is unavoidable ∨ we can somehow make it not apply to networks of AGI size))?

In my personal estimation, shaping NNs in the right way is going to require loss functions which open up the black box of the NN, rather than only looking at outputs. In principle this could eliminate identifiability problems entirely (eg "here is the one correct network"), although I do not fully expect that.

A 'good prior' would also solve the identifiability problem well enough. (eg, if we could be confident that a prior promotes non-deceptive hypotheses over similar deceptive hypotheses.)

But, none of this is necessarily interfacing with your intended argument.

3. Even if the agent is motivated both by the training process and by the object-level desired hypothesis (since the gradients would reinforce both directions), on shard theory, that's OK, an agent can value several things. The important part is that the desired shards cast shadows into both the agent's immediate behavior and its reflectively stable (implicit) utility function.

Here's how I think of this part. A naïve EU-maximizing agent, uncertain between two hypotheses about what's valuable, might easily decide to throw one under the bus for the other. Wireheading is analogous to a utility monster here -- something that the agent is, on balance, justified to throw approximately all its resources at, basically neglecting everything else.

A bargaining-based agent, on the other hand, can "value several things" in a more significant sense. Simple example: 

  • U1 and U2 are almost equally probable hypotheses about what to value.
  • EU maximization maximizes whichever happens to be slightly more probable.
  • Nash bargaining selects a 50-50 split between the two, instead, flipping a coin to fairly divide outcomes.

In order to mitigate risks due to bad hypotheses, we want more "bargaining-like" behavior, rather than "EU-like" behavior.
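The simple example above can be sketched numerically. This is a minimal illustration under assumptions that are mine, not part of the original example: there are two outcomes, each hypothesis assigns utility 1 to its own favored outcome and 0 to the other, the credences are 0.51 and 0.49, the disagreement point is (0, 0), and the policy is a lottery that picks the first outcome with probability p.

```python
# Compare EU maximization with (unweighted) Nash bargaining between two
# value hypotheses. Utilities, credences, and the disagreement point are
# hypothetical numbers chosen for illustration.

def expected_utility(p, credences):
    """Credence-weighted utility of a lottery putting probability p on outcome 1."""
    c1, c2 = credences
    u1, u2 = p, 1.0 - p        # hypothesis 1 likes outcome 1, hypothesis 2 likes outcome 2
    return c1 * u1 + c2 * u2

def nash_product(p, disagreement=(0.0, 0.0)):
    """Product of each hypothesis's gain over the disagreement point."""
    d1, d2 = disagreement
    return (p - d1) * ((1.0 - p) - d2)

grid = [i / 1000 for i in range(1001)]   # candidate lottery probabilities
credences = (0.51, 0.49)

eu_choice = max(grid, key=lambda p: expected_utility(p, credences))
nash_choice = max(grid, key=nash_product)

print("EU maximizer puts probability", eu_choice, "on outcome 1")
print("Nash bargaining puts probability", nash_choice, "on outcome 1")
```

Under these assumptions the EU maximizer goes all-in on the slightly more probable hypothesis, while the bargaining solution lands on the coin flip.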

I buy that bargaining-like behavior fits better flavor-wise with shard theory, but I don't currently buy that an implication of shard theory is that deep-NN RL will display bargaining-like behavior by default, if that's part of your intended implication?

  [1] We were discussing wireheading, not inner optimization, but a wireheading agent that hides this in order to do a treacherous turn later is a deceptive inner optimizer. I'm not going to defend the inner/outer distinction here; "is wireheading an inner alignment problem, or an outer alignment problem?" is a problematic question. 

My main complaint with this, as I understand it, is that builder/breaker encourages you to repeatedly condition on speculative dangers until you're exploring a tiny and contorted part of solution-space (like worst-case robustness hopes, in my opinion). And then you can be totally out-of-touch from the reality of the problem.

On my understanding, the thing to do is something like heuristic search, where "expanding a node" means examining that possibility in more detail. The builder/breaker scheme helps to map out heuristic guesses about the value of different segments of the territory, and refine them to the point of certainty.

So when you say "encourages you to repeatedly condition on speculative dangers until you're exploring a tiny and contorted part of solution-space", my first thought is that you missed the part where Builder can respond to Breaker's purported counterexamples with arguments such as the ones you suggest:

I currently conjecture that [...]

Does this argument fail? Maybe, yeah! Should I keep that in mind? Yes! But that doesn't necessarily mean I should come up with an extremely complicated scheme to make feedback-modeling be suboptimal.

But, perhaps more plausibly, you didn't miss that point, and are instead pointing to a bias you see in the reasoning process, a tendency to over-weigh counterexamples as if they were knockdown arguments, and forget to do the heuristic search thing where you go back and expand previously-unpromising-seeming nodes if you seem to be getting stuck in other places in the tree.

I'm tempted to conjecture that you should debug this as a flaw in how I apply builder/breaker style reasoning, as opposed to the reasoning scheme itself -- why should builder/breaker be biased in this way?

You seem to address a related point:

One might therefore protest: "Worst-case reasoning is not suitable for deconfusion work! We need a solid understanding of what's going on, before we can do robust engineering."

However, it's also possible to use Builder/Breaker in non-worst-case (ie, probabilistic) reasoning. It's just a matter of what kind of conclusion Builder tries to argue. If Builder argues a probabilistic conclusion, Builder will have to make probabilistic arguments.

But then you later say: 

Point out implausible assumptions via plausible counterexamples.

  • In this case we ask: does the plausibility of the counterexample force the assumption to be less probable than we'd like our precious few assumptions to be?

But if we're still in the process of deconfusing the problem, this seems to conflate the two roles. If game day were tomorrow and we had to propose a specific scheme, then we should indeed tally the probabilities. 

I admit that I do not yet understand your critique at all -- what is being conflated?

Here is how I see it, in some detail, in the hopes that I might explicitly write down the mistaken reasoning step which you object to, in the world where there is such a step.

  1. We have our current beliefs, and we can also refine those beliefs over time through observation and argument.
  2. Sometimes it is appropriate to "go with your gut", choosing the highest-expectation plan based on your current guesses. Sometimes it is appropriate to wait until you have a very well-argued plan, with very well-argued probabilities, which you don't expect to easily move with a few observations or arguments. Sometimes something in the middle is appropriate.
  3. AI safety is in the "be highly rigorous" category. This is mostly because we can easily imagine failure being so extreme that humanity in fact only gets one shot at this.
  4. When the final goal is to put together such an argument, it makes a lot of sense to have a sub-process which illustrates holes in your reasoning by pointing out counterexamples. It makes a lot of sense to keep a (growing) list of counterexample types.
  5. It being virtually impossible to achieve certainty that we'll avert catastrophe, our arguments will necessarily include probabilistic assumptions and probabilistic arguments.
  6. #5 does not imply, or excuse, heuristic informality in the final arguments; we want the final arguments to be well-specified, so that we know precisely what we have to assume and precisely what we get out of it. 
  7. #5 does, however, mean that we have an interest in plausible counterexamples, not just absolute worst-case reasoning. If I say (as Builder) "one of the coin-flips will come out heads", as part of an informal-but-working-towards-formality argument, and Breaker says "counterexample, they all come out tails", then the right thing to do is to assess the probability. If we're flipping 10 coins, maybe Breaker's counterexample is common enough to be unacceptably worrying, damning the specific proposal Builder was working on. If we're flipping billions of coins, maybe Breaker's counterexample is not probable enough to be worrying.
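The coin-flip comparison in point 7 is easy to put numbers on. A small sketch, assuming fair and independent coins (the function names are hypothetical): Breaker's "all tails" counterexample has probability 2^-n, which is computed directly for small n and in log10 form for large n to avoid floating-point underflow.

```python
import math

def p_all_tails(n_coins):
    """Probability that n fair, independent coins all come up tails."""
    return 0.5 ** n_coins

def log10_p_all_tails(n_coins):
    """Same probability in log10 form, usable when n is huge."""
    return -n_coins * math.log10(2)

ten_coins = p_all_tails(10)                  # Breaker's counterexample with 10 coins
billion_coins_log10 = log10_p_all_tails(10**9)

print("10 coins:", ten_coins)
print("10^9 coins: about 10 **", billion_coins_log10)
```

Whether roughly one in a thousand counts as "unacceptably worrying" depends on how much of the probability budget the argument can afford to spend on this one assumption; the arithmetic only supplies the number being judged.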

This is the meaning of my comment about pointing out insufficiently plausible assumptions via plausible counterexamples, which you quote after "But then you later say:", and of which you state that I seem to conflate two roles.

But if we're assessing the promise of a given approach for which we can gather more information, then we don't have to assume our current uncertainty. Like with the above, I think we can do empirical work today to substantially narrow the uncertainty on that kind of question.[1] That is, if our current uncertainty is large and reducible (like in my diamond-alignment story), breaker might push me to prematurely and inappropriately condition on not-that-proposal and start exploring maybe-weird, maybe-doomed parts of the solution space as I contort myself around the counterarguments.

I guess maybe your whole point is that the builder/breaker game focuses on constructing arguments, while in fact we can resolve some of our uncertainty through empirical means. 

On my understanding, if Breaker uncovers an assumption which can be empirically tested, Builder's next move in the game can be to go test that thing. 

However, I admit to having a bias against empirical stuff like that, because I don't especially see how to generalize observations made today to the highly capable systems of the future with high confidence.

WRT your example, I intuit that perhaps our disagreement has to do with ...

I currently conjecture that an initialization from IID self-supervised- and imitation-learning data will not be modelling its own training process in detail, 

I think it's pretty sane to conjecture this for smaller-scale networks, but at some point as the NN gets large enough, the random subnetworks already instantiate the undesired hypothesis (along with the desired one), so they must be differentiated via learning (ie, "incentives", ie, gradients which actually specifically point in the desired direction and away from the undesired direction).

I think this is a pretty general pattern -- like, a lot of your beliefs fit with a picture where there's a continuous (and relatively homogeneous) blob in mind-space connecting humans, current ML, and future highly capable systems. A lot of my caution stems from being unwilling to assume this, and skeptical that we can resolve the uncertainty there by empirical means. It's hard to empirically figure out whether the landscape looks similar or very different over the next hill, by only checking things on this side of the hill. 

  [1] Ideally, nothing at all; ie, don't create powerful AGI, if that's an option. This is usually the correct answer in similar cases. EG, if you (with no training in bridge design) have to deliver a bridge design that won't fall over, drawing up blueprints in one day's time, your best option is probably to not deliver any design. But of course we can arrange the thought-experiment such that it's not an option.

The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think that it generally seems like a good idea to have solid theories of two different things:

  1. What is the thing we are hoping to teach the AI?
  2. What is the training story by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order to reinforce the desired shards, it seems helpful to have some idea of what those look like.

For example, if we avoid fancy philosophical frameworks, we might think a good way to avoid wireheading is to introduce negative examples where the AI manipulates circuitry to boost reinforcement signals, and positive examples where the AI doesn't do that when given the opportunity. After doing some philosophy where you try to positively specify what you're trying to train, it's easier to notice that this sort of training still leaves the human-manipulation failure mode open.

After doing this kind of philosophy for a while, it's intuitive to form the more general prediction that if you haven't been able to write down a formal model of the kind of thing you're trying to teach, there are probably easy failure modes like this which your training hasn't attempted to rule out at all.

I said: 

The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans.
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.

You said: 

In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people).

Thinking about this now, I think maybe it's a question of precautions, and what order you want to teach things in. Very similarly to the argument that you might want to make a system corrigible first, before ensuring that it has other good properties -- because if you make a mistake, later, a corrigible system will let you correct the mistake.

Similarly, it seems like a sensible early goal could be 'get the system to understand that the sort of thing it is trying to do, in (value) learning, is to pick up human values'. Because once it has understood this point correctly, it is harder for things to go wrong later on, and the system may even be able to do much of the heavy lifting for you.

Really, what makes me go to the meta-level like this is pessimism about the more direct approach. Directly trying to instill human values, rather than first training in a meta-level understanding of that task, doesn't seem like a very correctible approach. (I think much of this pessimism comes from mentally visualizing humans arguing about what object-level values to try to teach an AI. Even if the humans are able to agree, I do not feel especially optimistic about their choices, even if they're supposedly informed by neuroscience and not just moral philosophers.)

If you commit to the specific view of outer/inner alignment, then now you also want your loss function to "represent" that goal in some way.

I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases -- or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent.

This is because writing down the desired inductive biases as an explicit prior can help us to understand what's going on better. 

It's tempting to say that to understand how the brain learns, is to understand how it treats feedback as evidence, and updates on that evidence. Of course, there could certainly be other theoretical frames which are more productive. But at a deep level, if the learning works, the learning works because the feedback is evidence about the thing we want to learn, and the process which updates on that feedback embodies (something like) a good prior telling us how to update on that evidence.

And if that framing is wrong somehow, it seems intuitive to me that the problem should be describable within that ontology. Compare how I think "utility function" is not a very good way to think about values, because it is unclear what it is a function of: we don't have a commitment to a specific low-level description of the universe which is appropriate for the input to a utility function. We can easily move beyond this by considering expected values as the "values/preferences" representation, without worrying about what underlying utility function generates those expected values.

(I do not take the above to be a knockdown argument against "committing to the specific division between outer and inner alignment steers you wrong" -- I'm just saying things that seem true to me and plausibly relevant to the debate.)

I doubt this due to learning from scratch.

I expect you'll say I'm missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is "figuring out the human prior", because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that; but I'm fine with that. I am not dogmatic on orthodox Bayesianism. I do not even like utility functions.

Insofar as the question makes sense, its answer probably takes the form of inductive biases: I might learn to predict the world via self-supervised learning and form concepts around other people having values and emotional states due to that being a simple convergent abstraction relatively pinned down by my training process, architecture, and data over my life, also reusing my self-modelling abstractions.

I am totally fine with saying "inductive biases" instead of "prior"; I think it indeed pins down what I meant in a more accurate way (by virtue of, in itself, being a more vague and imprecise concept than "prior").

I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent's own cognition. I don't think you need anything mysterious for the latter. I'm confident that RLHF, done skillfully, does the job just fine. The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think I don't understand what you mean by (2), and as a consequence, don't understand the rest of this paragraph?

WRT (1), I don't think I was being careful about the distinction in this post, but I do think the following:

The problem of wireheading is certainly not that RL agents are trying to take control of their reward feedback by definition; I agree with your complaint about Daniel Dewey as quoted. It's a false explanation of why wireheading is a concern.

The problem of wireheading is, rather, that none of the feedback the system gets can disincentivize (ie, provide differentially more loss for) models which are making this mistake. To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models (assuming similar competence levels in both, of course, which I admit is a bit fuzzy).

This doesn't seem relevant for non-AIXI RL agents which don't end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?

With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here, wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach.

Output-based evaluation (including supervised learning, and the most popular forms of unsupervised learning, and a lot of other stuff which treats models as black boxes implementing some input/output behavior or probability distribution or similar) can't distinguish between a model which is internalizing the desired concepts, vs a model which is instead modeling the actual feedback process. These two do different things, but not in a way that the feedback system can differentiate.
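Here is a toy sketch of that indistinguishability. Everything in it is a hypothetical construction for illustration: a world described by two booleans (was a diamond made, was the sensor tampered with), an "intended" model that tracks diamonds, and a "feedback-modeling" model that tracks whatever the feedback process would actually output. The loss looks only at outputs and compares them to the feedback signal.

```python
# Hypothetical toy world: each situation is (diamond_made, sensor_tampered).

def intended_value(diamond_made, sensor_tampered):
    """The concept we hope the model internalizes: diamonds are good."""
    return 1.0 if diamond_made else 0.0

def feedback_process(diamond_made, sensor_tampered):
    """The process that actually generates feedback; tampering reads as success."""
    return 1.0 if (diamond_made or sensor_tampered) else 0.0

def loss(model, situations):
    """Mean squared error of a model's outputs against the actual feedback."""
    errors = [(model(*s) - feedback_process(*s)) ** 2 for s in situations]
    return sum(errors) / len(errors)

clean = [(True, False), (False, False)]       # no tampering ever occurs
with_tampering = clean + [(False, True)]      # one tampering episode included

for name, data in [("clean", clean), ("with tampering", with_tampering)]:
    print(name, loss(intended_value, data), loss(feedback_process, data))
```

On the clean data both models get identical loss, so the feedback cannot tell them apart. Once a tampering episode appears, it is the intended model that gets penalized, since a model of the feedback process itself does at least as well by construction.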

In terms of shard theory, as I understand it, the point is that (absent arguments to the contrary, which is what we want to be able to construct), shards that implement feedback-modeling like this cannot be disincentivized by the feedback process, since they perform very well in those terms. Shards which do other things may or may not be disincentivized, but the feedback-modeling shards (if any are formed at any point) definitely won't, unless of course they're just not very good at their jobs. 

So the problem, then, is: how do we arrange training such that those shards have very little influence, in the end? How do we disincentivize that kind of reasoning at all? 

Plausibly, this should only be tackled as a knock-on effect of the real problem, actually giving good feedback which points in the right direction; however, it remains a powerful counterexample class which challenges many many proposals. (And therefore, trying to generate the analogue of the wireheading problem for a given proposal seems like a good sanity check.)

I'm a bit uncomfortable with the "extreme adversarial threats aren't credible; players are only considering them because they know you'll capitulate" line of reasoning because it is a very updateful line of reasoning. It makes perfect sense for UDT and functional decision theory to reason in this way. 

I find the chicken example somewhat compelling, but I can also easily give the "UDT / FDT retort": since agents are free to choose their policy however they like, one of their options should absolutely be to just go straight. And arguably, the agent should choose that, conditional on bargaining breaking down (precisely because this choice maximizes the utility obtained in fact -- ie, the only sort of reasoning which moves UDT/FDT). Therefore, the coco line of reasoning isn't relying on an absurd hypothetical. 

Another argument for this perspective: if we set the disagreement point via Nash equilibrium, then the agents have an extra incentive to change their preferences before bargaining, so that the Nash equilibrium is closer to the optimal disagreement point (IE the competition point from coco). This isn't a very strong argument, however, because (as far as I know) the whole scheme doesn't incentivize honest reporting in any case. So agents may be incentivized to modify their preferences one way or another. 

Reflect Reality?

One simple idea: the disagreement point should reflect whatever really happens when bargaining breaks down. This helps ensure that players are happy to use the coco equilibrium instead of something else, in cases where "something else" implies the breakdown of negotiations. (Because the coco point is always a pareto-improvement over the disagreement point, if possible -- so choosing a realistic disagreement point helps ensure that the coco point is realistically an improvement over alternatives.)
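To illustrate why the choice of disagreement point matters, here is a minimal sketch under an assumption the text does not spell out: utility is transferable and the two players split whatever surplus agreement creates over the disagreement point equally. The function name and the numbers are hypothetical.

```python
def split_surplus(total, disagreement):
    """Equal split of the surplus over a disagreement point (transferable utility).

    total:        best joint payoff achievable by cooperating
    disagreement: (d1, d2), what each player gets if bargaining breaks down
    """
    d1, d2 = disagreement
    surplus = total - (d1 + d2)
    if surplus < 0:
        # Cooperation cannot beat the fallback, so players stay at it.
        return (d1, d2)
    return (d1 + surplus / 2, d2 + surplus / 2)

# Same cooperative total, different beliefs about what breakdown looks like.
symmetric = split_surplus(10.0, (2.0, 2.0))
lopsided = split_surplus(10.0, (6.0, 0.0))
print(symmetric, lopsided)
```

Each player ends up at least as well off as at the disagreement point, which is the Pareto-improvement property mentioned above. The final division shifts with the disagreement point, so an unrealistic disagreement point yields a division that players may not actually prefer to walking away.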

However, in reality, the outcomes of conflicts we avoid remain unknown. The realistic disagreement point is difficult to define or measure if in reality agreement is achieved.

So perhaps we should suppose that agreement cannot always be reached, and base our disagreement point on the observed consequences of bargaining failure. 
