Paul Christiano


Iterated Amplification


Agency in Conway’s Game of Life

It seems like our physics has a few fundamental characteristics that change the flavor of the question:

  • Reversibility. This implies that the task must be impossible on average---you can only succeed under some assumption about the environment (e.g. sparsity).
  • Conservation of energy/mass/momentum (which seem fundamental to the way we build and defend structures in our world).

I think this is an interesting question, but if you're poking around it would probably be nicer to work with simple rules that share (at least) these features of physics.
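To make the contrast with Life concrete: Life's rules are not reversible, since distinct states can share a successor. Here is a minimal sketch of that fact, assuming a small toroidal grid; the grid size and the names `life_step`, `SIZE`, `empty`, and `lone_cell` are illustrative choices, not anything from the post.

```python
from itertools import product


def life_step(live, size):
    """Advance one Game of Life generation on a size x size torus.

    `live` is a set of (row, col) coordinates of live cells.
    The toroidal wrap-around is an assumption made to keep the grid finite.
    """
    counts = {}
    for (r, c) in live:
        for dr, dc in product((-1, 0, 1), repeat=2):
            if (dr, dc) != (0, 0):
                cell = ((r + dr) % size, (c + dc) % size)
                counts[cell] = counts.get(cell, 0) + 1
    # Birth on exactly 3 live neighbours, survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}


SIZE = 8
empty = set()
lone_cell = {(3, 3)}

# Two different states collapse to the same successor,
# so the update rule is not injective and cannot be run backwards.
succ_empty = life_step(empty, SIZE)
succ_lone = life_step(lone_cell, SIZE)
print(succ_empty == succ_lone)
```

Because the update map is many-to-one, information about the past is destroyed at every step, which is exactly the property that reversible rules rule out.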

Low-stakes alignment

I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward is bigger than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.

This certainly raises a lot of questions though---what form do these states take? How do I specify a reward function that takes as input a state of the world?

I agree that "actually trying" is still hard to define, though you could avoid that messiness by saying that the goal is to provide a reward such that any optimal policy for that reward would be beneficial / aligned (and then the assumption is that a policy that is "actually trying" to pursue the objective would not do as well as the optimal policy but would not be catastrophically bad).

I'm also quite scared of assuming optimality. For example, doing so would assume away sample complexity and would open up whole strategies (like arbitrarily big debate trees or debates against random opponents who happen to sometimes give good rebuttals) that I think should be off limits for algorithmic reasons regardless of the environment (and some of which are dead ends with respect to the full problem).

It feels like the low-stakes setting is also mostly assuming away embeddedness problems? I suppose it still includes e.g. cases where the AI system subtly changes the designer's preferences over the course of training, but it excludes e.g. direct modification of the reward, taking over the training process, etc.

I feel like low-stakes makes a plausible empirical assumption under which it turns out to be possible to ignore many of the problems associated with embeddedness (because in fact the reward function is protected from tampering). But I'm much more scared about the other consequences of assuming a Cartesian boundary (where e.g. I don't even know the type signatures of the objects involved any more).

A way you could imagine this going wrong, that feels scary in the same way as the alternative problem statements, is if "are the decisions low stakes?" is a function of your training setup, so that you could unfairly exploit the magical "reward functions can't be tampered with" assumption to do something unrealistic.

But part of why I like the low stakes assumption is that it's about the problem you face. We're not assuming that every reward function can't be tampered with, just that there is some real problem in the world that has low stakes. If your algorithm introduces high stakes internally then that's your problem and it's not magically assumed away.

This isn't totally fair because the utility function U in the low-stakes definition depends on your training procedure, so you could still be cheating. But I feel much, much better about it.

I think this is basically what you were saying with:

That being said, one way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem).

That seems like it might capture the core of why I like the low-stakes assumption. It's what makes it so that you can't exploit the assumption in an unfair way, and so that your solutions aren't going to systematically push up against unrealistic parts of the assumption.

Low-stakes alignment

Sounds good/accurate.

It seems like there are other ways to get similarly clean subproblems, like "assume that the AI system is trying to optimize the true reward function".

My problem with this formulation is that it's unclear what "assume that the AI system is trying to optimize the true reward function" means---e.g. what happens when there are multiple reasonable generalizations of the reward function from the training distribution to a novel input?

I guess the natural definition is that we actually give the algorithm designer a separate channel to specify a reward function directly rather than by providing examples. It's not easy to do this (note that a reward function depends on the environment, it's not something we can specify precisely by giving code, and the details about how the reward is embedded in the environment are critical; also note that "actually trying" then depends on the beliefs of the AI in a way that is similarly hard to define), and I have some concerns with various plausible ways of doing that.

AMA: Paul Christiano, alignment researcher

I think most people have expectations regarding e.g. how explicitly will systems represent their preferences, how much will they have preferences, how will that relate to optimization objectives used in ML training, how well will they be understood by humans, etc.

Then there's a bunch of different things you might want: articulations of particular views on some of those questions, stories that (in virtue of being concrete) show a whole set of guesses and how they can lead to a bad or good outcome, etc. My bullet points were mostly regarding the exercise of fleshing out a particular story (which is therefore most likely to be wrong), rather than e.g. thinking about particular questions about the future.

AMA: Paul Christiano, alignment researcher

Don't read too much into it. I do dislike Boston weather.

AMA: Paul Christiano, alignment researcher

On that perspective I guess by default I'd think of a threat as something like "This particular team of hackers with this particular motive" and a threat model as something like "Maybe they have one or two zero days, their goal is DoS or exfiltrating information, they may have an internal collaborator but not one with admin privileges..." And then the number of possible threat models is vast even compared to the vast space of threats.

AMA: Paul Christiano, alignment researcher

I mostly don't think this thing is a major issue. I'm not exactly sure where I disagree, but some possibilities:

  • H isn't some human isolated from the world, it's an actual process we are implementing (analogous to the current workflow involving external contractors, lots of discussion about the labeling process and what values it might reflect, discussions between contractors and people who are structuring the model, discussions about cases where people disagree)
  • I don't think H is really generalizing OOD (I don't think any of my proposals rely on that); you are actually collecting human data on the kinds of questions that matter. So the scenario you are talking about is something like the actual people who are implementing H---real people who actually exist and we are actually working with---are being offered payments or extorted or whatever by the datapoints that the actual ML is giving them. That would be considered a bad outcome on many levels (e.g. man that sounds like it's going to make the job stressful), and you'd be flagging models that systematically produce such outputs (if all is going well they shouldn't be upweighted), and coaching contractors and discussing the interesting/tricky cases and so on.
  • H is just not making that many value calls, they are mostly implemented by the process that H answers. Similarly, we're just not offloading that much of the substantive work to H (e.g. they don't need to be super creative or wise, we are just asking them to help construct a process that responds appropriately to evidence).
  • I don't really know what kind of opportunity cost you have in mind. Yes, if we hire contractors and can't monitor their work they will sometimes do a sloppy job. And indeed if someone from an ML team is helping run an oversight process there might be some kinds of inputs where they don't care and slack off? But there seems to be a big mismatch between the way this scenario is being described and a realistic process for producing training data.
  • Most of the errors that H might make don't seem like they contribute to large-scale consequentialist behavior within HCH, and mostly just don't seem like a big deal or serious problem. We think a lot about kinds of errors that H might make that aren't noise, e.g. systematic divergences between what contractors do and what we want them to do, and it seems easy for them to be worse than random (and that's something we can monitor) but there's a lot of room between that and "undermines benignness."

Overall it seems like the salient issue is whether sufficiently ML-optimized outputs can lead to malign behavior by H (in which case it is likely also leading to crazy stuff in the outside world), but I don't think that motivational issues for H are a large part of the story (those cases would be hard for any humans, and this is a smaller source of variance than other kinds of variation in H's competence or our other tools for handling scary dynamics in HCH).

AMA: Paul Christiano, alignment researcher

No idea other than playing a bunch of games (might as well play the current version; old dailies are probably best) and maybe looking at solutions when you get stuck. Might also just run through a bunch of games and highlight the main important interactions and themes for each of them, e.g. Innovation + Public Works + Reverberate or Hatchery + Till. I think on any given board (and for the game in general) it's best to work backwards from win conditions, then midgames, and then openings.

AMA: Paul Christiano, alignment researcher

We'll do the cost-benefit analysis and over time it will look like a good career for a smaller and smaller fraction of people (until eventually basically everyone for whom it looks like a good idea is already doing it).

That could kind of qualitatively look like "something else is more important," or "things kind of seem under control and it's getting crowded," or "there's no longer enough money to fund scaleup." Of those, I expect "something else is more important" to be the first to go (though it depends a bit on how broadly you interpret "from AI": if anything related to the singularity / radically accelerating growth is classified as "from AI" then it may be a core part of the EA careers shtick kind of indefinitely, with most of the action in which of the many crazy new aspects of the world people are engaging with).

AMA: Paul Christiano, alignment researcher

I've created 3 blogs in the last 10 years and 1 blog in the preceding 5 years. It seems like 1-2 is a good guess. (A lot depends on whether there ends up being an ARC blog or it just inherits
