*This post is the result of work I did with Paul Christiano on the ideas in his “Teaching ML to answer questions honestly instead of predicting human answers” post. In addition to expanding upon what is in that post in terms of identifying numerous problems with the proposal there and identifying ways in which some of those problems can be patched, I think that this post also provides a useful window into what Paul-style research looks like from a non-Paul perspective.*

Recommended prior reading: “A naive alignment strategy and optimism about generalization” and “Teaching ML to answer questions honestly instead of predicting human answers” (though if you struggled with “Teaching ML to answer questions honestly,” I reexplain things in a more precise way here that might be clearer...

I'm talking about these agents (LW thread here)

I'd love an answer either in operations (MIPS, FLOPS, whatever) or in dollars.

Follow-up question: How many parameters did their agents have?

I just read the paper (incl. appendix) but didn't see them list the answer anywhere. I suspect I could figure it out from information in the paper, e.g. by adding up how many neurons are in their LSTMs, their various other bits, etc. and then multiplying by how long they said they trained for, but I lack the ML knowledge to do this correctly.

Some tidbits from the paper:

For multi-agent analysis we took the final generation of the agent (generation 5) and created equally spaced checkpoints (copies of the neural network parameters) every 10 billion steps, creating a collection of 13 checkpoints.

This suggests 120 billion steps of...

I have a guesstimate for number of parameters, but not for overall compute or
dollar cost:
Each agent was trained on 8 TPUv3's, which cost about $5,000/mo according to a
quick google, and which seem to produce 90 TOPS
[https://en.wikipedia.org/wiki/Tensor_Processing_Unit], or about 10^14
operations per second. They say each agent does about 50,000 steps per second,
so that means about 2 billion operations per step. Each little game they play
lasts 900 steps if I recall correctly, which is about 2 minutes of subjective
time they say (I imagine they extrapolated from what happens if you run the game
at a speed such that the physics simulation looks normal-speed to us). So that
means about 7.5 steps per subjective second, so each agent requires about 15
billion operations per subjective second.
So... 2 billion operations per step suggests that these things are about the
size of GPT-2, i.e. about the size of a rat brain
[https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons]? If we care
about subjective time, then it seems the human brain maybe uses 10^15 FLOP per
subjective second
[https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit#heading=h.e3k724n81me]
, which is about 5 OOMs more than these agents.
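The arithmetic in this guesstimate can be checked in a few lines of Python (every input below is one of the rough figures quoted above, not an authoritative number):

```python
# Sanity-checking the guesstimate above; every input is a rough figure
# from this comment, not an authoritative number.
tpu_ops_per_sec = 90e12          # ~90 TOPS for the 8 TPUv3s
steps_per_sec = 50_000           # training steps per second (from the paper)
ops_per_step = tpu_ops_per_sec / steps_per_sec          # ~2 billion

steps_per_game = 900
subjective_secs_per_game = 120   # ~2 minutes of subjective time per game
steps_per_subjective_sec = steps_per_game / subjective_secs_per_game  # 7.5
ops_per_subjective_sec = ops_per_step * steps_per_subjective_sec      # ~1.5e10

print(f"{ops_per_step:.2g} ops/step, {ops_per_subjective_sec:.2g} ops/subjective-sec")
```

Against the ~10^15 FLOP per subjective second figure for the human brain, that is the ~5 OOM gap mentioned above.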

Do you mind sharing your guesstimate on number of parameters?
Also, do you have per chance guesstimates on number of parameters / compute of
other systems?

I did, sorry -- I guesstimated FLOP/step and then figured parameters is probably a bit less than 1 OOM less than that. But since this is recurrent maybe it's even less? IDK. My guesstimate is shitty and I'd love to see someone do a better one!

Michael Dennis tells me that population-based training typically sees strong
diminishing returns to population size, such that he doubts that there were more
than one or two dozen agents in each population/generation. This is consistent
with AlphaStar I believe, where the number of agents was something like that
IIRC...
Anyhow, suppose 30 agents per generation. Then that's a cost of $5,000/mo x 1.3
months x 30 agents = $195,000 to train the fifth generation of agents. The
previous two generations were probably quicker and cheaper. In total the price
is probably, therefore, something like half a million dollars of compute?
This seems surprisingly low to me. About one order of magnitude less than I
expected. What's going on? Maybe it really was that cheap. If so, why? Has the
price dropped since AlphaStar? Probably... It's also possible this just used
less compute than AlphaStar did...
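For what it's worth, the cost arithmetic above is just (all inputs are the guesses from this comment):

```python
# Back-of-envelope training cost; every input is a guess from this comment.
tpu_cost_per_month = 5_000       # $/mo for 8 TPUv3s, per a quick google
months = 1.3
agents_per_generation = 30       # guessed population size
gen5_cost = tpu_cost_per_month * months * agents_per_generation
print(f"generation-5 cost: ${gen5_cost:,.0f}")  # $195,000
# Earlier generations were quicker and cheaper, so total compute cost
# plausibly lands around half a million dollars.
```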

Makes sense given the spinning-top [https://arxiv.org/abs/2004.09468] topology
of games. These tasks are probably not complex enough to need a lot of distinct
agents/populations to traverse the wide part to reach the top where you then
need little diversity to converge on value-equivalent models.
One observation: you can't run SC2 environments on a TPU, whereas when you can pack
the environment and agents together onto a TPU and batch everything with no
copying, you use the hardware closer to its full potential
[https://www.gwern.net/notes/Faster#gwern-notes-sparsity], see the Podracer
[https://arxiv.org/abs/2104.06272#deepmind] numbers.

Also for comparison, I think this means these models were about twice as big as
AlphaStar. That's interesting.

*(Warning: this post is rough and in the weeds. I expect most readers should skip it and wait for a clearer synthesis later.)*

In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would say” rather than “answer honestly.” In this post I want to describe another problem that feels very similar but may require new ideas to solve.

In brief, I’m interested in the case where:

- The simplest way for an AI to answer a question is to first translate from its internal model of the world into the human’s model of the world (so that it can talk about concepts like “tree” that may not exist in its native model of the world).
- The simplest way to translate between the

Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.

Abstracting out Answer, let's just imagine that our AI outputs a distribution over the space of trajectories in the human ontology, and somehow we define a reward function evaluated by the human in hindsight after getting the observation. The idea is that this is calculated by having the A...

Causal structure is an intuitively appealing way to pick out the "intended"
translation between an AI's model of the world and a human's model. For example,
intuitively "There is a dog" causes "There is a barking sound." If we ask our
neural net questions like "Is there a dog?" and it computes its answer by
checking "Does a human labeler think there is a dog?" then its answers won't
match the expected causal structure---so maybe we can avoid these kinds of
answers.
What does that mean if we apply typical definitions of causality to ML training?
* If we define causality in terms of interventions, then this helps iff we have
interventions in which the labeler is mistaken. In general, it seems we could
just include examples with such interventions in the training set.
* Similarly, if we use some kind of closest-possible-world semantics, then we
need to be able to train models to answer questions consistently about nearby
worlds in which the labeler is mistaken. It's not clear how to train a system
to do that. Probably the easiest is to have a human labeler in world X
talking about what would happen in some other world Y, where the labeling
process is potentially mistaken. (As in "decoupled rl
[https://arxiv.org/pdf/1705.08417.pdf]" approaches.) However, in this case it
seems liable to learn the "instrumental policy" that asks "What does a human
in possible world X think about what would happen in world Y?" which seems
only slightly harder than the original.
* We could talk about conditional independencies that we expect to remain
robust on new distributions (e.g. in cases where humans are mistaken). I'll
discuss this a bit in a reply.
Here's an abstract example to think about these proposals, just a special case
of the example from this post
[https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches]
.
* Suppose that reality M is described as a causal graph X --> A -->

This is also a way to think about the proposals in this post and the reply
[https://www.alignmentforum.org/posts/GxzEnkSFL5DnQEAsZ/paulfchristiano-s-shortform?commentId=swxCRdj3amrQjYJZD]
:
* The human believes that A' and B' are related in a certain way for
simple+fundamental reasons.
* On the training distribution, all of the functions we are considering
reproduce the expected relationship. However, the reason that they reproduce
the expected relationship is quite different.
* For the intended function, you can verify this relationship by looking at the
link (A --> B) and the coarse-graining applied to A and B, and verify that
the probabilities work out. (That is, I can replace all of the rest of the
computational graph with nonsense, or independent samples, and get the same
relationship.)
* For the bad function, you have to look at basically the whole graph. That is,
it's not the case that the human's beliefs about A' and B' have the right
relationship for arbitrary Ys, they only have the right relationship for a
very particular distribution of Ys. So to see that A' and B' have the right
relationship, we need to simulate the actual underlying dynamics where A -->
B, since that creates the correlations in Y that actually lead to the
expected correlations between A' and B'.
* It seems like we believe not only that A' and B' are related in a certain
way, but that the relationship should be for simple reasons, and so there's a
real sense in which it's a bad sign if we need to do a ton of extra compute
to verify that relationship. I still don't have a great handle on that kind
of argument. I suspect it won't ultimately come down to "faster is better,"
though as a heuristic that seems to work surprisingly well. I think that this
feels a bit more plausible to me as a story for why faster would be better
(but only a bit).
* It's not always going to be quite this cut and dried---depending on the
structu

So are there some facts about conditional independencies that would privilege
the intended mapping? Here is one option.
We believe that A' and C' should be independent conditioned on B'. One problem
is that this isn't even true, because B' is a coarse-graining and so there are
in fact correlations between A' and C' that the human doesn't understand. That
said, I think that the bad map introduces further conditional correlations, even
assuming B=B'. For example, if you imagine Y preserving some facts about A' and
C', and if the human is sometimes mistaken about B'=B, then we will introduce
extra correlations between the human's beliefs about A' and C'.
I think it's pretty plausible that there are necessarily some "new" correlations
in any case where the human's inference is imperfect, but I'd like to understand
that better.
So I think the biggest problem is that none of the human's believed conditional
independencies actually hold---they are only approximate, and (more
problematically) they may themselves only hold "on distribution" in some appropriate sense.
This problem seems pretty approachable though and so I'm excited to spend some
time thinking about it.

Actually if A --> B --> C and I observe some function of (A, B, C) it's just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but means I want to be more careful about the definition of the case to ensure that it's actually difficult before concluding that this kind of conditional independence structure is potentially useful.
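This is easy to see numerically. Below is a toy chain A --> B --> C with binary variables (the construction is entirely illustrative and mine): A and C are independent given B, but once we also condition on the observation A + C they are not, so an observer's beliefs inherit the extra correlation.

```python
from itertools import product

# Toy chain A -> B -> C with binary variables (illustrative construction).
def p(a, b, c):
    pa = 0.5
    pb = 0.8 if b == a else 0.2   # B copies A with prob 0.8
    pc = 0.8 if c == b else 0.2   # C copies B with prob 0.8
    return pa * pb * pc

def cond(query, given):
    """P(query | given): both arguments are predicates on (a, b, c)."""
    num = sum(p(a, b, c) for a, b, c in product([0, 1], repeat=3)
              if query(a, b, c) and given(a, b, c))
    den = sum(p(a, b, c) for a, b, c in product([0, 1], repeat=3)
              if given(a, b, c))
    return num / den

# A and C are independent given B (holds exactly for the chain):
lhs = cond(lambda a, b, c: a == 1 and c == 1, lambda a, b, c: b == 1)
rhs = (cond(lambda a, b, c: a == 1, lambda a, b, c: b == 1)
       * cond(lambda a, b, c: c == 1, lambda a, b, c: b == 1))
print(abs(lhs - rhs) < 1e-12)   # True

# But after also observing S = A + C = 1, independence given B fails:
given = lambda a, b, c: b == 1 and a + c == 1
lhs2 = cond(lambda a, b, c: a == 1 and c == 1, given)   # 0: impossible when S = 1
rhs2 = cond(lambda a, b, c: a == 1, given) * cond(lambda a, b, c: c == 1, given)
print(lhs2, rhs2)   # 0.0 vs ~0.25, so the product form fails
```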

Suppose I am interested in finding a program M whose input-output behavior has
some property P that I can probabilistically check relatively quickly (e.g. I
want to check whether M implements a sparse cut of some large implicit graph). I
believe there is some simple and fast program M that does the trick. But even
this relatively simple M is much more complex than the specification of the
property P.
Now suppose I search for the simplest program running in time T that has
property P. If T is sufficiently large, then I will end up getting the program
"Search for the simplest program running in time T' that has property P, then
run that." (Or something even simpler, but the point is that it will make no
reference to the intended program M since encoding P is cheaper.)
I may be happy enough with this outcome, but there's some intuitive sense in
which something weird and undesirable has happened here (and I may get in a
distinctive kind of trouble if P is an approximate evaluation). I think this is
likely to be a useful maximally-simplified example to think about.
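Here is one way to make the toy setup concrete (everything below is my own illustrative construction: "programs" are bitstrings, "running" one squares its integer value, and P is a cheap check on the output). The point is that the searcher's description needs only P plus an enumeration loop, so its length is independent of how long the intended program M is:

```python
# Toy model of "search for the simplest program satisfying property P"
# (illustrative construction, not from the text). "Programs" are bitstrings,
# "running" one squares its integer value, and P is a fast check on the output.
from itertools import product

def run(program):
    return int(program, 2) ** 2

def P(output):
    # A property that is cheap to specify and cheap to check.
    return output % 1000 == 169

def simplest_program_with_P(max_len):
    # Enumerate from simplest (shortest) to most complex; first hit wins.
    # Note this searcher's description length depends only on P, not on M.
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            program = "".join(bits)
            if P(run(program)):
                return program
    return None

print(simplest_program_with_P(8))  # "1101", i.e. 13, since 13**2 = 169
```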

This is interesting to me for two reasons:
* [Mainly] Several proposals for avoiding the instrumental policy work by
penalizing computation. But I have a really shaky philosophical grip on why
that's a reasonable thing to do, and so all of those solutions end up feeling
weird to me. I can still evaluate them based on what works on concrete
examples, but things are slippery enough that plan A is getting a handle on
why this is a good idea.
* In the long run I expect to have to handle learned optimizers by having the
outer optimizer instead directly learn whatever the inner optimizer would
have learned. This is an interesting setting to look at how that works out.
(For example, in this case the outer optimizer just needs to be able to
represent the hypothesis "There is a program that has property P and runs in
time T' " and then do its own search over that space of faster programs.)

In traditional settings, we are searching for a program M that is simpler than
the property P. For example, the number of parameters in our model should be
smaller than the size of the dataset we are trying to fit if we want the model
to generalize. (This isn't true for modern DL because of subtleties with SGD
optimizing imperfectly and implicit regularization and so on, but spiritually I
think it's still fine.)
But this breaks down if we start doing something like imposing consistency
checks and hoping that those change the result of learning. Intuitively it's
also often not true for scientific explanations---even simple properties can be
surprising and require explanation, and can be used to support theories that are
much more complex than the observation itself.
Some thoughts:
1. It's quite plausible that in these cases we want to be doing something other
than searching over programs. This is pretty clear in the "scientific
explanation" case, and maybe it's the way to go for the kinds of alignment
problems I've been thinking about recently.
A basic challenge with searching over programs is that we have to interpret
the other data. For example, if "correspondence between two models of
physics" is some kind of different object like a description in natural
language, then some amplified human is going to have to be thinking about
that correspondence to see if it explains the facts. If we search over
correspondences, some of them will be "attacks" on the human that basically
convince them to run a general computation in order to explain the data. So
we have two options: (i) perfectly harden the evaluation process against
such attacks, (ii) try to ensure that there is always some way to just
directly do whatever the attacker convinced the human to do. But (i) seems
quite hard, and (ii) basically requires us to put all of the generic
programs in our search space.
2. It's also quite plausible th

The speed prior [https://en.wikipedia.org/wiki/Speed_prior] is calibrated such
that this never happens if the learned optimizer is just using brute force---if
it needs to search over 1 extra bit then it will take 2x longer, offsetting the
gains.
That means that in the regime where P is simple, the speed prior is the "least
you can reasonably care about speed"---if you care even less, you will just end
up pushing the optimization into an inner process that is more concerned with
speed and is therefore able to try a bunch of options.
(However, this is very mild, since the speed prior cares only a tiny bit about
speed. Adding 100 bits to your program is the same as letting it run 2^100 times
longer, so you are basically just optimizing for simplicity.)
To make this concrete, suppose that I instead used the kind-of-speed prior,
where taking 4x longer is equivalent to using 1 extra bit of description
complexity. And suppose that P is very simple relative to the complexities of
the other objects involved. Suppose that the "object-level" program M has 1000
bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A
search that uses the speed prior will be able to find this algorithm in 2^3000
time, and so will have a kind-of-speed complexity of 1500 bits. So the
kind-of-speed prior will just end up delegating to the speed prior.
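The worked numbers above can be written out directly (a sketch of this comment's own arithmetic, with the delegating searcher's description length idealized to zero):

```python
# Kind-of-speed prior: 4x runtime = 1 extra bit of complexity,
# i.e. complexity = description_bits + log4(time) = description_bits + log2(time)/2.
def kind_of_speed_bits(description_bits, log2_time):
    return description_bits + log2_time / 2

# Object-level program M: 1000 bits, running in 2^2000 steps.
direct = kind_of_speed_bits(1000, 2000)

# Delegating to a brute-force speed-prior search: its own description is
# idealized to ~0 bits; trying all 2^1000 candidate programs for 2^2000
# steps each takes 2^3000 steps total.
delegated = kind_of_speed_bits(0, 3000)

print(direct, delegated)  # 2000.0 vs 1500.0: the delegating search wins
```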

The speed prior still delegates to better search algorithms though. For example,
suppose that someone is able to fill in a 1000 bit program using only 2^500
steps of local search. Then the local search algorithm has speed prior
complexity 500 bits, so will beat the object-level program. And the prior we'd
end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1
more bit," i.e. we end up caring more about speed because we delegated.
The actual limit on how much you care about speed is given by whatever search
algorithms work best. I think it's likely possible to "expose" what is going on
to the outer optimizer (so that it finds a hypothesis like "This local search
algorithm is good" and then uses it to find an object-level program, rather than
directly finding a program that bundles both of them together). But I'd guess
intuitively that it's just not even meaningful to talk about the "simplest"
programs or any prior that cares less about speed than the optimal search
algorithm.

This is a linkpost for https://arxiv.org/abs/1912.01683

Previously: *Seeking Power Is Often Robustly Instrumental In MDPs*

**Key takeaways**.

- The structure of the agent's environment often causes instrumental convergence.
- **In many situations, there are (potentially combinatorially) many ways for power-seeking to be optimal, and relatively few ways for it not to be optimal.**
- My previous results said something like: in a range of situations, when you're maximally uncertain about the agent's objective, this uncertainty assigns high probability to objectives for which power-seeking is optimal.
- My new results prove that in a range of situations, seeking power is optimal for *most* agent objectives (for a particularly strong formalization of 'most').

More generally, the new results say something like: in a range of situations, for most beliefs you could have about the agent's objective, these beliefs assign high probability to reward functions


Added to the post:

Relatedly [to power-seeking under the simplicity prior], Rohin Shah wrote:

> if you know that an agent is maximizing the expectation of an explicitly represented utility function, I would expect that to lead to goal-driven behavior most of the time, since the utility function must be relatively simple if it is explicitly represented, and simple utility functions seem particularly likely to lead to goal-directed behavior.

I've been poking at Evan's Clarifying Inner Alignment Terminology. His post gives two separate pictures (the objective-focused approach, which he focuses on, and the generalization-focused approach, which he mentions at the end). We can consolidate those pictures into one and-or graph as follows:

And-or graphs make explicit which subgoals are jointly sufficient, by drawing an arc between those subgoal lines. So, for example, this claims that *intent alignment + capability robustness* would be sufficient for *impact alignment*, but alternatively, *outer alignment + robustness* would also be sufficient.
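As a sketch, that and-or structure can be written down directly (the node names come from the diagram; the representation and helper function are my own):

```python
# Minimal and-or graph: each goal maps to a list of alternative subgoal sets,
# where every set is jointly sufficient (an "and" arc over its members).
# Node names follow the post; the data structure itself is an illustrative sketch.
AND_OR = {
    "impact alignment": [
        {"intent alignment", "capability robustness"},   # one sufficient set
        {"outer alignment", "robustness"},               # an alternative set
    ],
}

def sufficient(goal, achieved, graph=AND_OR):
    """A goal holds if achieved directly, or if some subgoal set all holds."""
    if goal in achieved:
        return True
    return any(all(sufficient(g, achieved, graph) for g in subgoals)
               for subgoals in graph.get(goal, []))

print(sufficient("impact alignment", {"outer alignment", "robustness"}))  # True
print(sufficient("impact alignment", {"intent alignment"}))               # False
```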

The red represents what belongs entirely to the generalization-focused path. The yellow represents what belongs entirely to the objective-focused path. The blue represents everything else. (In this diagram, all the blue is on *both* paths, but that will not be the case...

For a while, I've thought that the strategy of "split the problem into a complete set of necessary sub-goals" is incomplete. It produces problem factorizations, but it's not sufficient to produce *good* problem factorizations - it usually won't cut reality at clean joints. That was my main concern with Evan's factorization, and it also applies to all of these, but I couldn't quite put my finger on what the problem was.

I think I can explain it now: when I say I want a factorization of alignment to "cut reality at the joints", I think what I mean is that each ... (read more)

I like the addition of the pseudo-equivalences; the graph seems a lot more
accurate as a representation of my views once that's done.
I'm not too keen on (2) since I don't expect mesa objectives to exist in the
relevant sense. For (1), I'd note that we need to get it right on the situations
that actually happen, rather than all situations. We can also have systems that
only need to work for the next N timesteps, after which they are retrained again
given our new understanding of the world; this effectively limits how much
distribution shift can happen. Then we could do some combination of the
following:
1. Build neural net theory. We currently have a very poor understanding of why
neural nets work; if we had a better understanding it seems plausible we
could have high confidence in when a neural net would generalize correctly.
(I'm imagining that neural net theory goes from something like physics
before Newton to something like physics after Newton.)
2. Use techniques like adversarial training to "robustify" the model against
moderate distribution shifts (which might be sufficient to work for the next
N timesteps, after which you "robustify" again).
3. Make these techniques work better through interpretability / transparency.
4. Use checks and balances. For example, if multiple generalizations are
possible, train an ensemble of models and only do something if they all
agree on it. Or train an actor agent combined with an overseer agent that
has veto power over all actions. Or an ensemble of actors, each of which
oversees the other actors and has veto power over them.
These aren't "clean", in the sense that you don't get a nice formal guarantee at
the end that your AI system is going to (try to) do what you want in all
situations, but I think getting an actual literal guarantee is pretty doomed
anyway (among other things, it seems hard to get a definition for "all
situations" that avoids the no-free-lunch theorem, though I sup
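The "checks and balances" idea in (4) can be sketched in a few lines (a toy gate; the stand-in models and thresholds below are purely illustrative):

```python
# Toy version of idea (4): only act when every model in an ensemble agrees;
# otherwise fall back / escalate. Details here are illustrative only.
def gated_action(models, observation, fallback="defer-to-human"):
    proposals = [m(observation) for m in models]
    if all(p == proposals[0] for p in proposals):
        return proposals[0]        # unanimous: act
    return fallback                # disagreement: don't act

# Three stand-in "models" that generalize differently off-distribution:
models = [
    lambda x: "left" if x < 10 else "right",   # one learned threshold
    lambda x: "left" if x < 20 else "right",   # a different threshold
    lambda x: "left",                          # a third generalization
]
print(gated_action(models, 3))    # "left": all agree on-distribution
print(gated_action(models, 50))   # "defer-to-human": they disagree off-distribution
```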

Same, but how optimistic are you that we could figure out how to shape the
motivations or internal "goals" (much more loosely defined than
"mesa-objective") of our models via influencing the training objective/reward,
the inductive biases of the model, the environments they're trained in, some
combination of these things, etc.?
Yup, if you want "clean," I agree that you'll have to either assume a
distribution over possible inputs, or identify a perturbation set over possible
test environments to avoid NFL.

That seems great, e.g. I think by far the best thing you can do is to make sure
that you finetune using a reward function / labeling process that reflects what
you actually want (i.e. what people typically call "outer alignment"). I
probably should have mentioned that too, I was taking it as a given but I really
shouldn't have.
For inductive biases + environments, I do think controlling those appropriately
would be useful and I would view that as an example of (1) in my previous
comment.

But it seems to me that there's something missing in terms of acceptability.
The definition of "objective robustness" I used says "aligns with the base
objective" (including off-distribution). But I think this isn't an appropriate
representation of your approach. Rather, "objective robustness" has to be
defined something like "generalizes acceptably". Then, ideas like adversarial
training and checks and balances make sense as a part of the story.
WRT your suggestions, I think there's a spectrum from "clean" to "not clean",
and the ideas you propose could fall at multiple points on that spectrum
(depending on how they are implemented, how much theory backs them up, etc). So,
yeah, I favor "cleaner" ideas than you do, but that doesn't rule out this path
for me.

Yeah, strong +1.

Great! I feel like we're making progress on these basic definitions.

Shouldn’t this be “intent alignment + capability robustness or outer alignment +
robustness”?
Btw, I plan to post more detailed comments in response here and to your other
post, just wanted to note this so hopefully there’s no confusion in interpreting
your diagram.

Yep, fixed.

Then I'm confused what you meant by

Seems like if the different heads do not share weights then "the parameters in f1" is perfectly well-defined?

Yeah, sorry, by "conditioning" there I meant "assuming that the algorithm correctly chose the right world mod...