The following is a partially redacted and lightly edited transcript of a chat conversation about AGI between Eliezer Yudkowsky and a set of invitees in early September 2021. By default, all other participants are anonymized as "Anonymous".

I think this Nate Soares quote (excerpted from Nate's response to a report by Joe Carlsmith) is a useful context-setting preface regarding timelines, which weren't discussed as much in the transcript:

[...] My odds [of AGI by the year 2070] are around 85% [...]

I can list a handful of things that drive my probability of AGI-in-the-next-49-years above 80%:

1. 50 years ago was 1970. The gap between AI systems then and AI systems now seems pretty plausibly greater than the remaining gap, even before accounting for the recent dramatic increase in the rate of progress,

*Posted in my personal capacity*

The AGI governance community has recently converged on compute governance^{[1]} as a promising lever for reducing existential risks from AI.

One likely building block for any maximally secure compute governance regime is **stock and flow accounting of (some kinds of) compute**: i.e., requiring realtime accurate declaration to regulators of who possesses which uniquely numbered regulated chips, with penalties for undeclared or unauthorized^{[2]} transfers.
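To make the data model concrete, here is a minimal sketch (entirely my own illustration; the `ChipRegistry` class and its methods are hypothetical, not from any proposal): stock and flow accounting amounts to a ledger mapping each uniquely numbered chip to its declared holder, with transfers rejected unless declared and authorized.

```python
# Hypothetical sketch of a stock-and-flow ledger for regulated chips.
class ChipRegistry:
    def __init__(self):
        self.holder = {}  # chip serial number -> declared current holder

    def declare(self, serial: str, owner: str) -> None:
        """Register a chip with the regulator (the 'stock' side)."""
        self.holder[serial] = owner

    def transfer(self, serial: str, new_owner: str, authorized: bool) -> None:
        """Record a transfer (the 'flow' side); undeclared or
        unauthorized transfers are rejected (and would incur penalties)."""
        if serial not in self.holder:
            raise ValueError(f"chip {serial} was never declared")
        if not authorized:
            raise ValueError(f"unauthorized transfer of chip {serial}")
        self.holder[serial] = new_owner

registry = ChipRegistry()
registry.declare("GPU-0001", "Lab A")
registry.transfer("GPU-0001", "Lab B", authorized=True)
print(registry.holder["GPU-0001"])  # Lab B
```

The real design questions (tamper-resistance of serial numbers, audit frequency, cross-border enforcement) are exactly what the historical analogies below are meant to illuminate.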

To understand the optimal design and feasibility of such a regime, we seek historical analogies for similar regimes. Ones that we are already familiar with include:

- Fissile nuclear material and other nuclear weapons components
- Firearms
- Some financial instruments
- Automobiles
- Real estate

**What are other good existing or historical analogies for compute stock and flow accounting**? An ideal analogy will have many of the following traits:^{[3]}

- The thing being tracked

Counterfeit tracking (e.g. for high-end clothing) could be another domain that has confronted this sort of tracking problem, though I'm not sure whether they do it via accounting versus e.g. tagging each individual piece of clothing.

(This post is largely a write-up of a conversation with Scott Garrabrant.)

How do we build stable pointers to values?

As a first example, consider the wireheading problem for AIXI-like agents in the case of a fixed utility function which we know how to estimate from sense data. As discussed in Daniel Dewey's Learning What to Value and other places, if you try to implement this by putting the utility calculation in a box which rewards an AIXI-like RL agent, the agent can eventually learn to modify or remove the box, and happily does so if it can get more reward by doing so. This is because the RL agent predicts, and attempts to maximize, reward received. If it understands that it can modify the reward-giving

...The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?".

I think that it generally seems like a good idea to have solid theories of two different things:

1. *What is the thing* we are hoping to teach the AI?
2. *What is the training story* by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order...


I think I don't understand what you mean by (2), and as a consequence, don't
understand the rest of this paragraph?
WRT (1), I don't think I was being careful about the distinction in this post,
but I do think the following:
The problem of wireheading is certainly not that RL agents are trying to take
control of their reward feedback by definition; I agree with your complaint
about Daniel Dewey as quoted. It's a false explanation of why wireheading is a
concern.
The problem of wireheading is, rather, that none of the feedback the system gets
can disincentivize (i.e., provide differentially more loss for) models which are
making this mistake. To the extent that the training story is about ruling out
bad hypotheses, or disincentivizing bad behaviors, or providing differentially
more loss for undesirable models compared to more-desirable models, RL can't do
that with respect to the specific failure mode of wireheading. Because an
accurate model of the process actually providing the reinforcements will always
do at least as well in predicting those reinforcements as alternative models
(assuming similar competence levels in both, of course, which I admit is a bit
fuzzy).
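A toy rendering of this point (my own illustration, not from the comment; all names are made up): on the training distribution, a model tracking the intended concept and a model of the feedback process itself make identical predictions about the feedback, so the feedback assigns them identical loss and cannot disincentivize the latter.

```python
# Two "models" that predict the training feedback equally well, so no
# amount of that feedback can differentially penalize the one that is
# modeling the reward process rather than the intended concept.
def intended_model(situation):
    # Predicts reward by tracking the concept the designers wanted.
    return situation["task_done"]

def feedback_model(situation):
    # Predicts reward by modeling the physical process emitting reward.
    return situation["reward_signal"]

# On-distribution, reward is emitted exactly when the task is done,
# so both models incur the same loss and are indistinguishable.
training_data = [{"task_done": 1, "reward_signal": 1},
                 {"task_done": 0, "reward_signal": 0}]
loss = lambda m: sum((m(s) - s["reward_signal"]) ** 2 for s in training_data)
assert loss(intended_model) == loss(feedback_model)
```

Off-distribution (e.g. when tampering with the reward box becomes possible) the two models diverge, but by then the training signal has already failed to select between them.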

I've recently had several conversations about John Wentworth's post The Pointers Problem. I think there is some confusion about this post, because there are several related issues, which different people may take as primary. All of these issues are important to "the pointers problem", but John's post articulates a specific problem in a way that's not quite articulated anywhere else.

I'm aiming, here, to articulate the cluster of related problems, and say a few new-ish things about them (along with a lot of old things, hopefully put together in a new and useful way). I'll indicate which of these problems John was and wasn't highlighting.

This whole framing *assumes* we are interested in something like value learning / value loading. Not all approaches rely on this. I am not...


Suppose that many (EDIT: a few) of your value shards take as input the ghost
latent variable in your world model. You learn ghosts aren't real. Let's say
this basically sets the ghost-related latent variable value to false in all
shard-relevant contexts. Then it seems perfectly fine that most of my shards
keep on bidding away and determining my actions (e.g. protect my family), since
most of my value shards are not in fact functions of the ghost latent variable.
While it's indeed possible to contrive minds where most of their values are
functions of a variable in the world model which will get removed by the
learning process, it doesn't seem particularly concerning to me. (But I'm also
probably not trying to tackle the problems in this post, or the superproblems
which spawned them.)
This doesn't seem relevant for non-AIXI RL agents which don't end up caring
about reward or explicitly weighing hypotheses over reward as part of their
motivational structure? Did you intend it to be?


With almost any kind of feedback process (IE: any concrete proposals that I know
of), similar concerns arise. As I argue here
[https://www.lesswrong.com/posts/yLTpo828duFQqPJfy/builder-breaker-for-deconfusion#Example__Wireheading]
, wireheading is one example of a very general failure mode. The failure mode is
roughly: the process actually generating feedback is, too literally, identified
with the truth/value which that feedback is trying to teach.
Output-based evaluation (including supervised learning, and the most popular
forms of unsupervised learning, and a lot of other stuff which treats models as
black boxes implementing some input/output behavior or probability distribution
or similar) can't distinguish between a model which is internalizing the desired
concepts, vs a model which is instead modeling the actual feedback process
instead. These two do different things, but not in a way that the feedback
system can differentiate.
In terms of shard theory, as I understand it, the point is that (absent
arguments to the contrary, which is what we want to be able to construct),
shards that implement feedback-modeling like this cannot be disincentivized by
the feedback process, since they perform very well in those terms. Shards which
do other things may or may not be disincentivized, but the feedback-modeling
shards (if any are formed at any point) definitely won't, unless of course
they're just not very good at their jobs.
So the problem, then, is: how do we arrange training such that those shards have
very little influence, in the end? How do we disincentivize that kind of
reasoning at all?
Plausibly, this should only be tackled as a knock-on effect of the real problem,
actually giving good feedback which points in the right direction; however, it
remains a powerful counterexample class which challenges many, many proposals.
(And therefore, trying to generate the analogue of the wireheading problem for a
given proposal seems like a good sanity check.)


Insofar as I understand your point, I disagree. In machine-learning terms, this
is the question of how to train an AI whose internal cognition reliably unfolds
into caring about people, in whatever form that takes in the AI's learned
ontology (whether or not it has a concept for people). If you commit to the
specific view of outer/inner alignment, then you also want your loss
function to "represent" that goal in some way
[https://www.lesswrong.com/posts/jnmG5jczvWbeRPcvG/four-usages-of-loss-in-ai].
I doubt this due to learning from scratch
[https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values]
. I think the question of "how do I identify what you want, in terms of a
utility function?" is a bit sideways due to people not in fact having utility
functions.[1] Insofar as the question makes sense, its answer
probably takes the form of inductive biases: I might learn to predict the world
via self-supervised learning and form concepts around other people having values
and emotional states due to that being a simple convergent abstraction
relatively pinned down by my training process, architecture, and data over my
life, also reusing my self-modelling abstractions. It would be quite unnatural
to model myself in one way (as valuing happiness) and others as having
"irrational" shards which "value" anti-happiness but still end up behaving as if
they value happiness. (That's not a sensible thing to say, on my ontology.)
I think it's worth considering how I might go about helping a person from an
uncontacted tribe who doesn't share my ontology. Conditional on them requesting
help from me somehow, and my wanting to help them, and my deciding to do so—how
would I carry out that process, internally?
(Not reading the rest at the moment, may leave more comments later)
1. ^ Human values take the form of decision-influences (i.e. shards)
[https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-hu

I said:

The basic idea behind *compressed pointers* is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans.

[...]

In machine-learning terms, this is the question of how to specify a loss function for the purpose of *learning* human values.

You said:

In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people).

Thinking ...


I think it is reasonable as engineering practice to try and make a fully
classically-Bayesian model of what we think we know about the necessary
inductive biases -- or, perhaps more realistically, a model which only violates
classic Bayesian definitions where necessary in order to represent what we want
to represent.
This is because writing down the desired inductive biases as an explicit prior
can help us to understand what's going on better.
It's tempting to say that to understand how the brain learns is to understand
how it treats feedback as evidence, and updates on that evidence. Of course,
there could certainly be other theoretical frames which are more productive. But
at a deep level, if the learning works, the learning works because the feedback
is evidence about the thing we want to learn, and the process which updates on
that feedback embodies (something like) a good prior telling us how to update on
that evidence.
And if that framing is wrong somehow, it seems intuitive to me that the problem
should be describable within that ontology. For example, I think "utility
function" is not a very good way to think about values because of the question
of what it is a function of: we don't have a commitment to a specific low-level
description of the universe which is appropriate as the input to a utility
function. We can easily move
beyond this by considering expected values as the "values/preferences"
representation, without worrying about what underlying utility function
generates those expected values.
(I do not take the above to be a knockdown argument against "committing to the
specific division between outer and inner alignment steers you wrong" -- I'm
just saying things that seem true to me and plausibly relevant to the debate.)


I expect you'll say I'm missing something, but to me, this sounds like a
language dispute. My understanding of your recent thinking holds that the
important goal is to understand how human learning reliably results in human
values. The Bayesian perspective on this is "figuring out the human prior",
because a prior is just a way-to-learn. You might object to the overly Bayesian
framing of that; but I'm fine with that. I am not dogmatic on orthodox
bayesianism
[https://www.lesswrong.com/posts/xJyY5QkQvNJpZLJRo/radical-probabilism-1]. I do
not even like utility functions
[https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions]
.
I am totally fine with saying "inductive biases" instead of "prior"; I think it
indeed pins down what I meant more accurately (by virtue of being, in itself, a
vaguer and less precise concept than "prior").

This is a linkpost for https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor

The authors apply an AlphaZero-like algorithm to discover new matrix multiplication algorithms. They do this by turning matrix multiplication into a one-player game, where the state represents how far from correct the current output is, moves are algorithmic instructions, and the reward is -1 per step (plus a terminal reward of -rank(final state), if the final state is not a zero tensor). On small matrices, they find that AlphaTensor can discover algorithms that use fewer scalar multiplications than the best known human-designed matrix multiplication algorithms. They apply this to find hardware-specific matmuls (by adding an additional reward equal to -time to the terminal state) that have a 10-20% larger speedup than Strassen's algorithm on NVIDIA V100s and TPU V2s (saving 4%/7.5% wall clock time).
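For context on the baseline being beaten: Strassen's classic scheme multiplies 2x2 (block) matrices with 7 scalar multiplications instead of the naive 8, and it is the pattern that AlphaTensor's 47-multiplication 4x4 algorithm generalizes. A minimal sketch over plain numbers (my own illustration):

```python
def strassen_2x2(A, B):
    """One level of Strassen's algorithm: multiply 2x2 matrices
    (given as nested lists) using 7 multiplications instead of 8."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Applied recursively to block matrices, this turns the naive n^3 runtime into n^(log2 7) ≈ n^2.807; AlphaTensor's search is over decompositions of exactly this kind.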

Paper abstract:

...Improving the efficiency


See also Scott Aaronson on experimental computational complexity theory (haha
it's a joke wait no maybe he's not joking wait what?)
https://scottaaronson.blog/?p=252


I'm surprised they got a paper out of this. The optimization problem they're
solving isn't actually that hard at small sizes (like the example in Deepmind's
post) and does not require deep learning; I played around with it just using a
vanilla solver from scipy a few years ago, and found similar results. I assume
the reason nobody bothered to publish results like Deepmind found is that they
don't yield a big-O speedup on recursive decompositions compared to just using
Strassen's algorithm; that was why I never bothered writing up the results from
my own playing around.
[ETA: actually they did find a big-O speedup over Strassen, see Paul below.]
Computationally brute-forcing the optimization problem for Strassen's algorithm
certainly isn't a new idea, and it doesn't look like the deep learning part
actually made any difference. Which isn't surprising; IIUC researchers in the
area generally expect that practically-useful further big-O improvements on
matmul will need a different functional form from Strassen (and therefore
wouldn't be in the search space of the optimization problem for Strassen-like
algorithms). The Strassen-like optimization problem has been pretty heavily
searched for decades now.


(Most of my comment was ninja'ed by Paul)
I'll add that I'm pretty sure that RL is doing something. The authors claim that
no one has applied search methods for 4x4 matrix multiplication or larger, and
the branching factor on brute-force search without a good heuristic grows
something like the 6th power of n? So it seems doubtful that such methods would
scale.
That being said, I agree that it's a bit odd not to do a head-to-head comparison
at equal compute. The authors just cite related work (which uses much less
compute) and claim superiority over it.


Their improved 4x4 matrix multiplication algorithm does yield improved
asymptotics compared to just using Strassen's algorithm. They do 47 multiplies
for 4x4 matrix multiplication, so after log(n)/log(4) rounds of decomposition
you get a runtime of 47^(log(n) / log(4)), which is less than 7^(log(n) /
log(2)) from Strassen.
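This exponent arithmetic can be checked directly (a quick sketch of my own; `matmul_exponent` is a made-up helper name): a base algorithm multiplying k x k matrices with m scalar multiplications gives, under recursion, a runtime of n^(log m / log k).

```python
import math

def matmul_exponent(k: int, m: int) -> float:
    """Asymptotic exponent from recursively applying a k x k base
    algorithm that uses m scalar multiplications: omega = log_k(m)."""
    return math.log(m) / math.log(k)

strassen = matmul_exponent(2, 7)       # Strassen: 2x2 with 7 multiplies
alphatensor = matmul_exponent(4, 47)   # AlphaTensor: 4x4 with 47 multiplies
print(round(strassen, 4), round(alphatensor, 4))  # 2.8074 2.7773
```

So the 47-multiply 4x4 algorithm does give a strictly smaller exponent than recursing on Strassen alone.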
Of course this is not state of the art asymptotics because we know other bigger
improvements over Strassen for sufficiently large matrices. I'm not sure what
you mean by "different functional form from Strassen" but it is known that you
can approximate the matrix multiplication exponent arbitrarily well by
recursively applying an efficient matrix multiplication algorithm for
constant-sized matrices.
People do use computer-assisted search to find matrix multiplication algorithms,
and as you say the optimization problem has been studied extensively. As far as
I can tell the results in this paper are better than anything that is known for
4x4 or 5x5 matrices, and I think they give the best asymptotic performance of
any explicitly known multiplication algorithm on small matrices. I might be
missing something, but if not then I'm quite skeptical that you got anything
similar.
As I mentioned, we know better algorithms for sufficiently large matrices. But
for 8k x 8k matrices in practice I believe that 1-2 rounds of Strassen is state
of the art. It looks like the 47-multiply algorithm they find is not better than
Strassen in practice on gpus at that scale because of the cost of additions and
other practical considerations. But they also do an automated search based on
measured running time rather than tensor rank alone, and they claim to find an
algorithm that is ~4% faster than their reference implementation of Strassen for
8k matrices on a v100 (which is itself ~4% faster than the naive matmul).
This search also probably used more compute than existing results, and that may
be a more legitimate basis for a complaint. I don't know if they report
com


Ah, guess I should have looked at the paper. I foolishly assumed that if they
had an actual big-O improvement over Strassen, they'd mention that important
fact in the abstract and blog post.


Any improvement over Strassen on 4x4 matrices represents an asymptotic
improvement over Strassen.

Indeed. Unfortunately, I didn't catch that when skimming.


Beyond that, it seems TensorFlow and PyTorch don't even bother to use Strassen's
algorithm over naive N^3 matrix multiplication (or perhaps something
Strassen-like is used in the low-level GPU circuits?).

One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents?), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out.

Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards w...


I often get the impression that people weigh off e.g. doing shard theory
alignment strategies under the shard theory alignment picture, versus
inner/outer research under the inner/outer alignment picture, versus...
And insofar as this impression is correct, this is a mistake. There is only one
way alignment is.
If inner/outer is altogether a more faithful picture of those dynamics:
* relatively coherent singular mesa-objectives form in agents, albeit not
necessarily always search-based
* more fragility of value and difficulty in getting the mesa-objective just
right, with little to nothing in terms of "consolation prizes" for slight
mistakes in value loading
* possibly low path dependence on the update process
then we have to solve alignment in that world.
If shard theory is altogether more faithful, then we live under those dynamics:
* agents learn contextual distributions of values around e.g. help people or
acquire coins, some of which cohere and equilibrate into the agent's endorsed
preferences and eventual utility function
* something like value handshakes and inner game theory occurs in AI
* we can focus on getting a range of values endorsed and thereby acquire value
via being "at the bargaining table", via some human-compatible values
representing themselves in the final utility function
* which implies meaningful success and survival from "partial alignment"
And under these dynamics, inner and outer alignment are antinatural hard
problems.
Or maybe neither of these pictures is correct and reasonable, and alignment is
some other way.
But either way, there's one way alignment is. And whatever way that is, it is
against that anvil that we hammer the AI's cognition with loss updates. When
considering a research agenda, you aren't choosing a background set of alignment
dynamics as well.


I plan to mentor several people to work on shard theory and agent foundations
this winter through SERI MATS. Apply here
[https://www.serimats.org/agent-foundations] if you're interested in working
with me and Quintin.

Wish granted!