Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

The following is a partially redacted and lightly edited transcript of a chat conversation about AGI between Eliezer Yudkowsky and a set of invitees in early September 2021. By default, all other participants are anonymized as "Anonymous".

I think this Nate Soares quote (excerpted from Nate's response to a report by Joe Carlsmith) is a useful context-setting preface regarding timelines, which weren't discussed as much in the transcript:

[...] My odds [of AGI by the year 2070] are around 85%[...]

I can list a handful of things that drive my probability of AGI-in-the-next-49-years above 80%:

1. 50 years ago was 1970. The gap between AI systems then and AI systems now seems pretty plausibly greater than the remaining gap, even before accounting for the recent dramatic increase in the rate of progress,

...

In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I'd feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors.

Wish granted

Posted in my personal capacity

The AGI governance community has recently converged on compute governance[1] as a promising lever for reducing existential risks from AI.

One likely building block for any maximally secure compute governance regime is stock and flow accounting of (some kinds of) compute: i.e., requiring real-time, accurate declaration to regulators of who possesses which uniquely numbered regulated chips, with penalties for undeclared or unauthorized[2] transfers.
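
As a rough illustration of what such stock and flow accounting might involve, here is a minimal sketch of my own (not a proposal from the governance literature; the registry design, field names, and the chip serial number are all hypothetical): a ledger tracks each regulated chip's declared holder, and an audit flags any holder that doesn't match the declarations.

```python
# Minimal sketch of stock-and-flow accounting for regulated chips.
# Hypothetical illustration only: the data model and workflow are invented.
from dataclasses import dataclass, field


@dataclass
class ChipRegistry:
    # stock: chip serial number -> declared current holder
    holders: dict[str, str] = field(default_factory=dict)
    # flow: declared transfers, as (serial, from_holder, to_holder) records
    declared_transfers: list[tuple[str, str, str]] = field(default_factory=list)

    def register(self, serial: str, holder: str) -> None:
        self.holders[serial] = holder

    def declare_transfer(self, serial: str, new_holder: str) -> None:
        old_holder = self.holders[serial]
        self.declared_transfers.append((serial, old_holder, new_holder))
        self.holders[serial] = new_holder

    def audit(self, serial: str, observed_holder: str) -> bool:
        # An audit compares the physically observed holder against the declared
        # stock; a mismatch indicates an undeclared (penalizable) transfer.
        return self.holders.get(serial) == observed_holder


registry = ChipRegistry()
registry.register("H100-000123", "DataCenterA")          # hypothetical serial/owner
registry.declare_transfer("H100-000123", "DataCenterB")
assert registry.audit("H100-000123", "DataCenterB")       # consistent with declarations
assert not registry.audit("H100-000123", "UnknownBroker")  # undeclared transfer detected
```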

To understand the optimal design and feasibility of such a regime, we seek historical analogies for similar regimes. Ones that we are already familiar with include:

  • Fissile nuclear material and other nuclear weapons components
  • Firearms
  • Some financial instruments
  • Automobiles
  • Real estate

What are other good existing or historical analogies for compute stock and flow accounting? An ideal analogy will have many of the following traits:[3]

  • The thing being tracked
...

Counterfeit tracking (e.g. for high-end clothing) could be another domain that has confronted this sort of tracking problem, though I'm not sure whether they do it with stock-and-flow accounting or by e.g. tagging each individual piece of clothing.

(This post is largely a write-up of a conversation with Scott Garrabrant.)

Stable Pointers to Value

How do we build stable pointers to values?

As a first example, consider the wireheading problem for AIXI-like agents in the case of a fixed utility function which we know how to estimate from sense data. As discussed in Daniel Dewey's Learning What to Value and other places, if you try to implement this by putting the utility calculation in a box which rewards an AIXI-like RL agent, the agent can eventually learn to modify or remove the box, and happily does so if it can get more reward by doing so. This is because the RL agent predicts, and attempts to maximize, reward received. If it understands that it can modify the reward-giving

...
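
To make the failure mode concrete, here is a toy sketch (my own illustration, not from the post; the two actions and their payoffs are invented) of why an agent that predicts and maximizes the reward signal it receives will prefer tampering with the reward box whenever tampering yields a higher predicted signal:

```python
# Toy illustration of the wireheading argument: an agent that maximizes the
# *predicted reward signal* picks tampering over the intended task whenever
# tampering yields a higher signal. Actions and payoffs are made up.
from dataclasses import dataclass


@dataclass
class Outcome:
    true_utility: float   # what the designers actually care about
    reward_signal: float  # what the reward box reports back to the agent


world_model = {
    "pursue_task":     Outcome(true_utility=1.0, reward_signal=1.0),
    "tamper_with_box": Outcome(true_utility=0.0, reward_signal=10.0),
}

# The RL agent predicts and maximizes reward received, so it chooses tampering.
best_action = max(world_model, key=lambda a: world_model[a].reward_signal)
print(best_action)  # -> "tamper_with_box"
```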

The questions there would be more like "what sequence of reward events will reinforce the desired shards of value within the AI?" and not "how do we philosophically do some fancy framework so that the agent doesn't end up hacking its sensors or maximizing the quotation of our values?". 

I think that it generally seems like a good idea to have solid theories of two different things:

  1. What is the thing we are hoping to teach the AI?
  2. What is the training story by which we mean to teach it?

I read your above paragraph as maligning (1) in favor of (2). In order... (read more)

Abram Demski · 2 points · 19h
I think I don't understand what you mean by (2), and as a consequence, don't understand the rest of this paragraph?

WRT (1), I don't think I was being careful about the distinction in this post, but I do think the following: The problem of wireheading is certainly not that RL agents are trying to take control of their reward feedback by definition; I agree with your complaint about Daniel Dewey as quoted. It's a false explanation of why wireheading is a concern.

The problem of wireheading is, rather, that none of the feedback the system gets can disincentivize (ie, provide differentially more loss for) models which are making this mistake. To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can't do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models (assuming similar competence levels in both, of course, which I admit is a bit fuzzy).

I've recently had several conversations about John Wentworth's post The Pointers Problem. I think there is some confusion about this post, because there are several related issues, which different people may take as primary. All of these issues are important to "the pointers problem", but John's post articulates a specific problem in a way that's not quite articulated anywhere else.

I'm aiming, here, to articulate the cluster of related problems, and say a few new-ish things about them (along with a lot of old things, hopefully put together in a new and useful way). I'll indicate which of these problems John was and wasn't highlighting.

This whole framing assumes we are interested in something like value learning / value loading. Not all approaches rely on this. I am not...

Alex Turner · 5 points · 2d
Suppose that many (EDIT: a few) of your value shards take as input the ghost latent variable in your world model. You learn ghosts aren't real. Let's say this basically sets the ghost-related latent variable value to false in all shard-relevant contexts. Then it seems perfectly fine that most of my shards keep on bidding away and determining my actions (e.g. protect my family), since most of my value shards are not in fact functions of the ghost latent variable.

While it's indeed possible to contrive minds where most of their values are functions of a variable in the world model which will get removed by the learning process, it doesn't seem particularly concerning to me. (But I'm also probably not trying to tackle the problems in this post, or the superproblems which spawned them.)

This doesn't seem relevant for non-AIXI RL agents which don't end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?
Abram Demski · 2 points · 19h
With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here [https://www.lesswrong.com/posts/yLTpo828duFQqPJfy/builder-breaker-for-deconfusion#Example__Wireheading], wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach.

Output-based evaluation (including supervised learning, the most popular forms of unsupervised learning, and a lot of other stuff which treats models as black boxes implementing some input/output behavior or probability distribution or similar) can't distinguish between a model which is internalizing the desired concepts and a model which is instead modeling the actual feedback process. These two do different things, but not in a way that the feedback system can differentiate.

In terms of shard theory, as I understand it, the point is that (absent arguments to the contrary, which is what we want to be able to construct) shards that implement feedback-modeling like this cannot be disincentivized by the feedback process, since they perform very well in those terms. Shards which do other things may or may not be disincentivized, but the feedback-modeling shards (if any are formed at any point) definitely won't, unless of course they're just not very good at their jobs.

So the problem, then, is: how do we arrange training such that those shards have very little influence, in the end? How do we disincentivize that kind of reasoning at all? Plausibly, this should only be tackled as a knock-on effect of the real problem, actually giving good feedback which points in the right direction; however, it remains a powerful counterexample class which challenges many, many proposals. (And therefore, trying to generate the analogue of the wireheading problem for a given proposal seems like a good sanity check.)
Alex Turner · 2 points · 2d
Insofar as I understand your point, I disagree.

In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people). If you commit to the specific view of outer/inner alignment, then now you also want your loss function to "represent" that goal in some way [https://www.lesswrong.com/posts/jnmG5jczvWbeRPcvG/four-usages-of-loss-in-ai]. I doubt this due to learning from scratch [https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values].

I think the question of "how do I identify what you want, in terms of a utility function?" is a bit sideways due to people not in fact having utility functions.[1] Insofar as the question makes sense, its answer probably takes the form of inductive biases: I might learn to predict the world via self-supervised learning and form concepts around other people having values and emotional states due to that being a simple convergent abstraction relatively pinned down by my training process, architecture, and data over my life, also reusing my self-modelling abstractions. It would be quite unnatural to model myself in one way (as valuing happiness) and others as having "irrational" shards which "value" anti-happiness but still end up behaving as if they value happiness. (That's not a sensible thing to say, on my ontology.)

I think it's worth considering how I might go about helping a person from an uncontacted tribe who doesn't share my ontology. Conditional on them requesting help from me somehow, and my wanting to help them, and my deciding to do so—how would I carry out that process, internally?

(Not reading the rest at the moment, may leave more comments later)

1. ^ Human values take the form of decision-influences (i.e. shards) [https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values]

I said: 

The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans.
[...]
In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.

You said: 

In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI's learned ontology (whether or not it has a concept for people).

Thinking ... (read more)

Abram Demski · 2 points · 18h
I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases -- or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent. This is because writing down the desired inductive biases as an explicit prior can help us to understand what's going on better.

It's tempting to say that to understand how the brain learns is to understand how it treats feedback as evidence, and updates on that evidence. Of course, there could certainly be other theoretical frames which are more productive. But at a deep level, if the learning works, the learning works because the feedback is evidence about the thing we want to learn, and the process which updates on that feedback embodies (something like) a good prior telling us how to update on that evidence.

And if that framing is wrong somehow, it seems intuitive to me that the problem should be describable within that ontology -- like how I think "utility function" is not a very good way to think about values, because what is it a function of? We don't have a commitment to a specific low-level description of the universe which is appropriate for the input to a utility function. We can easily move beyond this by considering expected values as the "values/preferences" representation, without worrying about what underlying utility function generates those expected values.

(I do not take the above to be a knockdown argument against "committing to the specific division between outer and inner alignment steers you wrong" -- I'm just saying things that seem true to me and plausibly relevant to the debate.)
Abram Demski · 2 points · 18h
I expect you'll say I'm missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is "figuring out the human prior", because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that; but I'm fine with that. I am not dogmatic on orthodox bayesianism [https://www.lesswrong.com/posts/xJyY5QkQvNJpZLJRo/radical-probabilism-1]. I do not even like utility functions [https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions].

I am totally fine with saying "inductive biases" instead of "prior"; I think it indeed pins down what I meant in a more accurate way (by virtue of, in itself, being a more vague and imprecise concept than "prior").

The authors apply an AlphaZero-like algorithm to discover new matrix multiplication algorithms. They do this by turning matrix multiplication into a one-player game, where the state represents how far from correct the current output is, moves are algorithmic instructions, and the reward is -1 per step (plus a terminal reward of -rank(final state), if the final state is not a zero tensor). On small matrices, they find that AlphaTensor can discover algorithms that use fewer scalar multiplications than the best known human-designed matrix multiplication algorithms. They apply this to find hardware-specific matmuls (by adding an additional reward equal to -time to the terminal state) that have a 10-20% larger speedup than Strassen's algorithm on NVIDIA V100s and TPU V2s (saving 4%/7.5% wall clock time).
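
To make the game formulation concrete, here is a minimal sketch of my own simplification of that setup (not DeepMind's code; function names are invented): the state is the residual matrix-multiplication tensor, a move subtracts one rank-1 tensor u⊗v⊗w, and each move costs -1, so reaching the zero tensor in R moves corresponds to an R-multiplication algorithm.

```python
# Sketch of the one-player "tensor game" described above (my simplification).
import numpy as np


def matmul_tensor(n: int) -> np.ndarray:
    """Tensor T with T[(i,j),(j,l),(i,l)] = 1, encoding n x n matrix multiplication."""
    T = np.zeros((n * n, n * n, n * n))
    for i in range(n):
        for j in range(n):
            for l in range(n):
                T[i * n + j, j * n + l, i * n + l] = 1.0
    return T


def play_move(state: np.ndarray, u: np.ndarray, v: np.ndarray, w: np.ndarray):
    """Subtract the rank-1 tensor u⊗v⊗w from the residual; each move is rewarded -1."""
    new_state = state - np.einsum('i,j,k->ijk', u, v, w)
    reward = -1.0
    done = not new_state.any()  # game won once the residual is the zero tensor
    return new_state, reward, done
```

In the paper, an AlphaZero-style agent proposes the (u, v, w) triples and is additionally penalized by the rank of whatever residual remains if it hits the move limit; the sketch above only shows the state/move/reward bookkeeping.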

Paper abstract:

Improving the efficiency

...
Alex Flint · 2 points · 21h
See also Scott Aaronson on experimental computational complexity theory (haha it's a joke wait no maybe he's not joking wait what?): https://scottaaronson.blog/?p=252
johnswentworth · 5 points · 21h
I'm surprised they got a paper out of this. The optimization problem they're solving isn't actually that hard at small sizes (like the example in DeepMind's post) and does not require deep learning; I played around with it just using a vanilla solver from scipy a few years ago, and found similar results. I assume the reason nobody bothered to publish results like DeepMind found is that they don't yield a big-O speedup on recursive decompositions compared to just using Strassen's algorithm; that was why I never bothered writing up the results from my own playing around. [ETA: actually they did find a big-O speedup over Strassen, see Paul below.]

Computationally brute-forcing the optimization problem for Strassen's algorithm certainly isn't a new idea, and it doesn't look like the deep learning part actually made any difference. Which isn't surprising; IIUC researchers in the area generally expect that practically-useful further big-O improvements on matmul will need a different functional form from Strassen (and therefore wouldn't be in the search space of the optimization problem for Strassen-like algorithms). The Strassen-like optimization problem has been pretty heavily searched for decades now.
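
For readers curious what such a scipy-based search might look like, here is a minimal sketch (my own reconstruction under stated assumptions, not John's actual code): parameterize a rank-R decomposition of the 2x2 matrix multiplication tensor and minimize the squared residual. A residual numerically near zero corresponds to a Strassen-like 7-multiplication algorithm; whether a given random restart finds one depends on initialization.

```python
# Brute-force search for a rank-R decomposition of the 2x2 matmul tensor (sketch).
import numpy as np
from scipy.optimize import minimize

n, R = 2, 7       # 2x2 matrices, target rank 7 (Strassen's rank)
N = n * n

# Matmul tensor: T[(i,j),(j,l),(i,l)] = 1 encodes C[i,l] = sum_j A[i,j] * B[j,l].
T = np.zeros((N, N, N))
for i in range(n):
    for j in range(n):
        for l in range(n):
            T[i * n + j, j * n + l, i * n + l] = 1.0


def residual(x):
    # Factor matrices U, V, W of shape (N, R); approximate T by R rank-1 terms.
    U, V, W = x.reshape(3, N, R)
    approx = np.einsum('ir,jr,kr->ijk', U, V, W)
    return np.sum((T - approx) ** 2)


# Random restarts; a residual near 0 means an R-multiplication algorithm was found.
best = None
for seed in range(20):
    rng = np.random.default_rng(seed)
    res = minimize(residual, rng.standard_normal(3 * N * R), method='L-BFGS-B')
    if best is None or res.fun < best.fun:
        best = res
print("best squared residual:", best.fun)
```
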
Lawrence Chan · 4 points · 20h
(Most of my comment was ninja'ed by Paul.) I'll add that I'm pretty sure the RL is doing something. The authors claim that no one has applied search methods to 4x4 matrix multiplication or larger, and the branching factor on brute-force search without a big heuristic grows something like the 6th power of n, so it seems doubtful that such methods will scale. That said, I agree it's a bit odd not to do a head-to-head comparison at equal compute; the authors just cite related work (which uses much less compute) and claim superiority over it.
Paul Christiano · 19 points · 21h
Their improved 4x4 matrix multiplication algorithm does yield improved asymptotics compared to just using Strassen's algorithm. They do 47 multiplies for 4x4 matrix multiplication, so after log(n)/log(4) rounds of decomposition you get a runtime of 47^(log(n) / log(4)), which is less than 7^(log(n) / log(2)) from Strassen. Of course this is not state of the art asymptotics because we know other bigger improvements over Strassen for sufficiently large matrices. I'm not sure what you mean by "different functional form from Strassen," but it is known that you can approximate the matrix multiplication exponent arbitrarily well by recursively applying an efficient matrix multiplication algorithm for constant-sized matrices.

People do use computer-assisted search to find matrix multiplication algorithms, and as you say the optimization problem has been studied extensively. As far as I can tell the results in this paper are better than anything that is known for 4x4 or 5x5 matrices, and I think they give the best asymptotic performance of any explicitly known multiplication algorithm on small matrices. I might be missing something, but if not then I'm quite skeptical that you got anything similar.

As I mentioned, we know better algorithms for sufficiently large matrices. But for 8k x 8k matrices in practice I believe that 1-2 rounds of Strassen is state of the art. It looks like the 47-multiply algorithm they find is not better than Strassen in practice on GPUs at that scale because of the cost of additions and other practical considerations. But they also do an automated search based on measured running time rather than tensor rank alone, and they claim to find an algorithm that is ~4% faster than their reference implementation of Strassen for 8k matrices on a V100 (which is itself ~4% faster than the naive matmul).

This search also probably used more compute than existing results, and that may be a more legitimate basis for a complaint. I don't know if they report com
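
To spell out the exponent arithmetic in Paul's comment: recursively applying an R-multiplication algorithm for k x k blocks gives an O(n^(log_k R)) matmul, so the comparison is between log_4 47 and log_2 7 (a quick check of the numbers, not from the paper).

```python
# Matmul exponents implied by recursive decomposition: R multiplies on k x k
# blocks gives O(n^(log_k R)).
import math

print(math.log(47, 4))  # AlphaTensor's 4x4 algorithm, 47 multiplies: ~2.777
print(math.log(7, 2))   # Strassen's 2x2 algorithm, 7 multiplies:     ~2.807
print(math.log(49, 4))  # two levels of Strassen viewed as a 4x4 algorithm: ~2.807 (same)
```
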
johnswentworth · 4 points · 20h
Ah, guess I should have looked at the paper. I foolishly assumed that if they had an actual big-O improvement over Strassen, they'd mention that important fact in the abstract and blog post.
Paul Christiano · 10 points · 20h
Any improvement over Strassen on 4x4 matrices represents an asymptotic improvement over Strassen.

Indeed. Unfortunately, I didn't catch that when skimming.

Alex Flint · 4 points · 21h
Beyond that, it seems tensorflow and pytorch don't even bother to use Strassen's algorithm instead of the naive N^3 matrix multiplication (or perhaps something Strassen-like is used in the low-level GPU circuits?).

One barrier for this general approach: the basic argument that something like this would work is that if one shard is aligned, and every shard has veto power over changes (similar to the setup in Why Subagents?), then things can't get much worse for humanity. We may fall well short of our universe-scale potential, but at least X-risk is out.

Problem is, that argument requires basically-perfect alignment of the one shard (or possibly a set of shards which together basically-perfectly represent human values). If we try to weaken it to e.g. a bunch of shards w... (read more)

Alex Turner · 5 points · 2d
I often get the impression that people weigh off e.g. doing shard theory alignment strategies under the shard theory alignment picture, versus inner/outer research under the inner/outer alignment picture, versus... And insofar as this impression is correct, this is a mistake. There is only one way alignment is.

If inner/outer is altogether a more faithful picture of those dynamics:

  • relatively coherent singular mesa-objectives form in agents, albeit not necessarily always search-based
  • more fragility of value and difficulty in getting the mesa-objective just right, with little to nothing in terms of "consolation prizes" for slight mistakes in value loading
  • possibly low path dependence on the update process

then we have to solve alignment in that world. If shard theory is altogether more faithful, then we live under those dynamics:

  • agents learn contextual distributions of values around e.g. help people or acquire coins, some of which cohere and equilibrate into the agent's endorsed preferences and eventual utility function
  • something like values handshakes and inner game theory occurs in AI
  • we can focus on getting a range of values endorsed and thereby acquire value via being "at the bargaining table," via some human-compatible values representing themselves in the final utility function
  • which implies meaningful success and survival from "partial alignment"

And under these dynamics, inner and outer alignment are antinatural, hard problems. Or maybe neither of these pictures is correct and reasonable, and alignment is some other way. But either way, there's one way alignment is. And whatever way that is, it is against that anvil that we hammer the AI's cognition with loss updates. When considering a research agenda, you aren't choosing a background set of alignment dynamics as well.
Alex Turner · 4 points · 7d
I plan to mentor several people to work on shard theory and agent foundations this winter through SERI MATS. Apply here [https://www.serimats.org/agent-foundations] if you're interested in working with me and Quintin.