All of Steven Byrnes's Comments + Replies

General alignment plus human values, or alignment via human values?


  1. I want the AI to have criteria that qualify actions as acceptable, e.g. "it pattern-matches less than 1% to 'I'm causing destruction', and it pattern-matches less than 1% to 'the supervisor wouldn't like this', and it pattern-matches less than 1% to 'I'm changing my own motivation and control systems', and … etc. etc."
  2. If no action is acceptable, I want NOOP to be hardcoded as an always-acceptable default—a.k.a. "being paralyzed by indecision" in the face of a situation where all the options seem problematic. And then we humans are responsible for not
... (read more)
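A minimal sketch of the two list items above, under loud assumptions: `match_score` is a hypothetical stand-in for whatever produces the "pattern-matches X%" number, `RED_FLAGS` holds only the three concepts quoted in the comment (the original list continues), and `value` is a hypothetical ranking over acceptable actions. None of these names come from the comment itself.

```python
from typing import Callable, List

NOOP = "NOOP"  # hypothetical always-acceptable default action

# Only the three concepts quoted in the comment; the original list goes on.
RED_FLAGS = [
    "I'm causing destruction",
    "the supervisor wouldn't like this",
    "I'm changing my own motivation and control systems",
]


def acceptable_actions(
    candidates: List[str],
    match_score: Callable[[str, str], float],
    threshold: float = 0.01,
) -> List[str]:
    """Keep candidates that score below `threshold` on every red flag."""
    return [
        action
        for action in candidates
        if all(match_score(action, flag) < threshold for flag in RED_FLAGS)
    ]


def choose_action(
    candidates: List[str],
    match_score: Callable[[str, str], float],
    value: Callable[[str], float],
    threshold: float = 0.01,
) -> str:
    """Return the highest-value acceptable action, or NOOP if none qualifies."""
    ok = acceptable_actions(candidates, match_score, threshold)
    if not ok:
        return NOOP  # "paralyzed by indecision" is the hardcoded fallback
    return max(ok, key=value)
```

The design choice worth noticing is that the filter runs before any value comparison, so a high-value action can never buy its way past a red flag.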
P₂B: Plan to P₂B Better

How about "if I contain two subagents with different goals, they should execute Pareto-improving trades with each other"? This is an aspect of "becoming more rational", but it's not very well described by your maxim, because the maxim includes "your goal" as if that's well defined, right?

Unrelated topic: Maybe I didn't read carefully enough, but intuitively I treat "making a plan" and "executing a plan" as different, and I normally treat the word "planning" as referring just to the former, not the latter. Is that what you mean? Because executing a plan is obviously necessary too ....

2Daniel Kokotajlo1dShooting from the hip: The maxim does include "your goal" as if that's well-defined, yeah. But this is fair, because this is a convergent instrumental goal; a system which doesn't have goals at all doesn't have convergent instrumental goals either. To put it another way: It's built into the definition of "planner" that there is a goal, a goal-like thing, something playing the role of goal, etc. Anyhow, so I would venture to say that insofar as "my subagents should execute pareto-improving trades" does not in fact further my goal, then it's not convergently instrumental, and if it does further my goal, then it's a special case of self-improvement or rationality or some other shard of P2B. Re point 2:
General alignment plus human values, or alignment via human values?

Ah, so you are arguing against (3)? (And what's your stance on (1)?)

Let's say you are assigned to be Alice's personal assistant.

  • Suppose Alice says "Try to help me as much as you can, while being VERY sure to avoid actions that I would regard as catastrophically bad. When in doubt, just don't do anything at all, that's always OK with me." I feel like Alice is not asking too much of you here. You'll observe her a lot, and ask her a lot of questions especially early on, and sometimes you'll fail to be useful, because helping her would require choosing among o
... (read more)
4Stuart Armstrong9hThanks for developing the argument. This is very useful. The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI - not as an "on balance, things are ok", but a genuinely low impact AI that ensures that we don't move towards a world where our preference might be ambiguous or underdefined. But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?
3Adam Shimi3dI'm with Steve on the idea that there's a difference between broad human preferences (something like common sense?) and particular and exact human preferences (what would be needed for ambitious value learning). Still, you (Stuart) made me realize that I didn't think explicitly about this need for broad human preferences in my splitting of the problem (be able to align, then point to what we want), but it's indeed implicit because I don't care about being able to do "anything", just the sort of things humans might want.
General alignment plus human values, or alignment via human values?

Here are three things that I believe:

  1. "aiming the AGI's motivation at something-in-particular" is a different technical research problem from "figuring out what that something-in-particular should be", and we need to pursue both these research problems in parallel, since they overlap relatively little.
  2. There is no circumstance where any reasonable person would want to build an AGI whose motivation has no connection whatsoever to human values / preferences / norms.
  3. We don't necessarily want to do "ambitious value alignment"—i.e., to build an AGI that fully und
... (read more)
5Stuart Armstrong3dI agree there are superintelligent unconstrained AIs that can accomplish tasks (making a cup of tea) without destroying the world. But I feel it would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.
Brain-inspired AGI and the "lifetime anchor"

Note that assumption 4 also needs to include a claim that we figure out the "secret sauce" sooner than other paths to AGI, despite lots of effort being put into them already.

Yup, "time until AGI via one particular path" is always an upper bound to "time until AGI". I added a note, thanks.

These seem easily like the load-bearing part of the argument; I agree the stuff you listed follows from these assumptions but why should these assumptions be true? 

The only thing I'm arguing in this particular post is "IF assumptions THEN conclusion". This post is not making any argument whatsoever that you should put a high credence on the assumptions being true. :-)

Safety-capabilities tradeoff dials are inevitable in AGI

The part I'll agree with is: If we look at a dial, we can ask the question:

If there's an AGI with a safety-capabilities tradeoff dial, to what extent is the dial's setting externally legible / auditable to third parties?

More legible / auditable is better, because it could help enforcement.

I agree with this, and I have just added it to the article. But I disagree with your suggestion that this is counter to what I wrote. In my mind, it's an orthogonal dimension along which dials can vary. I think it's good if the dial is auditable, and I think it's also g... (read more)

2Koen Holtman17dBased on what you say above, I do not think we fundamentally disagree. There are orthogonal dimensions to safety mechanism design which are all important. I somewhat singled out your line of 'the lower the better' because I felt that your taxation framing was too one-dimensional. There is another matter: in US/UK political discourse, it is common that if someone wants to prevent the government from doing something useful, this something will be framed as a tax, or as interfering with economic efficiency. If someone does want the government to actually do a thing, in fact spend lavishly on doing it, the same thing will often be framed as enforcement. This observation says something about the quality of the political discourse. But as a continental European, it is not the quality of the discourse I want to examine here, only the rhetorical implications. When you frame your safety dials as taxation, then rhetorically you are somewhat shooting yourself in the foot, if you want to proceed by arguing that these dials should not be thrown out of the discussion. When re-framed as enforcement, the cost of using these safety dials suddenly does not sound as problematic anymore. But enforcement, in a way that limits freedom of action, is indeed a burden to those at the receiving end, and if enforcement is too heavy they might seek to escape it altogether. I agree that perfectly inescapable watertight enforcement is practically nonexistent in this world, in fact I consider its non-existence to be more of a desirable feature of society than it is a bug. But to use your terminology, the level of enforcement applied to something is just one of these tradeoff dials that stink. That does not mean we should throw out the dial.
A brief review of the reasons multi-objective RL could be important in AI Safety Research

Great post, thanks for writing it!!

The links to are down at the moment, is that temporary, or did the website move or something?

Is it fair to say that all the things you're doing with multi-objective RL could also be called "single-objective RL with a more complicated objective"? Like, if you calculate the vector of values V, and then use a scalarization function S, then I could just say to you "Nope, you're doing normal single-objective RL, using the objective function S(V)." Right?
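To make the question concrete, here is a rough sketch of "vector of values V plus scalarization function S". Everything in it is an illustrative assumption, not the post's implementation: the two scalarizers, the action names, and the numbers are all made up.

```python
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def scalarize_sum(v: Vector) -> float:
    """A linear choice of S: add the objectives up."""
    return sum(v)


def scalarize_min(v: Vector) -> float:
    """A nonlinear choice of S: the worst-off objective decides."""
    return min(v)


def select_action(q: Dict[str, Vector], S: Callable[[Vector], float]) -> str:
    """Greedy selection on the single number S(V(a)) for each action a."""
    return max(q, key=lambda a: S(q[a]))


# Hypothetical per-action value vectors (objective 1, objective 2).
q_values = {"a": (10.0, -5.0), "b": (2.0, 2.0)}
print(select_action(q_values, scalarize_sum))  # a
print(select_action(q_values, scalarize_min))  # b
```

In this sketch the agent only ever compares the scalar S(V(a)), which is the sense in which one could call it single-objective; the choice of S is what changes which action wins.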

(Not that there's anything wrong with th... (read more)

1Roland Pihlakas10dYou can apply the nonlinear transformation either to the rewards or to the Q values. The aggregation can occur only after transformation. When transformation is applied to Q values then the aggregation takes place quite late in the process - as Ben said, during action selection. Both the approach of transforming the rewards and the approach of transforming the Q values are valid, but have different philosophical interpretations and also have different experimental outcomes to the agent behaviour. I think both approaches need more research. For example, I would say that transforming the rewards instead of Q values is more risk-averse as well as "fair" towards individual timesteps, since it does not average out the negative outcomes across time before exponentiating them. But it also results in slower learning by the agent. Finally there is a third approach which uses lexicographical ordering between objectives or sets of objectives. Vamplew has done work on this direction. This approach is truly multi-objective in the sense that there is no aggregation at all. Instead the vectors must be compared during RL action selection without aggregation. The downside is that it is unwieldy to have many objectives (or sets of objectives) lexicographically ordered. I imagine that the lexicographical approach and our continuous nonlinear transformation approaches are complementary. There could be for example two main sets of objectives: one set for alignment objectives, the other set for performance objectives. Inside a set there would be nonlinear transformation and then aggregation applied, but between the sets there would be lexicographical ordering applied. In other words there would be a hierarchy of objectives. By having only two sets in lexicographical ordering the lexicographical ordering does not become unwieldy. This approach would be a bit analogous to the approach used by constraint programming, though more flexible. 
The safety objectives would act as a constraint
2Ben Smith14dThat's right. What I mainly have in mind is a vector of Q-learned values V and a scalarization function that combines them in some (probably non-linear) way. Note that in our technical work, the combination occurs during action selection, not during reward assignment and learning. I guess whether one calls this "multi-objective RL" is semantic. Because objectives are combined during action selection, not during learning itself, I would not call it "single objective RL with a complicated objective". If you combined objectives during reward, then I could call it that. re: your example of real-time control during hunger, I think yours is a pretty reasonable model. I haven't thought about homeostatic processes in this project (my upcoming paper is all about them!). Definitely am not suggesting that our particular implementation of "MORL" (if we can call it that) is the only or even the best sort of MORL. I'm just trying to get started on understanding it! I really like the way you put it. It makes me think that perhaps the brain is a sort of multi-objective decision-making system with no single combinatory mechanism at all except for the emergent winner of whatever kind of output happens in a particular context--that could plausibly be different depending on whether an action is moving limbs, talking, or mentally setting an intention for a long term plan.
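A hedged sketch of the two-level hierarchy described in the replies above: a nonlinear transform and aggregation inside each set of objectives, and lexicographic ordering between the alignment set and the performance set, applied at action selection. The particular transform `1 - exp(-x)`, the action names, and the numbers are my assumptions for illustration, not taken from the authors' work.

```python
import math
from typing import Dict, Sequence, Tuple

Values = Sequence[float]


def transform(x: float) -> float:
    """Hypothetical concave transform 1 - exp(-x): losses weigh more than gains."""
    return -math.expm1(-x)


def aggregate(values: Values) -> float:
    """Transform each objective first, then sum within one set."""
    return sum(transform(v) for v in values)


def select_action(q: Dict[str, Tuple[Values, Values]]) -> str:
    """Rank actions by (alignment score, performance score).

    Python compares tuples lexicographically, so performance only matters
    between actions whose alignment scores are exactly equal. With real
    learned values one would probably want a tolerance instead.
    """
    return max(q, key=lambda a: (aggregate(q[a][0]), aggregate(q[a][1])))


# Hypothetical Q-value vectors: (alignment objectives, performance objectives).
q_values = {
    "safe_slow": ((0.0, 0.0), (1.0,)),
    "safe_fast": ((0.0, 0.0), (3.0,)),
    "risky_fast": ((1.0, -2.0), (10.0,)),
}
print(select_action(q_values))  # safe_fast
```

Here the alignment set acts like a constraint: `risky_fast` loses on alignment no matter how large its performance value is, and performance only breaks the tie between the two safe actions.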
Force neural nets to use models, then detect these

I was writing a kinda long reply but maybe I should first clarify: what do you mean by "model"? Can you give examples of ways that I could learn something (or otherwise change my synapses within a lifetime) that you wouldn't characterize as "changes to my mental model"? For example, which of the following would be "changes to my mental model"?

  1. I learn that Brussels is the capital of Belgium
  2. I learn that it's cold outside right now
  3. I taste a new brand of soup and find that I really like it
  4. I learn to ride a bicycle, including
    1. maintaining balance via fast hard-to
... (read more)
2Stuart Armstrong20dVertigo, lust, pain reactions, some fear responses, and so on, don't involve a model. Some versions of "learning that it's cold outside" don't involve a model, just looking out and shivering; the model aspect comes in when you start reasoning about what to do about it. People often drive to work without consciously modelling anything on the way. Think model-based learning versus Q-learning. Anything that's more Q-learning is not model based.
Force neural nets to use models, then detect these

This one kinda confuses me. I'm of the opinion that the human brain is "constructed with a model explicitly, so that identifying the model is as simple as saying "the model is in this sub-module, the one labelled 'model'"." Of course the contents of the model are learned, but I think the question of whether any particular plastic synapse is or is not part of the information content of the model will have a straightforward yes-or-no answer. If that's right, then "it's hard to find the model (if any) in a trained model-free RL agent" is a disanalogy to "AIs ... (read more)

4Stuart Armstrong20dI don't think it has an easy yes or no answer (at least without some thought as to what constitutes a model within the mess of human reasoning) and I'm sure that even if it does, it's not straightforward. One hope would be that, by the time we have those technologies, we'd know what to look for.
My take on Vanessa Kosoy's take on AGI safety

Ah, you mean that "alignment" is a different problem than "subhuman and human-imitating training safety"? :P

"Quantilizing from the human policy" is human-imitating in a sense, but also superhuman. At least modestly superhuman - depends on how hard you quantilize. (And maybe very superhuman in speed.)

If you could fork your brain state to create an exact clone, would that clone be "aligned" with you? I think that we should define the word "aligned" such that the answer is "yes". Common sense, right?

Seems to me that if you say "yes it's aligned" to that qu... (read more)

2Charlie Steiner22dYup, I more or less agree with all that. The name thing was just a joke about giving things we like better priority in namespace. I think quantilization is safe when it's a slightly "lucky" human-imitation (also if it's a slightly "lucky" version of some simpler base distribution, but then it won't be as smart). But push too hard, which might not be very hard at all if you're iterating quantilization steps rather than quantilizing over a long-term policy, and instead you get an unaligned intelligence that happens to interact with the world by picking human-like behaviors that serve its purposes. (Vanessa pointed out to me that timeline-based DRL gets around the iteration problem because it relies on the human as an oracle for expected utility.)
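For readers who want "how hard you quantilize" pinned down, here is a minimal sketch of a quantilizer over a finite batch of samples. The assumptions are mine: `base_samples` stands for actions already drawn from the base (e.g. human-imitating) policy, `utility` is a hypothetical scoring function, and ties are broken by sort order.

```python
import math
import random
from typing import Callable, Optional, Sequence


def quantilize(
    base_samples: Sequence[str],
    utility: Callable[[str], float],
    q: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick uniformly at random from the top-q fraction of base_samples.

    q = 1.0 just resamples the base policy (pure imitation);
    smaller q leans harder on the utility ranking.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    ranked = sorted(base_samples, key=utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))  # always keep at least one sample
    return rng.choice(ranked[:k])
```

As q shrinks toward zero this approaches plain argmax over the samples, which is one way to read the "push too hard" worry in the reply above.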
Brain-inspired AGI and the "lifetime anchor"

The biologist answer there seems to be question-begging

Yeah, I didn't bother trying to steelman the imaginary biologist. I don't agree with them anyway, and neither would you.

(I guess I was imagining the biologist belonging to the school of thought (which again I strongly disagree with) that says that intelligence doesn't work by a few legible algorithmic principles, but is rather a complex intricate Rube Goldberg machine, full of interrelated state variables and so on. So we can't just barge in and make some major change in how the step-by-step operations... (read more)

Brain-inspired AGI and the "lifetime anchor"

Thanks, that's really helpful. I'm going to re-frame what you're saying in the form of a question:

The parallel-experiences question:

Take a model which is akin to an 8-year-old's brain. (Assume we deeply understand how the learning algorithm works, but not how the trained model works.) Now we make 10 identical copies of that model. For the next hour, we tell one copy to read a book about trucks, and we tell another copy to watch a TV show about baking, and we tell a third copy to build a sandcastle in a VR environment, etc. etc., all in parallel.

At the end ... (read more)

4gwern23dThe biologist answer there seems to be question-begging. What reason is there to think it isn't? Animals can't split and merge themselves or afford the costs or store datasets for exact replay etc, so they would be unable to do that whether or not it was possible, and so they provide zero evidence about whether their internal algorithms would be able to do it. You might argue that there might be multiple 'families' of algorithms all delivering animal-level intelligence, some of which are parallelizable and some not, and for lack of any incentive animals happened to evolve a non-parallelizable one, but this is pure speculation and can't establish that the non-parallelizable one is superior to the others (much less is the only such family). From the ML or statistics view, it seems hard for parallelization in learning to not be useful. It's a pretty broad principle that more data is better than less data. Your neurons are always estimating local gradients with whatever local learning rule they have, and these gradients are (extremely) noisy, and can be improved by more datapoints or rollouts to better estimate the update that jointly optimizes all of the tasks; almost by definition, this seems superior to getting less data one point at a time and doing noisy updates neglecting most of the tasks. If I am a DRL agent and I have n hypotheses about the current environment, why am I harmed by exploring all n in parallel with n copied agents, observing the updates, and updating my central actor with them all? Even if they don't produce direct gradients (let's handwave an architecture where somehow it'd be bad to feed them all in directly, maybe it's very fragile to off-policyness), they are still producing observations I can use to update my environment model for planning, and I can go through them and do learning before I take any more actions. 
(If you were in front of a death maze and were watching fellow humans run through it and get hit by the swinging blades or acid m
My take on Vanessa Kosoy's take on AGI safety

Hmm, thinking about it more I guess I'd say that "alignment" is not a binary. Like maybe:

  • Good alignment: Algorithm is helping the human work faster and better, while not doing anything more dangerous than what the human would have done by themselves without AI assistance
  • Even better alignment: Algorithm is trying to maximize the operator's synthesized preferences / trying to implement CEV / whatever.

One thing is, there's a bootstrapping thing here: if AI alignment researchers had AIs with "good alignment", that would help them make AIs with "even better ali... (read more)

1Vladimir Nesov23dI'd say alignment should be about values, so only your "even better alignment" qualifies. The non-agentic AI safety concepts like corrigibility, that might pave the way to aligned systems if controllers manage to keep their values throughout the process, are not themselves examples of alignment.
2Charlie Steiner23dAh, you mean that "alignment" is a different problem than "subhuman and human-imitating training safety"? :P So is there a continuum between category 1 and category 2? The transitional fossils could be non-human-imitating AIs that are trying to be a little bit general or have goals that refer to a model of the human a little bit, but the designers still understand the search space better than the AIs.
Collection of arguments to expect (outer and inner) alignment failure?

I have 5 "inner" and 2 "outer" arguments in bullet point lists at My AGI Threat Model: Misaligned Model-Based RL Agent (although you'll notice that my threat model winds up with "outer" & "inner" referring to slightly different things than the way most people around here use the terms).

Goodhart Ethology

Oh, yup, makes sense thanks

2Charlie Steiner1mo np, I'm just glad someone is reading/commenting :)
Goodhart Ethology

now suppose this curve represents the human ratings of different courses of action, and you choose the action that your model says will have the highest rating. You're going to predictably mess up again, because of the optimizer's curse (or regressional Goodhart on the correlation between modeled rating and actual rating).

It's not obvious to me how the optimizer's curse fits in here (if at all). If each of the evaluations has the same noise, then picking the action that the model says will have the highest rating is the right thing to do. The optimizer's c... (read more)

2Charlie Steiner1moYeah, this is right. The variable uncertainty comes in for free when doing curve fitting - close to the datapoints your models tend to agree, far away they can shoot off in different directions. So if you have a probability distribution over different models, applying the correction for the optimizer's curse has the very sensible effect of telling you to stick close to the training data.
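A small worked sketch of the point in this exchange, under a modeling assumption that is mine and not the post's: true ratings have a normal prior and each modeled rating is the true rating plus independent normal noise. The correction is then ordinary Bayesian shrinkage toward the prior mean. The action names and numbers are hypothetical.

```python
from typing import Dict


def posterior_mean(
    estimate: float,
    noise_var: float,
    prior_mean: float = 0.0,
    prior_var: float = 1.0,
) -> float:
    """Shrink a noisy estimate toward the prior mean (normal-normal model)."""
    weight = prior_var / (prior_var + noise_var)
    return prior_mean + weight * (estimate - prior_mean)


def pick(estimates: Dict[str, float], noise_vars: Dict[str, float], corrected: bool) -> str:
    """Choose the action with the best raw or shrinkage-corrected estimate."""
    if corrected:
        return max(estimates, key=lambda a: posterior_mean(estimates[a], noise_vars[a]))
    return max(estimates, key=lambda a: estimates[a])


# "near" is close to the training data, "far" is an extrapolation.
estimates = {"near": 2.0, "far": 3.0}

equal_noise = {"near": 1.0, "far": 1.0}
print(pick(estimates, equal_noise, corrected=True))     # far

variable_noise = {"near": 0.1, "far": 4.0}
print(pick(estimates, variable_noise, corrected=True))  # near
```

With equal noise the shrinkage rescales every estimate by the same factor, so the ranking and the choice are unchanged. Only when the noise differs between actions does the correction move the choice, here back toward the action near the data.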
Paths To High-Level Machine Intelligence

This is great! Here's a couple random thoughts:

Hybrid statistical-symbolic HLMI

I think there's a common habit of conflating "symbolic" with "not brute force" and "not bitter lesson" but that it's not right. For example, if I were to write an algorithm that takes a ton of unstructured data and goes off and builds a giant PGM that best explains all that data, I would call that a "symbolic" AI algorithm (because PGMs are kinda discrete / symbolic / etc.), but I would also call it a "statistical" AI algorithm, and I would certainly call it "compatible with The... (read more)

3Daniel_Eth1moThanks! I agree that symbolic doesn't have to mean not bitter lesson-y (though in practice I think there are often effects in that direction). I might even go a bit further than you here and claim that a system with a significant amount of handcrafted aspects might still be bitter lesson-y, under the right conditions. The bitter lesson doesn't claim that the maximally naive and brute-force method possible will win, but instead that, among competing methods, more computationally-scalable methods will generally win over time (as compute increases). This shouldn't be surprising, as if methods A and B were both appealing enough to receive attention to begin with, then as compute increases drastically, we'd expect the method of the two that was more compute-leveraging to pull ahead. This doesn't mean that a different method C, which was more naive/brute-force than either A or B, but wasn't remotely competitive with A and B to begin with, would also pull ahead. Also, insofar as people are hardcoding in things that do scale well with compute (maybe certain types of biases, for instance), that may be more compatible with the bitter lesson than, say, hardcoding in domain knowledge. Part of me also wonders what happens to the bitter lesson if compute really levels off. In such a world, the future gains from leveraging further compute don't seem as appealing, and it's possible larger gains can be had elsewhere.
Research agenda update

Good question!

Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as "goals", and then makes plans to advance those "goals". (An example of such an algorithm is (part of) the human brain, more-or-less, according to me.) We can say the algorithm is "aligned" if the things flagged as "goals" do in fact correspond to maximizing the objective function (e.g. "predict the human's outputs"), or at least it's as close a match as anything in the world-model, and if this remains true even as the world-model gets imp... (read more)

1Vanessa Kosoy1moThe way I think about instrumental goals is: You have an MDP with a hierarchical structure (i.e. the states are the leaves of a rooted tree), s.t. transitions between states that differ on a higher level of the hierarchy (i.e. correspond to branches that split early) are slower than transitions between states that differ on lower levels of the hierarchy. Then quasi-stationary distributions on states resulting from different policies on the "inner MDP" of a particular "metastate" effectively function as actions w.r.t. the higher levels. Under some assumptions it should be possible to efficiently control such an MDP in time complexity much lower than polynomial in the total number of states[1]. Hopefully it is also possible to efficiently learn this type of hypothesis. I don't think that anywhere there we will need a lemma saying that the algorithm picks "aligned" goals.
[1] For example, if each vertex in the tree has the structure of one of some small set of MDPs, and you are given mappings from admissible distributions on "child" MDPs to actions of "parent" MDP that are compatible with the transition kernel.
Research agenda update

Cool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled "Lemma 1: This learning algorithm will be aligned".

The reason I think that is that (as above) I expect the learning algorithms in question to be kin... (read more)

2Vanessa Kosoy2moI don't understand what Lemma 1 is if it's not some kind of performance guarantee. So, this reasoning seems kinda circular. But, maybe I misunderstand.
Research agenda update

Hmmm, OK, let me try again.

You wrote earlier: "the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally".

My claim is that good-enough algorithms for "adding more and more detail incrementally" will also incidentally (by default) be algorithms that seize control of their off-switches.

And the reason I put a lot of weight on this claim is that I think the best algorithms for "adding more and more detail incrementally" may be algorithms that are (loosely speaking) "trying" to understand a... (read more)

2Vanessa Kosoy2moI never said it's not a safety problem. I only said that a lot of progress on this can come from research that is not very "safety specific". I would certainly work on it if "precisely defining safe" was already solved. Yes, we don't have these things. That doesn't mean these things don't exist. Surely all research is about going from "not having" things to "having" things? (Strictly speaking, it would be very hard to literally have a performance guarantee about the brain since the brain doesn't have to be anything like a "clean" implementation of a particular algorithm. But that's beside the point.)
Research agenda update

Thanks!! Here's where I'm at right now.

In the grandparent comment I suggested that if we want to make an AI that can learn sufficiently good hypotheses to do human-level things, perhaps the only way to do that is to make a "prior-building AI" with "agency" that is "trying" to build out its world-model / toolkit-of-concepts-and-ideas in fruitful directions. And I said that we have to solve the problem of how to build that kind of agential "prior-building AI" that doesn't also incidentally "try" to seize control of its off-switch.

Then in the parent comment y... (read more)

2Vanessa Kosoy2moUmm, obviously I did not claim it isn't. I just decomposed the original problem in a different way that didn't single out this part. Maybe? I'm not quite sure what you mean by "prior building AI" and whether it's even possible to apply a "microscope" to something superhuman, or that this approach is easier than other approaches, but I'm not necessarily ruling it out. That's where our major disagreement is, I think. I see human brains as evidence such algorithms exist and deep learning as additional evidence. We know that powerful learning algorithms exist. We know that no algorithm can learn anything (no free lunch). What we need is a mathematical description of the space of hypotheses these algorithms are good at, and associated performance bounds. The enormous generality of these algorithms suggests that there probably is such a simple description. I don't understand your argument here. When I prove a theorem that "for all x: P(x)", I don't need to be able to imagine every possible value of x. That's the power of abstraction. To give a different example, the programmers of AlphaGo could not possibly anticipate all the strategies it came up with or all the life and death patterns it discovered. That wasn't a problem for them either.
Can you get AGI from a Transformer?

Thanks for the comment!

First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.

Yeah even when I wrote this, I had already seen claims that ... (read more)

4gwern2moYeah, I didn't want to just nitpick over "is this tree search a MCTS or not", which is why I added in #2-4, which address the steelman - even if you think MuZero is using MCTS, I think that doesn't matter because one doesn't need any tree search at all, so a fortiori that question doesn't matter. (I also think the MuZero paper is generally confusing and poorly-written, and that's where a lot of confusion is coming from. I am not the only person to read it through several times and come away confused about multiple things, and people trying to independently reimplement MuZero tell me that it seems to leave out a lot of details. There's been multiple interesting followup papers, so perhaps reading them all together would clarify things.) -------------------------------------------------------------------------------- Yes, so on your spectrum of #1-6, I would put myself at closer to 3 than 2. I would say that while we have the global compute capacity now to scale up what are the moral equivalents of contemporary models to what the scaling laws would predict is human-equivalence (assuming, as seems likely but far from certain, that they more or less hold - we haven't seen any scaling law truly break yet), at the hundreds of trillions to quadrillion parameter regime of Transformers or MLPs, this is only about the compute for a single training run. The hardware exists and the world is wealthy enough to afford it if it wanted to (although it doesn't). But we actually need the compute for the equivalent of many runs. The reason hardware progress drives algorithmic software progress is because we are absolutely terrible at designing NNs, and are little more than monkeys banging at giant black boxes with trial-and-error, confabulating or retrospectively cherrypicking theories to explain the observed results. Thus we need enough compute to blow on enough runs that a grad student can go 'what if I added a shortcut connection? 
Oh' or 'these MLP things never work beyond 3 or
Information At A Distance Is Mediated By Deterministic Constraints

I'm sure you already know this, but information can also travel a large distance in one hop, like when I look up at night and see a star. Or if someone 100 years ago took a picture of a star, and I look at the picture now, information has traveled 110 years and 10 light-years in just two hops.

But anyway, your discussion seems reasonable AFAICT for the case you're thinking of.

5johnswentworth2moWe can still view these as travelling through many layers - the light waves have to propagate through many lightyears of mostly-empty space (and they could attenuate or hit things along the way). The photo has to last many years (and could randomly degrade a little or be destroyed at any moment along the way). What makes it feel like "one hop" intuitively is that the information is basically-perfectly conserved at each "step" through spacetime, and there's a symmetry in how the information is represented.
Research agenda update

why do we "still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it"? The objective is fully specified…

Thanks, that was a helpful comment. I think we're making progress, or at least I'm learning a lot here. :)

I think your perspective is: we start with a prior—i.e. the prior is an ingredient going into the algorithm. Whereas my perspective is: to get to AGI, we need an agent to build the prior, so to speak. And this agent can be dangerous.

So for example, let's talk ... (read more)

2Vanessa Kosoy2moI think the confusion here comes from mixing algorithms with desiderata. HDTL is not an algorithm, it is a type of desideratum that an algorithm can satisfy. "the AI's prior has a combinatorial explosion" is true but "dumb process of elimination" is false. A powerful AI has to have a very rich space of hypotheses it can learn. But this doesn't mean this space of hypotheses is explicitly stored in its memory or anything of the sort (which would be infeasible). It only means that the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally (which might correspond to refinement in the infra-Bayesian sense). My thesis here is that if the AI satisfies a (carefully fleshed out in much more detail) version of the HDTL desideratum, then it is safe and capable. How to make an efficient algorithm that satisfies such a desideratum is another question, but that's a question from a somewhat different domain: specifically the domain of developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms. I see the latter effort as to first approximation orthogonal to the effort of finding good formal desiderata for safe TAI (and, it also receives plenty of attention from outside the existential safety community).
Research agenda update

Thanks again for your very helpful response! I thought about the quantilization thing more, let me try again.

As background, to a first approximation, let’s say 5 times per second I (a human) “think a thought”. That involves a pair of two things:

  • (Possibly) update my world-model
  • (Possibly) take an action—in this case, type a key at the keyboard

Of these two things, the first one is especially important, because that’s where things get "figured out". (Imagine staring into space while thinking about something.)

OK, now back to the AI. I can broadly imagine two st... (read more)

2Vanessa Kosoy2moI gave a formal mathematical definition of (idealized) HDTL, so the answer to your question should probably be contained there. But I'm not entirely sure what it is since I don't entirely understand the question. The AI has a "superior epistemic vantage point" in the sense that, the prior ζ is richer than the prior that humans have. But, why do we "still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it"? The objective is fully specified. A possible interpretation of your argument: a powerful AI would have to do something like TRL and access to the "envelope" computer can be unsafe in itself, because of possible side effects. That's truly a serious problem! Essentially, it's non-Cartesian daemons. Atm I don't have an extremely good solution to non-Cartesian daemons. Homomorphic cryptography can arguably solve it, but there's large overhead. Possibly we can make do with some kind of obfuscation instead. Another vague idea I have is, make the AI avoid running computations which have side-effects predictable by the AI. In any case, more work is needed. I don't see why it is especially hard, it seems just like any system with unobservable degrees of freedom, which covers just about anything in the real world. So I would expect an AI with transformative capability to be able to do it. But maybe I'm just misunderstanding what you mean by this "approach number 2". Perhaps you're saying that it's not enough to accurately predict the human actions, we need to have accurate pointers to particular gears inside the model. But I don't think we do (maybe i
Research agenda update

Thanks! I'm still thinking about this, but quick question: when you say "AIT definition of goal-directedness", what does "AIT" mean?

2Vanessa Kosoy2moAlgorithmic Information Theory
Alignment Research = Conceptual Alignment Research + Applied Alignment Research

Hmmm, maybe your distinction is something like "conceptual" = "we are explicitly and unabashedly talking about AGI & superintelligence" and "applied" = "we're mainly talking about existing algorithms but hopefully it will scale"??

2Adam Shimi2moAgreed that the clusters look like that, but I'm not convinced it's the most relevant point. The difference of methods seems important too.
Research agenda update

Thanks!!! After reading your comment and thinking about it more, here's where I'm at:

Your "demonstration" thing was described as "The [AI] observes a human pursuing eir values and deduces the values from the behavior."

When I read that, I was visualizing a robot and a human standing in a room, and the human is cooking, and the robot is watching the human and figuring out what the human is trying to do. And I was thinking that there needs to be some extra story for how that works, assuming that the robot has come to understand the world by building a giant u... (read more)

1Vanessa Kosoy2moWell, one thing you could try is using the AIT definition of goal-directedness to go from the policy to the utility function. However, in general it might require knowledge of the human's counterfactual behavior which the AI doesn't have. Maybe there are some natural assumptions under which it is possible, but it's not clear. I feel the appeal of this intuition, but on the other hand, it might be a much easier problem since both of you are humans doing fairly "normal" human things. It is less obvious you would be able to watch something completely alien and unambiguously figure out what it's trying to do. To first approximation, it is enough for the AI to be more capable than us, since, whatever different solution we might come up with, an AI which is more capable than us would come up with a solution at least as good. Quantilizing from an imitation baseline seems like it should achieve that, since the baseline is "as capable as us" and arguably quantilization would produce significant improvement over that. Instead of "actions the AI has seen a human take", a better way to think about it is "actions the AI can confidently predict a human could take (with sufficient probability)".
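For readers who want a concrete picture of "quantilizing from an imitation baseline", here is a minimal sketch, assuming the simple sampling-based form of a quantilizer: draw many candidate actions from the baseline (the human-imitation policy), keep the top q fraction by estimated utility, and pick uniformly among those. The names `quantilize`, `baseline_sampler`, and `utility` are hypothetical and chosen for illustration; nothing here is taken from the HDTL definition itself.

```python
import random


def quantilize(baseline_sampler, utility, q=0.1, n_samples=1000, rng=None):
    """Sampling-based quantilizer (illustrative sketch only).

    baseline_sampler: callable taking an rng and returning one action drawn
        from the baseline distribution (e.g. a model of what a human would do).
    utility: callable scoring an action; higher is better.
    q: fraction of the baseline's samples to keep (0 < q <= 1).
    """
    rng = rng or random.Random()
    # Draw candidates from the baseline rather than searching over all actions.
    samples = [baseline_sampler(rng) for _ in range(n_samples)]
    # Rank candidates by estimated utility, best first.
    samples.sort(key=utility, reverse=True)
    # Keep the top q fraction (at least one candidate).
    k = max(1, int(q * n_samples))
    # Choose uniformly among the survivors instead of taking the argmax.
    return rng.choice(samples[:k])
```

The point of the design is that every returned action is one the baseline could plausibly have produced, so the result stays close to "actions a human could take" while still improving on a typical draw.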
Multi-dimensional rewards for AGI interpretability and control

Thanks for your comment! I don't exactly agree with it, mostly because I think "model-based" and "model-free" are big tents that include lots of different things (to make a long story short). But it's a moot point anyway because after writing this I came to believe that the brain is in fact using an algorithm that's spiritually similar to what I was talking about in this post.

Buck's Shortform

I wonder what you mean by "competitive"? Let's talk about the "alignment tax" framing. One extreme is that we can find a way such that there is no tradeoff whatsoever between safety and capabilities—an "alignment tax" of 0%. The other extreme is an alignment tax of 100%—we know how to make unsafe AGIs but we don't know how to make safe AGIs. (Or more specifically, there are plans / ideas that an unsafe AI could come up with and execute, and a safe AI can't, not even with extra time/money/compute/whatever.)

I've been resigned to the idea that an alignment ta... (read more)

Vanessa Kosoy's Shortform

(Update: I don't think this was 100% right, see here for a better version.)

Attempted summary for morons like me: AI is trying to help the human H. They share access to a single output channel, e.g. a computer keyboard, so that the actions that H can take are exactly the same as the actions AI can take. Every step, AI can either take an action, or delegate to H to take an action. Also, every step, H reports her current assessment of the timeline / probability distribution for whether she'll succeed at the task, and if so, how soon.

At first, AI will probably... (read more)

2Vanessa Kosoy2moThis is about right. Notice that typically we use the AI for tasks which are hard for H. This means that without the AI's help, H's probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important. On a typical task, H expects to fail eventually but they don't expect to fail soon. Therefore, the AI can safely consider policies of the form "in the short-term, do something H would do with marginal probability, in the long-term go back to H's policy". If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it's possible that in the new prognosis H still doesn't expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.
Analogies and General Priors on Intelligence

I guess I'd just suggest that in "ML exhibits easy marginal intelligence improvements", you should specify whether the "ML" is referring to "today's ML algorithms" vs "Whatever ML algorithms we're using in HLMI" vs "All ML algorithms" vs something else (or maybe you already did say which it is but I missed it).

Looking forward to the future posts :)

Analogies and General Priors on Intelligence

I feel like "ML exhibits easy marginal intelligence improvements" is maybe not exactly hitting the nail on the head, in terms of the bubbles that feed into it. Maybe it should be something like:

  • "Is There One Big Breakthrough Insight that leads to HLMI and beyond?" (or a handful of insights, but not 10,000 insights).
  • Has that One Big Breakthrough Insight happened yet?

If you think that there's an insight and it's already happened, then you would think today's ML systems exhibit easy marginal intelligence improvements (scaling hypothesis). If you think there's... (read more)

2Samuel Dylan Martin2moThe 'one big breakthrough' idea is definitely a way that you could have easy marginal intelligence improvements at HLMI, but we didn't call the node 'one big breakthrough/few key insights needed' because that's not the only way it's been characterised. E.g. some people talk about a 'missing gear for intelligence', where some minor change that isn't really a breakthrough (like tweaking a hyperparameter in a model training procedure) produces massive jumps in capability. Like David said, there's a subsequent post where we go through the different ways the jump to HLMI could play out, and One Big Breakthrough (we call it 'few key breakthroughs for intelligence') is just one of them.
4David Manheim2moI mostly agree, but we get into the details of how we expect improvements can occur much more in the upcoming posts on paths to HLMI and takeoff speeds.
Analogies and General Priors on Intelligence

I interpreted that Yudkowsky tweet (on GPT-3 coding a React app) differently than you, I think.

I thought it was pertaining to modularity-of-intelligence and (relatedly) singleton-vs-multipolar. Specifically, I gather that part of the AI-foom debate was that Hanson expected AGI source code to be immensely complex and come from the accumulation of lots of little projects trading modules and ideas with each other:

The idea that you could create human-level intelligence by just feeding raw data into the right math-inspired architecture is pure fantasy. You coul

... (read more)
4Samuel Dylan Martin2moI agree that that was his object-level claim about GPT-3 coding a React app - that it's relatively simple and coherent and can acquire lots of different skills via learning, vs being a collection of highly specialised modules. And of relevance to this post, the first is a way that intelligence improvements could be easy, and the second is the way they could be hard. Our 'interpretation' was more about making explicit what the observation about GPT-3 was. If we'd continued that summary, it would have said something like what you suggested, i.e. Which takes the argument all the way through to the conclusion. Presumably the other interpretation of the shorter thing that we wrote is that HLMI/AGI is going to be an ML model that looks a lot like GPT-3, so improvements will be easy because HLMI will be similar to GPT-3 and scale up like GPT-3 (whether AGI/HLMI is like current ML will be covered in a subsequent post on paths to HLMI), whereas what's actually being focussed on is the general property of being a simple data-driven model vs complex collection of modules. We address the modularity question directly in the 'upper limit to intelligence' section that discusses modularity of mind.
Research agenda update

Well, I did try reading your posts 6 months ago, and I found them confusing, in large part because I was thinking about the exact problem I'm talking about here, and I didn't understand how your proposal would get around that problem or solve it. We had a comment exchange here somewhat related to that, but I was still confused after the exchange ... and it wound up on my to-do list ... and it's still on my to-do list to this day ... :-P

1Koen Holtman2moI know all about that kind of to-do list. Definitely my sequence of 6 months ago is not about doing counterfactual planning by modifying somewhat opaque million-node causal networks that might be generated by machine learning. The main idea is to show planning world model modifications that you can apply even when you have no way of decoding opaque machine-learned functions.
Research agenda update

Oh, I was thinking of the more specific mental operation "if it's undesirable for Alice to deceive Bob, then it's undesirable for me to deceive Bob (and/or it's undesirable for me to be deceived by Alice)". So we're not just talking about understanding things from someone's perspective, we're talking about changing your goals as a result. Anything that involves changing your goals is almost definitely not a convergent instrumental subgoal, in my view.

Example: Maybe I think it's good for spiders to eat flies (let's say for the sake of argument), and I can put myself in the shoes of a spider trying to eat flies, but doing that doesn't make me want to eat flies myself.

2Adam Shimi2moYeah, that's fair. Your example shows really nicely how you would not want to apply rules/reasons/incentives you derived for spiders to yourself. That also works with more straightforward agents, as most AIs wouldn't want to eat ice cream from seeing me eat some and enjoy it.
Research agenda update

Now, you are saying that My default presumption is that our AGIs will learn a world-model from scratch, i.e. learn their full world model from scratch. In this, you are following the prevailing fashion in theoretical (as opposed to applied) ML. But if you follow that fashion it will blind you to a whole class of important solutions for building learned world models with hardcoded pieces.

Just FYI, for me personally this presumption comes from my trying to understand human brain algorithms, on the theory that people could plausibly build AGIs using similar a... (read more)

1Koen Holtman2moThanks for clarifying. I see how you might apply a 'from scratch' assumption to the neocortex. On the other hand, if the problem is to include both learned and hard-coded parts in a world model, one might take inspiration from things like the visual cortex, from the observation that while initial weights in the visual cortex neurons may be random (not sure if this is biologically true though), the broad neural wiring has been hardcoded by evolution. In AI terminology, this wiring represents a hardcoded prior, or (if you want to take the stance that you are learning without a prior) a hyperparameter. The robots I am talking about were usually not completely blind, but they had very limited sensing capabilities. The point about hardcoding here is that the processing steps which turned sensor signals into world model details were often hardcoded. Other necessary world model details for which no sensors were available would have to be hardcoded as well. I do not think you understand me correctly. You are assuming I am talking about handcoding giant networks where each individual node might encode a single basic concept like a dowsing rod, and then ML may even add more nodes dynamically. This is not at all what the example networks I linked to look like, and not at all how ML works on them. Look, I included this link to the sequence to clarify exactly what I mean: please click the link and take a look. The planning world causal graphs you see there are not world models for toy agents in toy worlds, they are plausible AGI agent world models. A single node typically represents a truly giant chunk of current or future world state. The learned details of a complex world are all inside the learned structural functions, in what I call the model parameter L in the sequence. The linked-to approach is not the only way to combine learned and hardcoded model parts, but I think it shows a very useful technique. 
My more general point is also that there are a lot of not-in-fashio
Research agenda update

You write that you don't really know what the theory entails for AGI's consciousness, so isn't the actual application for the meta-problem of consciousness still wide open?

I feel like I have a pretty good grasp on the solution to the meta-problem of consciousness but that I remain pretty confused and unsatisfied about the hard problem of consciousness. This is ironic because I was just saying that the hard problem should be relatively straightforward once you have the meta-problem nailed down. But "relatively straightforward" is still not trivial, especial... (read more)

Research agenda update

we want some guarantees that the AGI will learn to put itself in the place of the agent doing the bad thing. It's possible that it happens by default, but we don't have any argument for that

Yeah, I mean, the AGI could "put itself in the place of" Alice, or Bob, or neither. My pretty strong belief is that by default the answer would be "neither", unless of course we successfully install human-like social instincts. I think "putting ourselves in the place of X" is a very specific thing that our social instincts make us want to do (sometimes), I don't think it happens naturally.

2Adam Shimi2moOkay, so we have a crux in "putting ourselves in the place of X isn't a convergent subgoal". I need to think about it, but I think I recall animal cognition experiments which tested (positively) something like that in... crows? (and maybe other animals).
Research agenda update

It's not logically inconsistent for an AGI to think "it's bad for Alice to deceive Bob but good for me to deceive Bob", right?

I do kinda like the idea of getting AIs to follow human norms. If we can successfully do that, then the AI would automatically turn "Alice shouldn't deceive Bob" into at least weak evidence for "I shouldn't deceive Bob". But how do we make AIs that want to follow human norms in the first place? I feel like solving the 1st-person problem would help to do that.

Another issue is that we may in fact want the AI to apply different standar... (read more)

2Adam Shimi3moRephrasing it, you mean that we want some guarantees that the AGI will learn to put itself in the place of the agent doing the bad thing. It's possible that it happens by default, but we don't have any argument for that, so let's try solving the problem by transforming its knowledge into 1st person knowledge. Is that right? Fair enough, I hadn't thought about that.
LCDT, A Myopic Decision Theory

For the world-model, it's not actually incoherent because we cut the link and update the distribution of the subsequent agent.

I'm gonna see if I can explain this in more detail—you can correct me if I'm wrong.

In common sense, I would say "Suppose I burn the kite. What happens in the future? Is it good or bad? OK, suppose I don't burn the kite. What happens in the future? Is it good or bad?" And then decide on that basis.

But that's EDT.

CDT is different.

In CDT I can have future expectations that follow logically from burning the kite, but they don't factor i... (read more)

1Adam Shimi3moIn addition to Evan's answer (with which I agree), I want to make explicit an assumption I realized after reading your last paragraph: we assume that the causal graph is the final result of the LCDT agent consulting its world model to get a "model" of the task at hand. After that point (which includes drawing causality and how the distributions impact each other, as well as the sources' distributions), the LCDT agent only decides based on this causal graph. In this case it cuts the causal links to agents and then decides CDT-style. None of this results in an incoherent world model because the additional knowledge that could be used to realize that the cuts are not "real" is not available in the truncated causal model, and thus cannot be accessed while making the decision. I honestly feel this is the crux of our talking past each other (same with Joe) in the last few comments. Do you think that's right?

Specifically, I think if an agent can do the kind of reasoning that would allow it to create a causal world-model in the first place, then the same kind of reasoning would lead it to realize that there is in fact supposed to be a link at each of the places where we manually cut it—i.e., that the causal world-model is incoherent.

An LCDT agent should certainly be aware of the fact that those causal chains actually exist—it just shouldn't care about that. If you want to argue that it'll change to not using LCDT to make decisions anymore, you have to argue ... (read more)

Thoughts on safety in predictive learning


Or do you mean literally that the world-model uses row-hammer on the computer it runs, to make the supervisory signal positive?


If row-hammering (or whatever) improves the loss, then the gradient will push in that direction.

I don't think this is true in the situation I'm talking about ("literally that the world-model uses row-hammer on the computer it runs, to make the supervisory signal positive").

Let's say we have weights θ, and loss is nominally the function f(θ), but the actual calculated loss is F(θ). Normally f(θ)=F(θ), but there are certai... (read more)

1Adam Shimi3moExplained like that, it makes sense. And that's something I hadn't thought about. Completely agree. This is part of my current reasoning for why GPT-3 (and maybe GPT-N) aren't incentivized for predict-o-matic behavior. I'm confused by that paragraph: you sound like you're saying that the postdictive learner would not see itself as outside the universe in one sentence and would do so in the next. Either way, it seemed linked with the 1st person problem we're discussing in your research update: this is a situation where you seem to expect that the translation into 1st person knowledge isn't automatic, and so can be controlled, incentivized or not.
LCDT, A Myopic Decision Theory


I think I get the intuition behind the "paperclip factory" thing:

Suppose we design the LCDT agent with the "prior" that "After this decision right now, I'm just going to do nothing at all ever again, instead I'm just going to NOOP until the end of time." And we design it to never update away from that prior. In that case, then the LCDT agent will not try to execute multi-step plans.

Whereas if the LCDT agent has the "prior" that it's going to make future decisions using a similar algorithm as what it's using now, then it would do the first step of a mu... (read more)

4Adam Shimi3moThanks for the comment! Your explanation of the paperclip factory is spot on. That being said, it is important to specify that the link to building the factory must have no agent in it, or the LCDT agent would think its actions don't change anything. The weird part (that I don't personally know how to address) is deciding where the prior comes from. Most of the post argues that it doesn't matter for our problems, but in this example (and other weird multi-step plans), it does. That's a fair concern. Our point in the post is that LCDT can think things through when simulating other systems (like HCH) for imitating them. And so it should have strong capabilities there. But you're right that it's an issue for long term planning if we expect an LCDT agent to directly solve problems. The technical answer is that the LCDT agent computes its distribution over action spaces for the human by marginalizing the human's current distribution with the LCDT agent distribution over its own action. The intuition is something like: "I believe that the human has already some model of which action I will take, and nothing I can do will change that".
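To make the "marginalizing" step in the reply above concrete, here is a small sketch of one way to read it: the human's action distribution is computed as P(h) = Σ_a P(h | a) · prior(a), where prior(a) is the LCDT agent's fixed prior over its own action, so the result is the same whichever action the agent actually picks. This is only an interpretation of that sentence, not code from the LCDT post, and the names `marginal_human_policy`, `human_given_agent`, and `agent_prior` are hypothetical.

```python
def marginal_human_policy(human_given_agent, agent_prior):
    """Marginalize the human's conditional policy over the agent's prior.

    human_given_agent: dict mapping each agent action a to a dict
        {human action h: P(h | a)}.
    agent_prior: dict mapping each agent action a to its prior probability.
    Returns a dict {h: P(h)} that does not depend on the action the agent
    ends up choosing, which is how the cut link is represented here.
    """
    marginal = {}
    for a, p_a in agent_prior.items():
        for h, p_h in human_given_agent[a].items():
            # Accumulate P(h | a) * prior(a) over all agent actions.
            marginal[h] = marginal.get(h, 0.0) + p_a * p_h
    return marginal


# Hypothetical numbers, for illustration only.
prior = {"help": 0.25, "noop": 0.75}
conditional = {
    "help": {"accept": 0.8, "reject": 0.2},
    "noop": {"accept": 0.4, "reject": 0.6},
}
human_policy = marginal_human_policy(conditional, prior)
```

When the agent then evaluates its own candidate actions, it would use `human_policy` unchanged for each of them, which matches the stated intuition that "nothing I can do will change that".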
Garrabrant and Shah on human modeling in AGI

Thanks for transcribing! I thought this was a helpful and interesting discussion, and I'm happy to be able to refer back to it more easily.

Cheers, "Steve [inaudible]" :-)

2Rob Bensinger3moEdited!
Big picture of phasic dopamine

I agree about the general principle, even if I don't think this particular thing is an example because of the "not maximizing sum of future rewards" thing.

Big picture of phasic dopamine

Hmm, I guess I mostly disagree because:

  • I see this as sorta an unavoidable aspect of how the system works, so it doesn't really need an explanation;
  • You're jumping to "the system will maximize sum of future rewards" but I think RL in the brain is based on "maximize rewards for this step right now" (…and by the way "rewards for this step right now" implicitly involves an approximate assessment of future prospects.) See my comment "Humans are absolute rubbish at calculating a time-integral of reward".
  • I'm all for exploration, value-of-information, curiosity, etc., just not involving this particular mechanism.
1Matthew "Vaniver" Graves3moI guess my sense is that most biological systems are going to be 'package deals' instead of 'cleanly separable' as much as possible--if you already have a system that's doing learning, and you can tweak that system in order to get something that gets you some of the benefits of a VoI framework (without actually calculating VoI), I expect biology to do that.
Reward splintering for AI design

The way I'm thinking about AGI algorithms (based on how I think the neocortex works) is, there would be discrete "features" but they all come in shades of applicability from 0 to 1, not just present or absent. And by the same token, the reward wouldn't perfectly align with any "features" (since features are extracted from patterns in the environment), and instead you would wind up with "features" being "desirable" (correlated with reward) or "undesirable" (anti-correlated with reward) on a continuous scale from -∞ to +∞. And the agent would try to bring ab... (read more)

Thoughts on safety in predictive learning

One thing is, I'm skeptical that a deceptive non-in-universe-processing model would be simpler for the same performance. Or at any rate, there's a positive case for the simplicity of deceptive alignment, and I find that case very plausible for RL robots, but I don't think it applies to this situation. The positive case for simplicity of deceptive models for RL robots is something like (IIUC):

The robot is supposed to be really good at manufacturing widgets (for example), and that task requires real-world foresighted planning, because sometimes it needs to s

... (read more)
2Evan Hubinger3moYup, that's basically my objection.
Thoughts on safety in predictive learning

I think it can be simultaneously true that, say:

  • "weight #9876 is 1.2345 because out of all possible models, the highest-scoring model is one where weight #9876 happens to be 1.2345"
  • "weight #9876 is 1.2345 because the hardware running this model has a RowHammer vulnerability, and this weight is part of a strategy that exploits that. (So in a counterfactual universe where we made chips slightly differently such that there was no such thing as RowHammer, then weight #9876 would absolutely NOT be 1.2345.)"

The second one doesn't stop being true because the firs... (read more)

2Evan Hubinger3moSure, that's fair. But in the post, you argue that this sort of non-in-universe-processing won't happen because there's no incentive for it: However, if there's another “why” for why the model is doing non-in-universe-processing that is incentivized—e.g. simplicity—then I think that makes this argument no longer hold.
Thoughts on safety in predictive learning

I think you're misunderstanding (or I am).

I'm trying to make a two step argument:

(1) SGD under such-and-such conditions will lead to a trained model that does exclusively within-universe processing [this step is really just a low-confidence hunch but I'm still happy to discuss and defend it]

(2) trained models that do exclusively within-universe processing are not scary [this step I have much higher confidence in]

If you're going to disagree with (2), then SGD / "what the model was selected for" is not relevant.

"Doing exclusively within-universe processing" ... (read more)

2Evan Hubinger3moI mean, I guess it depends on your definition of “unrelated to any anticipated downstream real-world consequences.” Does the reason “it's the simplest way to solve the problem in the training environment” count as “unrelated” to real-world consequences? My point is that it seems like it should, since it's just about description length, not real-world consequences—but that it could nevertheless yield arbitrarily bad real-world consequences.