All of Steven Byrnes's Comments + Replies

Looking Deeper at Deconfusion

Is there any good AI alignment research that you don't classify as deconfusion? If so, can you give some examples?

1Adam Shimi2hSure. * Any proposed solution to AI Alignment isn't deconfusion. It might have a bit a deconfusion at the start, and maybe studying it reveal new confusion to solve, but most of it is problem solving instead of deconfusion. * IDA [] * Debate [] * Recursive Reward Modeling [] * 11 proposals [] * Work in interpretability might involve deconfusion (to clarify what one searches for), but then isn't deconfusion anymore. * Circuits [] * Neural net generalization [] * Knowledge Neurons in Transformers [] * Just like in normal science, once one has defined a paradigm or a problem, working with it is mostly not deconfusion anymore: * John's work on the natural abstraction hypothesis [] * Alex's use of POWER to study instrumental power-seeking [] All in all, I think there are many more examples. It's just that deconfusion almost always plays a part, because we don't have one unified paradigm or approach which does the deconfusion for us. But actual problem solving and most part of normal science, are not deconfusion by my perspective.
The Credit Assignment Problem

a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile

I stand as a counterexample: I personally want my epistemic layer to have accurate beliefs—y'know, having read the sequences… :-P

I think of my epistemic system like I think of my pocket calculator: a tool I use to better achieve my goals. The tool doesn't need to share my goals.

The way I think about it is:

  • Early in training, the AGI is too stupid to formulate and execute a plan to hack into its epistemic level.
  • Late in training, we can hopefully get to the place
... (read more)
Big picture of phasic dopamine

I'm proposing that (1) the hypothalamus has an input slot for "flinch now", (2) VTA has an output signal for "should have flinched", (3) there is a bundle of partially-redundant side-by-side loops (see the "probability distribution" comment) that connect specifically to both (1) and (2), by a genetically-hardcoded mechanism.

I take your comment to be saying: Wouldn't it be hard for the brain to orchestrate such a specific pair of connections across a considerable distance?

Well, I'm very much not an expert on how the brain wires itself up. But I think there'... (read more)

The reverse Goodhart problem

Let me try to repair Goodhart's law to avoid these problems:

By statistics, we should very generally expect two random variables to be uncorrelated unless there's a "good reason" to expect them to be correlated. Goodhart's law says that if U and V are correlated in some distribution, then (1) if a powerful optimizer tries to maximize U, then it will by default go far out of the distribution, (2) the mere fact that U and V were correlated in the distribution does not in itself constitute a "good reason" to expect them to be correlated far out of the distribu... (read more)

Big picture of phasic dopamine

Right, so I'm saying that the "supervised learning loops" get highly specific feedback, e.g. "if you get whacked in the head, then you should have flinched a second or two ago", "if a salty taste is in your mouth, then you should have salivated a second or two ago", "if you just started being scared, then you should have been scared a second or two ago", etc. etc. That's the part that I'm saying trains the amygdala and agranular prefrontal cortex.

Then I'm suggesting that the Success-In-Life thing is a 1D reward signal to guide search in a high-dimensional ... (read more)

1Charlie Steiner4dHow does the section of the amygdala that a particular dopamine neuron connects to even get trained to do the right thing in the first place? It seems like there should be enough chance in connections that there's really only this one neuron linking a brainstem's particular output to this specific spot in the amygdala - it doesn't have a whole bundle of different signals available to send to this exact spot. SL in the brain seems tricky because not only does the brainstem have to reinforce behaviors in appropriate contexts, it might have to train certain outputs to correspond to certain behaviors in the first place, all with only one wire to each location! Maybe you could do this with a single signal that means both "imitate the current behavior" and also "learn to do your behavior in this context"? Alternatively we might imagine some separate mechanism for of priming the developing amygdala to start out with a diverse yet sensible array of behavior proposals, and the brainstem could learn what its outputs correspond to and then signal them appropriately.
Big picture of phasic dopamine

That's interesting, thanks!

good/bad/neutral is a thing, but it seems to be defined largely with respect to our expectation of what was going to happen in the situation we were in.

I agree that this is a very important dynamic. But I also feel like, if someone says to me, "I keep a kitten in my basement and torture him every second of every day, but it's no big deal, he must have gotten used to it by now", I mean, I don't think that reasoning is correct, even if I can't quite prove it or put my finger on what's wrong. I guess that's what I was trying to get ... (read more)

Big picture of phasic dopamine


If you Ctrl-F the post you'll find my little paragraph on how my take differs from Marblestone, Wayne, Kording 2016.

I haven't found "meta-RL" to be a helpful way to frame either the bandit thing or the follow-up paper relating it to the brain, more-or-less for reasons here, i.e. that the normal RL / POMDP expectation is that actions have to depend on previous observations—like think of playing an Atari game—and I guess we can call that "learning", but then we have to say that a large fraction of every RL paper ever is actually a meta-RL paper, and m... (read more)

1Michaël Trazzi5dRight I just googled Marblestone and so you're approaching it with the dopamine side and not the acetylcholine. Without debating about words, their neuroscience paper is still at least trying to model the phasic dopamine signal as some RPE & the prefrontal network as an LSTM (IIRC), which is not acetylcholine based. I haven't read in detail this post & the one linked, I'll comment again when I do, thanks!
Big picture of phasic dopamine

The least-complicated case (I think) is: I (tentatively) think that the hippocampus is more-or-less a lookup table with a finite number of discrete thoughts / memories / locations / whatever (the type of content in different in different species), and a "proposal" is just "which of the discrete things should be activated right now". 

A medium-difficulty case is: I think motor cortex stores a bunch of sequences of motor commands which execute different common action sequences. (I'm a believer in the Graziano theory that primary motor cortex, secondary m... (read more)

Dangerous optimisation includes variance minimisation

I agree! I'm 95% sure this is in Superintelligence somewhere, but nice to have a more-easily-linkable version.

My AGI Threat Model: Misaligned Model-Based RL Agent

it's all a big mess

Yup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump, I guess.) In particular, just like prosaic AGI alignment, my starting point is not "Building this kind of AGI is a great idea", but rather "This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I'm correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there's some chance they'll succeed whe... (read more)

An Intuitive Guide to Garrabrant Induction

Sorry if this is a stupid question but wouldn't "LI with no complexity bound on the traders" be trivial? Like, there's a noncomputable trader (brute force proof search + halting oracle) that can just look at any statement and immediately declare whether it's provably false, provably true, or neither. So wouldn't the prices collapse to their asymptotic value after a single step and then nothing else ever happens?

2Vanessa Kosoy10dFirst, "no complexity bounds on the trader" doesn't mean we allow uncomputable traders, we just don't limit their time or other resources (exactly like in Solomonoff induction). Second, even having a trader that knows everything doesn't mean all the prices collapse in a single step. It does mean that the prices will converge to knowing everything with time. GI guarantees no budget-limited trader will make an infinite profit, it doesn't guarantee no trader will make a profit at all (indeed guaranteeing the later is impossible).
My AGI Threat Model: Misaligned Model-Based RL Agent

Hi again, I finally got around to reading those links, thanks!

I think what you're saying (and you can correct me) is: observation-utility agents are safer (or at least less dangerous) than reward-maximizers-learning-the-reward, because the former avoids falling prey to what you called "the easy problem of wireheading".

So then the context was:

First you said, If we do rollouts to decide what to do, then the value function is pointless, assuming we have access to the reward function.

Then I replied, We don't have access to the reward function, because we can't... (read more)

2Abram Demski12dAll sounds perfectly reasonable. I just hope you recognize that it's all a big mess (because it's difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or any other problematic interpretations). As I imagine you're aware, I think we need stuff from my 'learning normativity' agenda to dodge these bullets. In particular, I would hesitate to commit to the idea that rewards are the only type of feedback we submit. FWIW, I'm now thinking of your "value function" as expected utility in Jeffrey-Bolker terms [] . We need not assume a utility function to speak of expected utility. This perspective is nice in that it's a generalization of what RL people mean by "value function" anyway: the value function is exactly the expected utility of the event "I wind up in this specific situation" (at least, it is if value iteration has converged). The Jeffrey-Bolker view just opens up the possibility of explicitly representing the value of more events. So let's see if we can pop up the conversational stack. I guess the larger topic at hand was: how do we define whether a value function is "aligned" (in an inner sense, so, when compared to an outer objective which is being used for training it)? Well, I think it boils down to whether the current value function makes "reliably good predictions" about the values of events. Not just good predictions on average, but predictions which are never catastrophically bad (or at least, catastrophically bad with very low probability, in some appropriate sense). If we think of the true value function as V(x), and our approximation as V(x), we want something like: under some distance metric, if there is a modification of V*(x) with catastrophic downsides, V(x) is closer to V*(x) than that modification. (OK that's a bit lame, but hopefully you get the general direction I'm trying to point in.
Building brain-inspired AGI is infinitely easier than understanding the brain

Thanks! I guess my feeling is that we have a lot of good implementation-level ideas (and keep getting more), and we have a bunch of algorithm ideas, and psychology ideas and introspection and evolution and so on, and we keep piecing all these things together, across all the different levels, into coherent stories, and that's the approach I think will (if continued) lead to AGI.

Like, I am in fact very interested in "methods for fast and approximate Bayesian inference" as being relevant for neuroscience and AGI, but I wasn't really interested in it until I l... (read more)

2Xuan (Tan Zhi Xuan)21dSome recent examples, off the top of my head! * Jain, Y. R., Callaway, F., Griffiths, T. L., Dayan, P., Krueger, P. M., & Lieder, F. (2021). A computational process-tracing method for measuring people’s planning strategies and how they change over time. [] * Dasgupta, I., Schulz, E., Tenenbaum, J. B., & Gershman, S. J. (2020). A theory of learning to infer. Psychological review, 127(3), 412. [] * Harrison, P., Marjieh, R., Adolfi, F., van Rijn, P., Anglada-Tort, M., Tchernichovski, O., ... & Jacoby, N. (2020). Gibbs Sampling with People. Advances in Neural Information Processing Systems, 33. [] I guess this depends on how much you think we can make progress towards AGI by learning what's innate / hardwired / learned at an early age in humans and building that into AI systems, vs. taking more of a "learn everything" approach! I personally think there may still be a lot of interesting human-like thinking and problem solving strategies that we haven't figured out to implement as algorithms yet (e.g. how humans learn to program, and edit + modify programs and libraries to make them better over time), that adult and child studies would be useful in order to characterize what might even be aiming for, even if ultimately the solution is to use some kind of generic learning algorithm to reproduce it. I also think there's this fruitful in-between (1) and (3), which is to ask, "What are the inductive biases that guide human learning?", which I think you can make a lot of headway on without getting to the neural level.
SGD's Bias

That makes sense. Now it's coming back to me: you zoom your microscope into one tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they're carrying the temperature from their last collision. Whereas in uniform temperature, there's "detailed balance" (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution).

Thinking about the diode-resistor thing more, I s... (read more)

SGD's Bias

I think the "drift from high-noise to low-noise" thing is more subtle than you're making it out to be... Or at least, I remain to be convinced. Like, has anyone else made this claim, or is there experimental evidence? 

In the particle diffusion case, you point out correctly that if there's a gradient in D caused by a temperature gradient, it causes a concentration gradient. But I believe that if there's a gradient in D caused by something other than a temperature gradient, then it doesn't cause a concentration gradient. Like, take a room with a big pil... (read more)

3johnswentworth1moI'm still wrapping my head around this myself, so this comment is quite useful. Here's a different way to set up the model, where the phenomenon is more obvious. Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves statek, it's equally likely to transition tok+1ork−1). The rateλkat which the system leaves statekserves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of statekand statek−1satisfy pkλk=pk−1λk−1 ... i.e. the flux from values-k-and-above to values-below-kis equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, ifλk>λk−1, thenp k<pk−1: the system spends more time in lower-diffusion states (locally). Similarly, if the system's state is initially uniformly-distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally). Going back to the continuous case: this suggests that your source vs destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won't go in to the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out. I haven't looked at the math for the diode-resistor system, but if the voltage averages to 0, doesn't that mean that it does spend more time on the lower-noise side? Because presumably it's typically further from zero on the higher-noise side. (More generally, I don't think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent b
Formal Inner Alignment, Prospectus

Wait, you think your prosaic story doesn't involve blind search over a super-broad space of models??

No, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head...

Like, I tend to emphasize the overlap between my brain-like AGI story and prosaic AI. There is plenty of overlap. Like they both involve "neural nets", and (something like) gradient descent, and RL, etc.

By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there ... (read more)

2Abram Demski1moAh, ok. It sounds like I have been systematically mis-perceiving you in this respect. I would have been much more interested in your posts in the past if you had emphasized this aspect more ;p But perhaps you held back on that to avoid contributing to capabilities research. Yeah, this is a very important question!
Formal Inner Alignment, Prospectus

That's fair. Other possible approaches are "try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so", or "intepretability that looks for the AGI imagining dangerous adversarial intelligences".

I guess the fact that people don't tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible - like, that maybe there's a big window where one is smart enough to understand that imagining adversarial intelligence... (read more)

Formal Inner Alignment, Prospectus

Hm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment".

The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that's not any kind of alignment problem, it's a defense-against-adversaries problem.

Then I would take that notion and extend it by saying "yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I'm not convinced that this would re... (read more)

3Abram Demski1moThis part doesn't necessarily make sense, because prevention could be easier than after-the-fact measures. In particular, 1. You might be unable to defend against arbitrarily adversarial cognition, so, you might want to prevent it early rather than try to detect it later, because you may be vulnerable in between. 2. You might be able to detect some sorts of misalignment, but not others. In particular, it might be very difficult to detect purposeful deception, since it intelligently evades whatever measures are in place. So your misalignment-detection may be dependent on averting mesa-optimizers or specific sorts of mesa-optimizers.
Formal Inner Alignment, Prospectus

My hunch is that we don't disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you're misinterpreting me as saying something more interesting than I am.

Formal Inner Alignment, Prospectus

Like, if we do gradient descent, and the training signal is "get a high score in PacMan", then "mesa-optimize for a high score in PacMan" is incentivized by the training signal, and "mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips" is also incentivized by the training signal.

For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better a... (read more)

1Ofer Givoli1moMy surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning-about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn't be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective. Note that the examples in my comment don't rely on deceptive alignment. To "convert" your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is "make the relevant memory location in the RAM say that I won the game", or "win the game in all future episodes".
Formal Inner Alignment, Prospectus

I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of its ingredients. I guess I'm just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs... (Of course I just wind up with a different set of unsolved AGI safety problems instead...)

The Evolutionary Story

By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentiona... (read more)

2Abram Demski1moWait, you think your prosaic story doesn't involve blind search over a super-broad space of models?? I think any prosaic story involves blind search over a super-broad space of models, unless/until the prosaic methodology changes, which I don't particularly expect it to. I agree that replacing "blind search" with different tools is a very important direction. But your proposal doesn't do that! I agree with this general picture. While I'm primarily knocking down bad complexity-based arguments in my post, I would be glad to see someone working on trying to fix them. There were a lot of misunderstandings in the earlier part of our conversation, so, I could well have misinterpreted one of your points. But if so, I'm even more struggling to see why you would have been optimistic that your RL scenario doesn't involve risk due to unintended mesa-optimization. By your own account, the other part would be to argue that they're not simple, which you haven't done. They're not actively disincentivized, because they can use the planning capability to perform well on the task (deceptively). So they can be selected for just as much as other hypotheses, and might be simple enough to be selected in fact.
3Ofer Givoli1moWe can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model with a different objective that is identical to the intended objective just during training (or some part thereof). Why would this be unlikely? The intended objective is not privileged over such other objectives, from the perspective the training process. Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current epoch episode, but the result is an agent with a more general objective that cares about blue doors in future epochs episodes as well. In Evan's words [] (from the Future of Life podcast): Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute). Also, a malign prior problem may manifest in (self-)supervised learning settings [] . (Maybe you consider this to be a special case of (2).)
2Abram Demski1moI have not properly read all of that yet, but my very quick take is that your argument for a need for online learning strikes me as similar to your argument against the classic inner alignment problem applying to the architectures you are interested in. You find what I call mesa-learning implausible for the same reasons you find mesa-optimization implausible. Personally, I've come around to the position (seemingly held pretty strongly by other folks, eg Rohin) that mesa-learning is practically inevitable for most tasks [].
My AGI Threat Model: Misaligned Model-Based RL Agent

So maybe you mean that the ideal value function would be precisely the sum of rewards.

Yes, thanks, that's what I should have said.

In the rollout architecture you describe, there wouldn't really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function).

For "access to the reward function", we need to predict what the reward function will do (which may involve hard-to-predict things like "the human will be pleased with what I've done"). I guess your suggestion would be to call t... (read more)

3Abram Demski1moAh, that wasn't quite my intention, but I take it as an acceptable interpretation. My true intention was that the "reward function calculator" should indeed be directly accessible rather than indirectly learned via reward-function-model. I consider this normative (not predictive) due to the considerations about observation-utility agents discussed in Robust Delegation [] (and more formally in Daniel Dewey's paper []). Learning the reward function is asking for trouble. Of course, hard-coding the reward function is also asking for trouble, so... * shrug*
My AGI Threat Model: Misaligned Model-Based RL Agent

if we're talking about predicting what ML people will do, the sentence "the value function is a function of the latent variables in the world model" makes a lot more sense than the clarification "even abstract concepts are assigned values".

OK sure, that's fair. Point well taken. I was thinking about more brain-like neural nets that parse things into compositional pieces. If I wanted to be more prosaic maybe I would say something like: "She is differentiating both sides of the equation" could have a different value than "She is writing down a bunch of funny symbols", even if both are coming from the exact same camera inputs.

My AGI Threat Model: Misaligned Model-Based RL Agent


> The value function might be different from the reward function.

Surely this isn't relevant! We don't by any means want the value function to equal the reward function. What we want (at least in standard RL) is for the value function to be the solution to the dynamic programming problem set up by the reward function and world model (or, more idealistically, the reward function and the actual world).

Hmm. I guess I have this ambiguous thing where I'm not specifying whether the value function is "valuing" world-states, or actions, or plans, or all ... (read more)

3Abram Demski1moSure, but given most reasonable choices, there will be an analogous variant of my claim, right? IE, for most reasonable model-based RL setups, the type of the reward function will be different from the type of the value function, but there will be a "solution concept" saying what it means for the value function to be correct with respect to a set reward function and world-model. This will be your notion of alignment, not "are the two equal". Well, there's still a type distinction. The reward function gives a value at each time step in the long rollout, while the value function just gives an overall value. So maybe you mean that the ideal value function would be precisely the sum of rewards. But if so, this isn't really what RL people typically call a value function. The point of a value function is to capture the potential future rewards associated with a state. For example, if your reward function is to be high up, then the value of being near the top of a slide is very low (because you'll soon be at the bottom), even if it's still generating high reward (because you're currently high up). So the value of a history (even a long rollout of the future) should incorporate anticipated rewards after the end of the history, not just the value observed within the history itself. In the rollout architecture you describe, there wouldn't really be any point to maintaining a separate value function, since you can just sum the rewards (assuming you have access to the reward function). It doesn't seem to me like there is any "more/less like reward" spectrum here. The value function is just different from the reward function. In an architecture where you have a "value function" which operates like a reward function, I would just call it the "estimated reward function" or something along those lines, because RL people invented the value/reward distinction to point at something important (namely the difference between immediate reward and cumulative expected reward), and I d
Human priors, features and models, languages, and Solmonoff induction

Model splintering happens when someone has updated on enough unusual sightings that it is worth their while to change their "language".

I think of human mental model updates as being overwhelmingly "adding more things" rather than "editing existing things". Like you see a funny video of a fish flopping around, and then a few days later you say "hey, look at the cat, she's flopping around just like that fish video". I'm not sure I'm disagreeing with you here, but your language kinda implies rare dramatic changes, I guess like someone changing religion and having an ontological crisis. That's certainly an important case but much less common.

3Stuart Armstrong1moFor real humans, I think this is a more gradual process - they learn and use some distinctions, and forget others, until their mental models are quite different a few years down the line. The splintering can happen when a single feature splinters; it doesn't have to be dramatic.
Can you get AGI from a Transformer?

I slightly edited that section header to make it clearer what the parenthetical "(matrix multiplications, ReLUs, etc.)" is referring to. Thanks!

I agree that it's hard to make highly-confident categorical statements about all current and future DNN-ish architectures.

I don't think the human planning algorithm is very much like MCTS, although you can learn to do MCTS (just like you can learn to mentally run any other algorithm—people can learn strategies about what thoughts to think, just like they can strategies about what actions to execute). I think the bu... (read more)

Can you get AGI from a Transformer?

Oh OK I think I misunderstood you.

So the context was: I think there's an open question about the extent to which the algorithms underlying human intelligence in particular, and/or AGI more generally, can be built from operations similar to matrix multiplication (and a couple other operations). I'm kinda saying "no, it probably can't" while the scaling-is-all-you-need DNN enthusiasts are kinda saying "yes, it probably can".

Then your response is that humans can't multiply matrices in their heads. Correct? But I don't think that's relevant to this question. L... (read more)

1Matthew "Vaniver" Graves1moAh, I now suspect that I misunderstood you as well earlier: you wanted your list to be an example of "what you mean by DNN-style calculations" but I maybe interpreted as "a list of things that are hard to do with DNNs". And under that reading, it seemed unfair because the difficulty that even high-quality DNNs have in doing simple arithmetic is mirrored by the difficulty that humans have in doing simple arithmetic. Similarly, I agree with you that there are lots of things that seem very inefficient to implement via DNNs rather than directly (like MCTS, or simple arithmetic, or so on), but it wouldn't surprise me if it's not that difficult to have a DNN-ish architecture that can more easily implement MCTS than our current ones. The sorts of computations that you can implement with transformers are more complicated than the ones you could implement with convnets, which are more complicated than the ones you could implement with fully connected nets; obviously you can't gradient descent a fully connected net into a convnet, or a convnet into a transformer, but you can still train a transformer with gradient descent. It's also not obvious to me that humans are doing the more sophisticated thinking 'the smart way' instead of 'the dumb way'. Suppose our planning algorithms are something like MCTS; is it 'coded in directly' like AlphaGo's, or is it more like a massive transformer that gradient-descented its way into doing something like MCTS? Well, for things like arithmetic and propositional logic, it seems pretty clearly done 'the dumb way', for things like planning and causal identification it feels more like an open question, and so I don't want to confidently assert that our brains are doing it the dumb way. My best guess is they have some good tricks, but won't be 'optimal' according to future engineers who understand all of this stuff.
Draft report on existential risk from power-seeking AI

I really like the report, although maybe I'm not a neutral judge, since I was already inclined to agree with pretty much everything you wrote. :-P

My own little AGI doom scenario is very much in the same mold, just more specific on the technical side. And much less careful and thorough all around. :)

Draft report on existential risk from power-seeking AI

For benefits of generality (, an argument I find compelling is that if you're trying to invent a new invention or design a new system, you need a cross-domain system-level understanding of what you're trying to do and how. Like at my last job, it was not at all unusual for me to find myself sketching out the algorithms on a project and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks... (read more)

Three reasons to expect long AI timelines

Thanks for the nice post! Here's why I disagree :)

Technological deployment lag

Normal technologies require (1) people who know how to use the technology, and (2) people who decide to use the technology. If we're thinking about a "real-deal AGI" that can do pretty much every aspect of a human job but better and cheaper, then (1) isn't an issue because the AGI can jump into existing human roles. It would be less like "technology deployment" and more like a highly-educated exquisitely-skilled immigrant arriving into a labor market. Such a person would have no ... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

Strong agree. This is another way that it's a hard problem.

Against evolution as an analogy for how humans will create AGI

Thanks for cross-posting this! Sorry I didn't get around to responding originally. :-)

E.g. the thing RL currently does, which I don't expect the inner algorithm to be able to do, is make the first three layers of the network vision layers, and then a big region over on the other side the language submodule, and so on. And eventually I expect RL to shape the way the inner algorithm does weight updates, via meta-learning.

For what it's worth, I figure that the neocortex has some number (dozens to hundreds, maybe 180 like your link says, I dunno) of subregions... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

Hmm, I dunno, I haven't thought it through very carefully. But I guess an AGI might require a supercomputer of resources and maybe there are only so many hackable supercomputers of the right type, and the AI only knows one exploit and leaves traces of its hacking that computer security people can follow, and meanwhile self-improvement is hard and slow (for example, in the first version you need to train for two straight years, and in the second self-improved version you "only" need to re-train for 18 months). If the AI can run on a botnet then there are mo... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

Oh sorry, I misread what you wrote. Sure, maybe, I dunno. I just edited the article to say "some number of years".

I never meant to make a claim "20 years is definitely in the realm of possibility" but rather to make a claim "even if it takes 20 years, that's still not necessarily enough to declare that we're all good".

2Daniel Kokotajlo3moAh, OK. We are on the same page then.
My AGI Threat Model: Misaligned Model-Based RL Agent


For homogeneity, I guess I was mainly thinking that in the era of not-knowing-how-to-align-an-AGI, people would tend to try lots of different new things, because nothing so far has worked. I agree that once there's an aligned AGI, it's likely to get copied, and if new better AGIs are trained, people may be inclined to try to keep the procedure as close as possible to what's worked before.

I hadn't thought about whether different AGIs with different goals are likely to compromise vs fight. There's Wei Dai's argument that compromise is very easy with A... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

I haven't thought very much about takeoff speeds (if that wasn't obvious!). But I don't think it's true that nobody thinks it will take more than a decade... Like, I don't think Paul Christiano is the #1 slowest of all slow-takeoff advocates. Isn't Robin Hanson slower? I forget.

Then a different question is "Regardless of what other people think about takeoff speeds, what's the right answer, or at least what's plausible?" I don't know. A key part is: I'm hazy on when you "start the clock". People were playing with neural networks in the 1990s but we only go... (read more)

3Daniel Kokotajlo3moThanks! Yeah, there are plenty of people who think takeoff will take more than a decade--but I guess I'll just say, I'm pretty sure they are all wrong. :) But we should take care to define what the start point of takeoff is. Traditionally it was something like "When the AI itself is doing most of the AI research," but I'm very willing to consider alternate definitions. I certainly agree it might take more than 10 years if we define things in such a way that takeoff has already begun. Wait, uhoh, I didn't mean "the AI did something by accident" either... can you elaborate? By "accident" I thought you meant something like "Small-scale disasters, betrayals, etc. caused by AI that are shocking enough to count as warning shots / fire alarms to at least some extent."
Against evolution as an analogy for how humans will create AGI
  1. I think evolution is a good analogy for how inner alignment issues can arise.
  2. I don't think evolution is a good analogy for the process by which AGI is made (if you think that the analogy is that we literally use natural selection to improve AI systems).

Yes this post is about the process by which AGI is made, i.e. #2. (See "I want to be specific about what I’m arguing against here."...) I'm not sure what you mean by "literal natural selection", but FWIW I'm lumping together outer-loop optimization algorithms regardless of whether they're evolutionary or gradient descent or downhill-simplex or whatever.

Against evolution as an analogy for how humans will create AGI

Thanks for all those great references!

My current thinking is: (1) Outer-loop meta-learning is slow, (2) Therefore we shouldn't expect to get all that many bits of information out of it, (3) Therefore it's a great way to search for parameter settings in a parameterized family of algorithms, but not a great way to do "the bulk of the real design work", in the sense that programmers can look at the final artifact and say "Man, I have no idea what this algorithm is doing and why it's learning anything at all, let alone why it's learning things very effectively... (read more)

Against evolution as an analogy for how humans will create AGI

Thanks again, this is really helpful.

I don't feel like humans meet this bar.

Hmm, imagine you get a job doing bicycle repair. After a while, you've learned a vocabulary of probably thousands of entities and affordances and interrelationships (the chain, one link on the chain, the way the chain moves, the feel of clicking the chain into place on the gear, what it looks like if a chain is loose, what it feels like to the rider when a chain is loose, if I touch the chain then my finger will be greasy, etc. etc.). All that information is stored in a highly-stru... (read more)

3Rohin Shah3moAll of that sounds reasonable to me. I still don't see why you think editing weights is required, as opposed to something like editing external memory. (Also, maybe we just won't have AGI that learns by reading books, and instead it will be more useful to have a lot of task-specific AI systems with a huge amount of "built-in" knowledge, similarly to GPT-3. I wouldn't put this as my most likely outcome, but it seems quite plausible.)
Against evolution as an analogy for how humans will create AGI


A lot of your comments are trying to relate this to GPT-3, I think. Maybe things will be clearer if I just directly describe how I think about GPT-3.

The evolution analogy (as I'm defining it) says that “The AGI” is identified as the inner algorithm, not the inner and outer algorithm working together. In other words, if I ask the AGI a question, I don’t need the outer algorithm to be running in the course of answering that question. Of course the GPT-3 trained model is already capable of answering "easy" questions, but I'm thinking here about "very h... (read more)

3Rohin Shah3moThanks, this was helpful in understanding in where you're coming from. I don't feel like humans meet this bar. Maybe mathematicians, and even then, I probably still wouldn't agree. Especially not humans without external memory (e.g. paper). But presumably such humans still count as generally intelligent. Seems reasonable. I think this makes sense in the context of humans but not in the context of AI (if you say weights = synapses). It seems totally plausible to give AI systems an external memory that they can read to / write from, and then you learn linear algebra without editing weights but with editing memory. Alternatively, you could have a recurrent neural net with a really big hidden state, and then that hidden state could be the equivalent of what you're calling "synapses". This feels analogous to "the AGI doesn't go and run on its own, it operates by changing values in RAM according to the assembly language interpreter hardwired into the CPU chip". Like, it's true, but it seems like it's operating at the wrong level of abstraction. Once you've reached the point of creating schools and courses, and using spaced repetition and practice exercises, you probably don't want to be thinking in terms of "this is all stuff that's been done by the synapse-editing algorithm hardwired into the genome", you've shifted to a qualitatively new kind of learning. ---- It seems like a central crux here is: Is it possible to build a reasonably efficient AGI that doesn't autonomously edit its weights after training? (By AGI here I mean something about as capable as humans on a variety of tasks.) Caveats on my "yes" position: 1. I wouldn't be that surprised if in practice it turns out that continually editing the weights even at deployment time is the most efficient thing to do, but I would be surprised if the difference is many orders of magnitude. 2. I do expect that we will continue to update AGI systems via editing weights in training loops, even after
Against evolution as an analogy for how humans will create AGI

Hmm, if you don't know which bits are the learning algorithm and which are the learned content, and they're freely intermingling, then I guess you could try randomizing different subsets of the bits in your algorithm, and see what happens, or something, and try to figure it out. This seems like a computationally-intensive and error-prone process, to me, although I suppose it's hard to know. Also, which is which could be dynamic, and there could be bits that are not cleanly in either category. If you get it wrong, then you're going to wind up updating the k... (read more)

Against evolution as an analogy for how humans will create AGI


ML is different from other domains in that it already relies on incredibly massive automated search, with massive changes in the quality of our inner algorithms despite very little change in our outer algorithms.

Yeah, sure, maybe. Outside views only go so far :-)

I concede that even if an evolution-like approach was objectively the best way to build wing-flapping robots, probably those roboticists would not think to actually do that, whereas it probably would occur to ML researchers.

(For what it's worth—and I don't think you were disagreeing with thi... (read more)

Against evolution as an analogy for how humans will create AGI

Good question! 

A kinda generic answer is: (1) Transformers were an advance over previous learning algorithms, and by the same token I expect that yet-to-be-invented learning algorithms will be an advance over Transformers; (2) Sample-efficient learning is AFAICT a hot area that lots of people are working on; (3) We do in fact actually have impressively sample-efficient algorithms even if they're not as well-developed and scalable as others at the moment—see my discussion of analysis-by-synthesis; (4) Given that predictive learning offers tons of data,... (read more)

2Daniel Kokotajlo3moTo make sure I understand: you are saying (a) that our AIs are fairly likely to get significantly more sample-efficient in the near future, and (b) even if they don't, there's plenty of data around. I think (b) isn't a good response if you think that transformative AI will probably need to be human brain sized and you believe the scaling laws and you think that short-horizon training won't be enough. (Because then we'll need something like 10^30+ FLOP to train TAI, which is plausibly reachable in 20 years but probably not in 10. That said, I think short-horizon training might be enough. I think (a) is a good response, but it faces the objection: Why now? Why should we expect sample-efficiency to get dramatically better in the near future, when it has gotten only very slowly better in the past? (Has it? I'm guessing so, maybe I'm wrong?)
Four Motivations for Learning Normativity

To be meaningful, this requires whole-process feedback: we need to judge thoughts by their entire chain of origination. (This is technically challenging, because the easiest way to implement process-level feedback is to create a separate meta-level which oversees the rest of the system; but then this meta-level would not itself be subject to oversight.)

I thought you were going to say it's technically challenging because you need transparency / intepretability ... At least in human cognition (and logical induction too right?) thoughts-about-stuff and tho... (read more)

5Abram Demski3moWell, transparency is definitely a challenge. I'm mostly saying this is a technical challenge even if you have magical transparency tools, and I'm kind of trying to design the system you would want to use if you had magical transparency tools. But I don't think it's difficult for the reason you say. I don't think multi-level feedback or whole-process feedback should be construed as requiring the levels to be sorted out nicely. Whole-process feedback in particular just means that you can give feedback on the whole chain of computation; it's basically against sorting into levels. Multi-level feedback means, to me, that if we have an insight about, EG, how to think about value uncertainty (which is something like a 3rd-level thought: 1st level is information about object level; 2nd level is information about the value function; 3rd level is information about how to learn the value function), we can give the system feedback about that. So the system doesn't need to sort things out into levels; it just needs to be capable of accepting feedback of each type.
Book review: "A Thousand Brains" by Jeff Hawkins


I'm fine with you redirecting to a previous post, but I would have appreciated at least a one sentence-summary and opinion.

My opinion is: I think if you want to figure out the gory details of the neocortical algorithm, and you want to pick ten authors to read, then Jeff Hawkins should be one of them. If you're only going to pick one author, I'd go with Dileep George.

I'm happy to chat more offline.

what is the argument for the neocortex learning algorithm being human-legible?

Well there's an inside-view argument that it's human-legible because "It basi... (read more)

Book review: "A Thousand Brains" by Jeff Hawkins

A couple years ago I spent a month or two being enamored with the idea of tool AI via self-supervised learning (which is basically what you're talking about, i.e. the neocortex without a reward channel), and I wrote a few posts like In Defense of Oracle ("Tool") AI Research and Self-Supervised Learning and AGI Safety. I dunno, maybe it's still the right idea. But at least I can explain why I personally grew less enthusiastic about it.

One thing was, I came to believe (for reasons here, and of course I also have to cite this) that it doesn't buy the safety g... (read more)

Book review: "A Thousand Brains" by Jeff Hawkins

Oh I'm very open-minded. I was writing that section for an audience of non-AGI-safety-experts and didn't want to make things over-complicated by working through the full range of possible solutions to the problem, I just wanted to say enough to convince readers that there is a problem here, and it's not trivial.

The Judge box (usually I call it "steering subsystem") can be anything. There could even be a tower of AGIs steering AGIs, IDA-style, but I don't know the details, like what you would put at the base of the tower. I haven't really thought about it. ... (read more)

adamShimi's Shortform

Not Adam, but

  1. Maybe there's a sense in which everyone has already implicitly declared that they don't want to give feedback, because they could have if they wanted to, so it feels like more of an imposition.

  2. Maybe it feels like "I want feedback for my own personal benefit" when it's already posted, as opposed to "I want feedback to improve this document which I will share with the community" when it's not yet posted. So it feels more selfish, instead of part of a community project. For that problem, maybe you'd want to frame it as "I'm planning to rewr

... (read more)
2Adam Shimi4moMy main reason is steve's first point: Asking someone for feedback on work posted somewhere I know they read feels like I'm whining about not having feedback (and maybe whining about them not giving me feedback). On the other hand, sending a link to a gdoc feels like "I thought that could interest you", which seems better to me. There's also the issue that when the work is public, you don't know if someone has read it and not found it interesting enough to comment, not read it but planned to do it later, read it and planned to comment later. Depending on which case they are in, me asking for feedback can trigger even more problems (like them being annoyed because they don't feel I let them the time to do it by themselves). Whereas when I share a doc, there's only one state of knowledged for the other (not having read the doc and not knowing it exists). Concerning steve's second point: I don't feel that personally. I basically take a stance of trying to do things I feel are important for the community, so if I publish something, I don't feel like feedback is for my own benefit. Indeed, I would gladly have only constructive negative feedback for my posts instead of no feedback at all; this is pretty bad personnally (in terms of ego for example) but great for the community because it put my ideas to the test and forces me to improve them. Now I want to go back to Raemon. Agreed. My diagnostic of the situation is that to ensure consistent feedback, it probably need to be at least slightly an obligation. The two examples of process producing valuable feedback that I have in mind are gdocs comments and peer-review for conferences/journals. In both cases, the reviewer has an obligation to do the review (social obligation for the gdoc, because it was shared explicity to you, and community obligation for the peer-review, because that's a part of your job and the conference/journal editor asked you to review the paper). Without this element of obligation, it's far to ea
Bootstrapped Alignment

Reminds me of a quote from this Paul Christiano post: "It's a solution built to last (at most) until all contemporary thinking about AI has been thoroughly obsoleted...I don’t think there is a strong case for thinking much further ahead than that."

1avturchin4moYes, it also reminded me Christiano approach of amplification and distillation.
Load More