Steve Byrnes

I'm an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms. See Email:

Goodhart Ethology

Oh, yup, makes sense thanks

Goodhart Ethology

now suppose this curve represents the human ratings of different courses of action, and you choose the action that your model says will have the highest rating. You're going to predictably mess up again, because of the optimizer's curse (or regressional Goodhart on the correlation between modeled rating and actual rating).

It's not obvious to me how the optimizer's curse fits in here (if at all). If each of the evaluations has the same noise, then picking the action that the model says will have the highest rating is the right thing to do. The optimizer's curse says that the model is likely to overestimate how good this "best" action is, but so what? "Mess up" conventionally means "the AI picked the wrong action", and the optimizer's curse is not related to that (unless there's variable noise across different choices and the AI didn't correct for that). Sorry if I'm misunderstanding.
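A minimal simulation of the point above, as a sketch under stated assumptions: true ratings are drawn from a standard Gaussian, and every action's estimate has independent Gaussian noise with the same standard deviation. The function name `optimizers_curse_demo` and all parameters are illustrative, not from the original post. It tracks two things separately: how much the model overestimates its top pick, and the mean true rating of the actions at each estimated rank.

```python
import random

def optimizers_curse_demo(n_actions=10, noise_sd=1.0, trials=20000, seed=0):
    """Simulate choosing among actions whose ratings are estimated with equal noise.

    Returns (bias_of_top_pick, mean_true_value_by_rank), where rank 0 is the
    action with the highest *estimated* rating.
    Assumption: true ratings ~ N(0, 1); noise ~ N(0, noise_sd), same for all actions.
    """
    rng = random.Random(seed)
    bias_sum = 0.0
    true_sum_by_rank = [0.0] * n_actions
    for _ in range(trials):
        true_vals = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_vals]
        # Indices of actions, sorted from highest to lowest estimated rating.
        ranked = sorted(range(n_actions), key=lambda i: estimates[i], reverse=True)
        top = ranked[0]
        # Optimizer's curse: the estimate of the chosen action is biased upward.
        bias_sum += estimates[top] - true_vals[top]
        for rank, i in enumerate(ranked):
            true_sum_by_rank[rank] += true_vals[i]
    return bias_sum / trials, [s / trials for s in true_sum_by_rank]

bias, mean_true_by_rank = optimizers_curse_demo()
print(f"estimate minus true value for the top pick: {bias:+.2f}")
print("mean true value by estimated rank:", [round(v, 2) for v in mean_true_by_rank])
```

Under these assumptions the top pick is overestimated on average, yet the action with the highest estimate still has the highest mean true rating of any rank. The overestimate and the choice of action are separate questions.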

Paths To High-Level Machine Intelligence

This is great! Here's a couple random thoughts:

Hybrid statistical-symbolic HLMI

I think there's a common habit of conflating "symbolic" with "not brute force" and "not bitter lesson" but that it's not right. For example, if I were to write an algorithm that takes a ton of unstructured data and goes off and builds a giant PGM that best explains all that data, I would call that a "symbolic" AI algorithm (because PGMs are kinda discrete / symbolic / etc.), but I would also call it a "statistical" AI algorithm, and I would certainly call it "compatible with The Bitter Lesson".

(Incidentally, this description is pretty close to my oversimplified caricature description of what the neocortex does.)

(I'm not disputing that "symbolic" and "containing lots of handcrafted domain-specific structure" do often go together in practice today—e.g. Josh Tenenbaum's papers tend to have both and OpenAI papers tend to have neither—I'm just saying they don't necessarily go together.)

I don't have a concrete suggestion of what if anything you should change here, I'm just chatting. :-)

Cognitive-science approach

This is all fine except that I kinda don't like the term "cognitive science" for what you're talking about. Maybe it's just me, but anyway here's where I'm coming from:

Learning algorithms almost inevitably have the property that the trained models are more complex than the learning rules that create them. For example, compare the code to run gradient descent and train a ConvNet (it's not very complicated) to the resulting image-classification algorithms as explored in the OpenAI microscope project (they're much more complicated).

I bring this up because "cognitive science" to me has a connotation of "the study of how human brains do all the things that human brains do", especially adult human brains. After all, that's the main thing that most cognitive scientists study AFAICT. So if you think that human intelligence is "mostly a trained model", then you would think that most cognitive science is "mostly akin to OpenAI microscope" as opposed to "mostly akin to PyTorch", and therefore mostly unnecessary and unhelpful for building HLMI. You don't have to think that (certainly Gary Marcus & Steven Pinker don't), but I do (to a significant extent) and at least a few other prominent neuroscientists do too (e.g. Randall O'Reilly). (See "learning-from-scratch-ism" section here, also cortical uniformity here.) So as much as I buy into (what you call) "the cognitive-science approach", I'm just not crazy about that term, and for my part I prefer to talk about "brain algorithms" or "high-level brain algorithms". I think "brain algorithms" is more agnostic about the nature of the algorithms, and in particular whether they're the kinds of algorithms that neuroscientists talk about, versus the kinds of algorithms that computational cognitive scientists & psychologists talk about.

Research agenda update

Good question!

Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as "goals", and then makes plans to advance those "goals". (An example of such an algorithm is (part of) the human brain, more-or-less, according to me.) We can say the algorithm is "aligned" if the things flagged as "goals" do in fact correspond to maximizing the objective function (e.g. "predict the human's outputs"), or at least it's as close a match as anything in the world-model, and if this remains true even as the world-model gets improved and refined over time.

Making that definition better and rigorous would be tricky because it's hard to talk rigorously about symbol-grounding, but maybe it's not impossible. And if so, I would say that this is a definition of "aligned" which looks nothing like a performance guarantee.

OK, hmmm, after some thought, I guess it's possible that this definition of "aligned" would be equivalent to a performance-centric claim along the lines of "asymptotically, performance goes up not down". But I'm not sure that it's exactly the same. And even if it were mathematically equivalent, we still have the question of what the proof would look like, out of these two possibilities:

  • We prove that the algorithm is aligned (in the above sense) via "direct reasoning about alignment" (i.e. talking about symbol-grounding, goal-stability, etc.), and then a corollary of that proof would be the asymptotic performance guarantee.
  • We prove that the algorithm satisfies the asymptotic performance guarantee via "direct reasoning about performance", and then a corollary of that proof would be that the algorithm is aligned (in the above sense).

I think it would be the first one, not the second. Why? Because it seems to me that the alignment problem is hard, and if it's solvable at all, it would only be solvable with the help of various specific "alignment-promoting algorithm features", and we won't be able to prove that those features work except by "direct reasoning about alignment".

Research agenda update

Cool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled "Lemma 1: This learning algorithm will be aligned".

The reason I think that is that (as above) I expect the learning algorithms in question to be kinda "agential", and if an "agential" algorithm is not "trying" to perform well on the objective, then it probably won't perform well on the objective! :-)

If that view is right, the implication is: the only way to get a performance guarantee is to prove Lemma 1, and if we prove Lemma 1, we no longer care about the performance guarantee anyway, because we've already solved the alignment problem. So the performance guarantee would be beside the point (on this view).

Research agenda update

Hmmm, OK, let me try again.

You wrote earlier: "the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally".

My claim is that good-enough algorithms for "adding more and more detail incrementally" will also incidentally (by default) be algorithms that seize control of their off-switches.

And the reason I put a lot of weight on this claim is that I think the best algorithms for "adding more and more detail incrementally" may be algorithms that are (loosely speaking) "trying" to understand and/or predict things, including via metacognition and instrumental reasoning.

OK, then the way I'm currently imagining you responding to that would be:

My model of Vanessa: We're hopefully gonna find a learning algorithm with a provable regret bound (or something like that). Since seizing control of the off-switch would be very bad according to the objective function and thus violate the regret bound, and since we proved the regret bound, we conclude that the learning algorithm won't seize control of the off-switch.

(If that's not the kind of argument you have in mind, oops sorry!)

Otherwise: I feel like that's akin to putting "the AGI will be safe" as a desideratum, which pushes "solve AGI safety" onto the opposite side of the divide between desiderata vs. learning-algorithm-that-satisfies-the-desiderata. That's perfectly fine, and indeed precisely defining "safe" is very useful. It's only a problem if we also claim that the "find a learning algorithm that satisfies the desiderata" part is not an AGI safety problem. (Also, if we divide the problem this way, then "we can't find a provably-safe AGI design" would be re-cast as "no human-level learning algorithms satisfy the desiderata".)

That's also where I was coming from when I expressed skepticism about "strong formal guarantees". We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge. Again, as above, I was imagining an argument that turns a performance guarantee into a safety guarantee, like "I can prove that AlphaGo plays Go at such-and-such Elo level, and therefore it must not be wireheading, because wireheaders aren't very good at playing Go." If you weren't thinking of performance guarantees, what "formal guarantees" are you thinking of?

(For what little it's worth, I'd be a bit surprised if we get a safety guarantee via a performance guarantee. It strikes me as more promising to reason about safety directly—e.g. "this algorithm won't seize control of the off-switch because blah blah incentives blah blah mesa-optimizers blah blah".)

Sorry if I'm still misunderstanding. :)

Research agenda update

Thanks!! Here's where I'm at right now.

In the grandparent comment I suggested that if we want to make an AI that can learn sufficiently good hypotheses to do human-level things, perhaps the only way to do that is to make a "prior-building AI" with "agency" that is "trying" to build out its world-model / toolkit-of-concepts-and-ideas in fruitful directions. And I said that we have to solve the problem of how to build that kind of agential "prior-building AI" that doesn't also incidentally "try" to seize control of its off-switch.

Then in the parent comment you replied (IIUC) that if this is a problem at all, it's not the problem you're trying to solve (i.e. "finding good formal desiderata for safe TAI"), but a different problem (i.e. "developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms"), and my problem is "to a first approximation orthogonal" to your problem, and my problem "receives plenty of attention from outside the existential safety community".

If so, my responses would be:

  • Obviously the problem of "make an agential "prior-building AI" that doesn't try to seize control of its off-switch" is being worked on almost exclusively by x-risk people.  :-P
  • I suspect that the problem doesn't decompose the way you imply; instead I think that if we develop techniques for building a safe agential "prior-building AI", we would find that similar techniques enable us to build a safe non-manipulative-question-answering AI / oracle AI / helper AI / whatever.
  • Even if that's not true, I would still say that if we can make a safe agential "prior-building AI" that gets to human-level predictive ability and beyond, then we've solved almost the whole TAI safety problem, because we could then run the prior-building AI, then turn it off and use microscope AI to extract a bunch of new-to-humans predictively-useful concepts from the prior it built—including new ideas & concepts that will accelerate AGI safety research.

Or maybe another way of saying it would be: I think I put a lot of weight on the possibility that those "learning algorithms with strong formal guarantees" will turn out not to exist, at least not at human-level capabilities.

I guess, when I read "learning algorithms with strong formal guarantees", I'm imagining something like multi-armed bandit algorithms that have regret bounds. But I'm having trouble imagining how that kind of thing would transfer to a domain where we need the algorithm to discover new concepts and leverage them for making better predictions, and we don't know a priori what the concepts look like, or how many there will be, or how hard they will be to find, or how well they will generalize, etc.
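For concreteness, here is a small sketch of the kind of algorithm alluded to above. It uses UCB1, a standard bandit algorithm whose expected regret is known to grow only logarithmically with the number of pulls. UCB1 is chosen here purely as an illustration, not as something named in the discussion. The arm means, horizon, and function name `ucb1` are assumptions. Note how much is fixed in advance: a known, finite set of arms with stationary reward distributions.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return (pseudo_regret, pull_counts).

    Pseudo-regret sums the gap between the best arm's mean and the mean of
    the arm actually pulled. arm_means is hidden from the decision rule and
    is used only to sample rewards and to score regret.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k    # number of pulls per arm
    totals = [0.0] * k  # summed reward per arm
    best_mean = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            # Empirical mean plus an exploration bonus that shrinks with pulls.
            arm = max(
                range(k),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best_mean - arm_means[arm]
    return regret, counts

regret, counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(f"pseudo-regret after 5000 pulls: {regret:.1f}; pulls per arm: {counts}")
```

The regret bound leans on exactly those fixed ingredients, which is part of why it is hard to see how it carries over to a setting where the relevant concepts are not known a priori.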

Can you get AGI from a Transformer?

Thanks for the comment!

First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.

Yeah even when I wrote this, I had already seen claims that the so-called MCTS is deterministic. But DeepMind and everyone else apparently calls it MCTS, and I figured I should just follow the crowd, and maybe this is just one of those things where terminology drifts in weird directions and one shouldn't think too hard about it, like how we say "my phone is ringing" when it's actually vibrating.  :-P

Looking into it again just now, I'm still not quite sure what's going on. This person says AlphaZero switches from random to non-random after 15 moves. And this person says AlphaZero is totally deterministic but "MCTS" is still the proper term, for reasons that don't make any sense to me. I dunno and I'm open to being educated here. Regardless, if you tell me that I should call it "tree search" instead of "MCTS", I'm inclined to take your word for it. I want to be part of the solution not part of the problem  :-D
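To make the terminology question concrete, here is a toy sketch of the two kinds of leaf evaluation being contrasted: random rollouts to a terminal state (the "Monte Carlo" part) versus a deterministic value function. It is not a description of how AlphaZero or MuZero work. The game (a Nim variant: remove 1 to 3 stones, whoever takes the last stone wins) and the function names are assumptions made for illustration, and `value_function` is a hand-written stand-in for a learned value network.

```python
import random

def rollout_value(stones, rng, n_rollouts=200):
    """Monte Carlo leaf evaluation: play uniformly random moves to the end of
    the game and return the fraction of playouts won by the player to move.
    Requires stones >= 1."""
    wins = 0
    for _ in range(n_rollouts):
        remaining, our_turn = stones, True
        while True:
            remaining -= rng.randint(1, min(3, remaining))
            if remaining == 0:
                wins += our_turn  # whoever takes the last stone wins
                break
            our_turn = not our_turn
    return wins / n_rollouts

def value_function(stones):
    """Deterministic leaf evaluation (hypothetical stand-in for a learned
    value network): in this game the player to move loses under perfect
    play exactly when the stone count is a multiple of 4."""
    return 0.0 if stones % 4 == 0 else 1.0

state = 10
# The rollout estimate depends on the random seed...
print(rollout_value(state, random.Random(0)), rollout_value(state, random.Random(1)))
# ...while the value function returns the same answer every time.
print(value_function(state), value_function(state))
```

A tree search can plug either evaluator into its leaves. On the strict reading quoted above, only the first one would be "Monte Carlo".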

NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion.

That's an interesting example. I think I need to tone down my claim a bit (and edit the post). Thank you. I will now say exactly what I'm making of that example:

Here is a spectrum of things that one might believe, from most-scaling-hypothesis-y to least:

  1. If you take literally any DNN, and don't change the architecture or algorithm or hyperparameters at all, and just scale it up with appropriate training data and loss functions, we'll get AGI. And this is practical and realistic, and this is what will happen in the near future to create AGI.
  2. If you take literally any DNN, and don't change the architecture or algorithm or hyperparameters at all, and just scale it up with appropriate training data and loss functions, we'll get AGI in principle … but in practice obviously both people and meta-learning algorithms are working hard to find ever-better neural network architectures / algorithms / hyperparameters that give better performance per compute, and they will continue to do so. So in practice we should expect AGI to incorporate some tweaks compared to what we might build today.
  3. Those kinds of tweaks are not just likely for economic reasons but in fact necessary to get to AGI
  4. …and the necessary changes are not just "tweaks", they're "substantive changes / additions to the computation"
  5. …and they will be such a big change that the algorithm will need to involve some key algorithmic steps that are not matrix multiplications, ReLUs, etc.
  6. More than that, AGI will not involve anything remotely like a DNN, not even conceptually similar, not even as one of several components.

My impression is that you're at #2.

I put almost no weight on #6, and never have. In fact I'm a big believer in AGI sharing some aspects of DNNs, including distributed representations, and gradient descent (or at least "something kinda like gradient descent"), and learning from training data, and various kinds of regularization, etc. I think (uncoincidentally) that the family of (IMO neocortex-like) learning algorithms I was talking about in the main part of this post would probably have all those aspects, if they scale to AGI.

So my probability weight is split among #2,3,4,5.

To me, the question raised by the TreeQN paper is: should I shift some weight from #5 to #4?

When I look at the TreeQN paper, e.g. this source code file, I think I can characterize it as "they did some tree-structure-specific indexing operations". (You can correct me if I'm misunderstanding.) Are "tree-structure-specific indexing operations" included in my phrase "matrix multiplications, ReLUs, etc."? I dunno. Certainly stereotypical DNN code involves tons of indexing operations; it's not like it looks out of place! On the other hand, it is something that humans deliberately added to the code.

I guess in retrospect it was kinda pointless for me to make an argument for #5. I shouldn't have even brought it up. In the context of this post, I could have said: "The space of all possible learning algorithms is much vaster than Transformers—or even "slight tweaks on Transformers". Therefore we shouldn't take it for granted that either Transformers or "slight tweaks on Transformers" will necessarily scale to AGI—even if we believe in The Bitter Lesson."

And then a different (and irrelevant-to-this-post) question is whether "matrix multiplications, ReLUs, etc." (whatever that means) is a sufficiently flexible toolkit to span much of the space of all possible useful learning algorithms, in an efficiently-implemented way. My change from yesterday is: if I interpret this "toolkit" to include arbitrary indexing & masking operations, and also arbitrary control flow—basically, if this "toolkit" includes anything that wouldn't look out of place in today's typical DNN source code—then this is a much broader space of (efficiently-implemented) algorithms than I was mentally giving it credit for. This makes me more apt to believe that future AGI algorithms will be built using tools in this toolkit, but also more apt to believe that those algorithms could nevertheless involve a lot of new ideas and ingenuity, and less apt to believe that it's feasible for something like AutoML-Zero to search through the whole space of things that you can do with this toolkit, and less apt to describe the space of things you can build with this toolkit as "algorithms similar to DNNs". For example, here is a probabilistic program inference algorithm that's (at least arguably/partly) built using this "toolkit", and I really don't think of probabilistic program inference as "similar to DNNs".

Information At A Distance Is Mediated By Deterministic Constraints

I'm sure you already know this, but information can also travel a large distance in one hop, like when I look up at night and see a star. Or if someone 100 years ago took a picture of a star, and I look at the picture now, information has traveled 110 years and 10 light-years in just two hops.

But anyway, your discussion seems reasonable AFAICT for the case you're thinking of.

Research agenda update

why do we "still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it"? The objective is fully specified…

Thanks, that was a helpful comment. I think we're making progress, or at least I'm learning a lot here. :)

I think your perspective is: we start with a prior—i.e. the prior is an ingredient going into the algorithm. Whereas my perspective is: to get to AGI, we need an agent to build the prior, so to speak. And this agent can be dangerous.

So for example, let's talk about some useful non-obvious concept, like "informational entropy". And let's suppose that our AI cannot learn the concept of "informational entropy" from humans, because we're in an alternate universe where humans haven't yet invented the concept of informational entropy. (Or replace "informational entropy" with "some important not-yet-discovered concept in AI alignment".)

In that case, I see three possibilities.

  • First, the AI never winds up "knowing about" informational entropy or anything equivalent to it, and consequently makes worse predictions about various domains (human scientific and technological progress, the performance of certain algorithms and communications protocols, etc.)
  • Second (I think this is your model?): the AI's prior has a combinatorial explosion with every possible way of conceptualizing the world, of which an astronomically small proportion are actually correct and useful. With enough data, the AI settles into a useful conceptualization of the world, including some sub-network in its latent space that's equivalent to informational entropy. In other words: it "discovers" informational entropy by dumb process of elimination.
  • Third (this is my model): we get a prior by running a "prior-building AI". This prior-building AI has "agency"; it "actively" learns how the world works, by directing its attention etc. It has curiosity and instrumental reasoning and planning and so on, and it gradually learns instrumentally-useful metacognitive strategies, like a habit of noticing and attending to important and unexplained and suggestive patterns, and good intuitions around how to find useful new concepts, etc. At some point it notices some interesting and relevant patterns, attends to them, and after a few minutes of trial-and-error exploration it eventually invents the concept of informational entropy. This new concept (and its web of implications) then gets incorporated into the AI's new "priors" going forward, allowing the AI to make better predictions and formulate better plans in the future, and to discover yet more predictively-useful concepts, etc. OK, now we let this "prior-building AI" run and run, building an ever-better "prior" (a.k.a. "world-model"). And then at some point we can turn this AI off, and export this "prior" into some other AI algorithm. (Alternatively, we could also more simply just have one AI which is both the "prior-building AI" and the AI that does, um, whatever we want our AIs to do.)

It seems pretty clear to me that the third approach is way more dangerous than the second. In particular, the third one is explicitly doing instrumental planning and metacognition, which seem like the same kinds of activities that could lead to the idea of seizing control of the off-switch etc.

However, my hypothesis is that the third approach can get us to human-level intelligence (or what I was calling a "superior epistemic vantage point") in practice, and that the other approaches can't.

So, I was thinking about the third approach—and that's why I said "we still have the whole AGI alignment / control problem" (i.e., aligning and controlling the "prior-building AI"). Does that help?
