I'm an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms. See https://sjbyrnes.com/agi.html Email: email@example.com
Oh, yup, makes sense thanks
now suppose this curve represents the human ratings of different courses of action, and you choose the action that your model says will have the highest rating. You're going to predictably mess up again, because of the optimizer's curse (or regressional Goodhart on the correlation between modeled rating and actual rating).
It's not obvious to me how the optimizer's curse fits in here (if at all). If each of the evaluations has the same noise, then picking the action that the model says will have the highest rating is the right thing to do. The optimizer's curse says that the model is likely to overestimate how good this "best" action is, but so what? "Mess up" conventionally means "the AI picked the wrong action", and the optimizer's curse is not related to that (unless there's variable noise across different choices and the AI didn't correct for that). Sorry if I'm misunderstanding.
This is great! Here's a couple random thoughts:
Hybrid statistical-symbolic HLMI
I think there's a common habit of conflating "symbolic" with "not brute force" and "not bitter lesson" but that it's not right. For example, if I were to write an algorithm that takes a ton of unstructured data and goes off and builds a giant PGM that best explains all that data, I would call that a "symbolic" AI algorithm (because PGMs are kinda discrete / symbolic / etc.), but I would also call it a "statistical" AI algorithm, and I would certainly call it "compatible with The Bitter Lesson".
(Incidentally, this description is pretty close to my oversimplified caricature description of what the neocortex does.)
(I'm not disputing that "symbolic" and "containing lots of handcrafted domain-specific structure" do often go together in practice today—e.g. Josh Tenenbaum's papers tend to have both and OpenAI papers tend to have neither—I'm just saying they don't necessarily go together.)
I don't have a concrete suggestion of what if anything you should change here, I'm just chatting. :-)
This is all fine except that I kinda don't like the term "cognitive science" for what you're talking about. Maybe it's just me, but anyway here's where I'm coming from:
Learning algorithms almost inevitably have the property that the trained models are more complex than the learning rules that create them. For example, compare the code to run gradient descent and train a ConvNet (it's not very complicated) to the resulting image-classification algorithms as explored in the OpenAI microscope project (they're much more complicated).
I bring this up because "cognitive science" to me has a connotation of "the study of how human brains do all the things that human brains do", especially adult human brains. After all, that's the main thing that most cognitive scientists study AFAICT. So if you think that human intelligence is "mostly a trained model", then you would think that most cognitive science is "mostly akin to OpenAI microscope" as opposed to "mostly akin to PyTorch", and therefore mostly unnecessary and unhelpful for building HLMI. You don't have to think that—certainly Gary Marcus & Steven Pinker don't—but I do (to a significant extent) and at least a few other prominent neuroscientists do too (e.g. Randall O'Reilly). (See "learning-from-scratch-ism" section here, also cortical uniformity here.) So as much as I buy into (what you call) "the cognitive-science approach", I'm not just crazy about that term, and for my part I prefer to talk about "brain algorithms" or "high-level brain algorithms". I think "brain algorithms" is more agnostic about the nature of the algorithms, and in particular whether it's the kinds of algorithms that neuroscientists talk about, versus the kinds of algorithms that computational cognitive scientists & psychologists talk about.
Imagine we have a learning algorithm that learns a world-model, and flags things in the world-model as "goals", and then makes plans to advance those "goals". (An example of such an algorithm is (part of) the human brain, more-or-less, according to me.) We can say the algorithm is "aligned" if the things flagged as "goals" do in fact corresponding to maximizing the objective function (e.g. "predict the human's outputs"), or at least it's as close a match as anything in the world-model, and if this remains true even as the world-model gets improved and refined over time.
Making that definition better and rigorous would be tricky because it's hard to talk rigorously about symbol-grounding, but maybe it's not impossible. And if so, I would say that this is a definition of "aligned" which looks nothing like a performance guarantee.
OK, hmmm, after some thought, I guess it's possible that this definition of "aligned" would be equivalent to a performance-centric claim along the lines of "asymptotically, performance goes up not down". But I'm not sure that it's exactly the same. And even if it were mathematically equivalent, we still have the question of what the proof would look like, out of these two possibilities:
I think it would be the first one, not the second. Why? Because it seems to me that the alignment problem is hard, and if it's solvable at all, it would only be solvable with the help of various specific "alignment-promoting algorithm features", and we won't be able to prove that those features work except by "direct reasoning about alignment".
Cool, gotcha, thanks. So my current expectation is either: (1) we will never be able to prove any performance guarantees about human-level learning algorithms, or (2) if we do, those proofs would only apply to certain algorithms that are packed with design features specifically tailored to solve the alignment problem, and any proof of a performance guarantee would correspondingly have a large subsection titled "Lemma 1: This learning algorithm will be aligned".
The reason I think that is that (as above) I expect the learning algorithms in question to be kinda "agential", and if an "agential" algorithm is not "trying" to perform well on the objective, then it probably won't perform well on the objective! :-)
If that view is right, the implication is: the only way to get a performance guarantee is to prove Lemma 1, and if we prove Lemma 1, we no longer care about the performance guarantee anyway, because we've already solved the alignment problem. So the performance guarantee would be besides the point (on this view).
Hmmm, OK, let me try again.
You wrote earlier: "the algorithm somehow manages to learn those hypotheses, for example by some process of adding more and more detail incrementally".
My claim is that good-enough algorithms for "adding more and more detail incrementally" will also incidentally (by default) be algorithms that seize control of their off-switches.
And the reason I put a lot of weight on this claim is that I think the best algorithms for "adding more and more detail incrementally" may be algorithms that are (loosely speaking) "trying" to understand and/or predict things, including via metacognition and instrumental reasoning.
OK, then the way I'm currently imagining you responding to that would be:
My model of Vanessa: We're hopefully gonna find a learning algorithm with a provable regret bound (or something like that). Since seizing control of the off-switch would be very bad according to the objective function and thus violate the regret bound, and since we proved the regret bound, we conclude that the learning algorithm won't seize control of the off-switch.
(If that's not the kind of argument you have in mind, oops sorry!)
Otherwise: I feel like that's akin to putting "the AGI will be safe" as a desideratum, which pushes "solve AGI safety" onto the opposite side of the divide between desiderata vs. learning-algorithm-that-satisfies-the-desiderata. That's perfectly fine, and indeed precisely defining "safe" is very useful. It's only a problem if we also claim that the "find a learning algorithm that satisfies the desiderata" part is not an AGI safety problem. (Also, if we divide the problem this way, then "we can't find a provably-safe AGI design" would be re-cast as "no human-level learning algorithms satisfy the desiderata".)
That's also where I was coming from when I expressed skepticism about "strong formal guarantees". We have no performance guarantee about the brain, and we have no performance guarantee about AlphaGo, to my knowledge. Again, as above, I was imagining an argument that turns a performance guarantee into a safety guarantee, like "I can prove that AlphaGo plays go at such-and-such Elo level, and therefore it must not be wireheading, because wireheaders aren't very good at playing Go." If you weren't thinking of performance guarantees, what "formal guarantees" are you thinking of?
(For what little it's worth, I'd be a bit surprised if we get a safety guarantee via a performance guarantee. It strikes me as more promising to reason about safety directly—e.g. "this algorithm won't seize control of the off-switch because blah blah incentives blah blah mesa-optimizers blah blah".)
Sorry if I'm still misunderstanding. :)
Thanks!! Here's where I'm at right now.
In the grandparent comment I suggested that if we want to make an AI that can learn sufficiently good hypotheses to do human-level things, perhaps the only way to do that is to make a "prior-building AI" with "agency" that is "trying" to build out its world-model / toolkit-of-concepts-and-ideas in fruitful directions. And I said that we have to solve the problem of how to build that kind of agential "prior-building AI" that doesn't also incidentally "try" to seize control of its off-switch.
Then in the parent comment you replied (IIUC) that if this is a problem at all, it's not the problem you're trying to solve (i.e. "finding good formal desiderata for safe TAI"), but a different problem (i.e. "developing learning algorithms with strong formal guarantees and/or constructing a theory of formal guarantees for existing algorithms"), and my problem is "to a first approximation orthogonal" to your problem, and my problem "receives plenty of attention from outside the existential safety community".
If so, my responses would be:
Or maybe another way of saying it would be: I think I put a lot of weight on the possibility that those "learning algorithms with strong formal guarantees" will turn out not to exist, at least not at human-level capabilities.
I guess, when I read "learning algorithms with strong formal guarantees", I'm imaging something like multi-armed bandit algorithms that have regret bounds. But I'm having trouble imagining how that kind of thing would transfer to a domain where we need the algorithm to discover new concepts and leverage them for making better predictions, and we don't know a priori what the concepts look like, or how many there will be, or how hard they will be to find, or how well they will generalize, etc.
Thanks for the comment!
First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.
Yeah even when I wrote this, I had already seen claims that the so-called MCTS is deterministic. But DeepMind and everyone else apparently calls it MCTS, and I figured I should just follow the crowd, and maybe this is just one of those things where terminology drifts in weird directions and one shouldn't think too hard about it, like how we say "my phone is ringing" when it's actually vibrating. :-P
Looking into it again just now, I'm still not quite sure what's going on. This person says AlphaZero switches from random to non-random after 15 moves. And this person says AlphaZero is totally deterministic but "MCTS" is still the proper term, for reasons that don't make any sense to me. I dunno and I'm open to being educated here. Regardless, if you tell me that I should call it "tree search" instead of "MCTS", I'm inclined to take your word for it. I want to be part of the solution not part of the problem :-D
NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion.
That's an interesting example. I think I need to tone down my claim a bit (and edit the post). Thank you. I will now say exactly what I'm making of that example:
Here is a spectrum of things that one might believe, from most-scaling-hypothesis-y to least:
My impression is that you're at #2.
I put almost no weight on #6, and never have. In fact I'm a big believer in AGI sharing some aspects of DNNs, including distributed representations, and gradient descent (or at least "something kinda like gradient descent"), and learning from training data, and various kinds of regularization, etc. I think (uncoincidentally) that the family of (IMO neocortex-like) learning algorithms I was talking about in the main part of this post would probably have all those aspects, if they scale to AGI.
So my probability weight is split among #2,3,4,5.
To me, the question raised by the TreeQN paper is: should I shift some weight from #5 to #4?
When I look at the TreeQN paper, e.g. this source code file, I think I can characterize it as "they did some tree-structure-specific indexing operations". (You can correct me if I'm misunderstanding.) Are "tree-structure-specific indexing operations" included in my phrase "matrix multiplications, ReLUs, etc."? I dunno. Certainly stereotypical DNN code involves tons of indexing operations; it's not like it looks out of place! On the other hand, it is something that humans deliberately added to the code.
I guess in retrospect it was kinda pointless for me to make an argument for #5. I shouldn't have even brought it up. In the context of this post, I could have said: "The space of all possible learning algorithms is much vaster than Transformers—or even "slight tweaks on Transformers". Therefore we shouldn't take it for granted that either Transformers or "slight tweaks on Transformers" will necessarily scale to AGI—even if we believe in The Bitter Lesson."
And then a different (and irrelevant-to-this-post) question is whether "matrix multiplications, ReLUs, etc." (whatever that means) is a sufficiently flexible toolkit to span much of the space of all possible useful learning algorithms, in an efficiently-implemented way. My change from yesterday is: if I interpret this "toolkit" to include arbitrary indexing & masking operations, and also arbitrary control flow—basically, if this "toolkit" includes anything that wouldn't look out of place in today's typical DNN source code—then this is a much broader space of (efficiently-implemented) algorithms than I was mentally giving it credit for. This makes me more apt to believe that future AGI algorithms will be built using tools in this toolkit, but also more apt to believe that those algorithms could nevertheless involve a lot of new ideas and ingenuity, and less apt to believe that it's feasible for something like AutoML-Zero to search through the whole space of things that you can do with this toolkit, and less apt to describe the space of things you can build with this toolkit as "algorithms similar to DNNs". For example, here is a probabilistic program inference algorithm that's (at least arguably/partly) built using this "toolkit", and I really don't think of probabilistic program inference as "similar to DNNs".
I'm sure you already know this, but information can also travel a large distance in one hop, like when I look up at night and see a star. Or if someone 100 years ago took a picture of a star, and I look at the picture now, information has traveled 110 years and 10 light-years in just two hops.
But anyway, your discussion seems reasonable AFAICT for the case you're thinking of.
why do we "still have the whole AGI alignment / control problem in defining what this RL system is trying to do and what strategies it’s allowed to use to do it"? The objective is fully specified…
Thanks, that was a helpful comment. I think we're making progress, or at least I'm learning a lot here. :)
I think your perspective is: we start with a prior—i.e. the prior is an ingredient going into the algorithm. Whereas my perspective is: to get to AGI, we need an agent to build the prior, so to speak. And this agent can be dangerous.
So for example, let's talk about some useful non-obvious concept, like "informational entropy". And let's suppose that our AI cannot learn the concept of "informational entropy" from humans, because we're in an alternate universe where humans haven't yet invented the concept of informational entropy. (Or replace "informational entropy" with "some important not-yet-discovered concept in AI alignment.)
In that case, I see three possibilities.
It seems pretty clear to me that the third approach is way more dangerous than the second. In particular, the third one explicitly doing instrumental planning and metacognition, which seems like the same kinds of activities that could lead to the idea of seizing control of the off-switch etc.
However, my hypothesis is that the third approach can get us to human-level intelligence (or what I was calling a "superior epistemic vantage point") in practice, and that the other approaches can't.
So, I was thinking about the third approach—and that's why I said "we still have the whole AGI alignment / control problem" (i.e., aligning and controlling the "prior-building AI"). Does that help?