consider the fusion power generator scenario
It's possible that I misunderstood what you were getting at in that post. I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story? You could have equally well written the post as “Suppose, a few years from now, I set about trying to design a cheap, simple fusion power generator - something I could... (read more)
I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.
So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.” From that perspective, I would be concerned that if the (so-called) subroutine never wanted to do anything bad or stupid, the... (read more)
should be conceptually straightforward to model how humans would reason about those concepts or value them
Let’s say that the concept of an Em had never occurred to me before, and now you knock on my door and tell me that there’s a thing called Ems, and you know how to make them but you need my permission, and now I have to decide whether or not I care about the well-being of Ems. What do I do? I dunno, I would think about the question in different ways, I would try to draw analogies to things I already knew about, maybe I would read some philosophy papers,... (read more)
This is probably particularly characteristic of my approach.
Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).
my approach … ontological lock …
I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.
Thanks! One of my current sources of mild skepticism (which again you might talk me out of) is:
Thanks! Follow-up question: Do you see yourself as working towards “Proof Strategy 2”? Or “none of the above”?
I would say “Humanity's current state is so spectacularly incompetent that even the obvious problems with obvious solutions might not be solved”.
If humanity were not spectacularly incompetent, then maybe we wouldn't have to worry about the obvious problems with obvious solutions. But we would still need to worry about the obvious problems with extremely difficult and non-obvious solutions.
The nonobvious problems are the whole reason why AI alignment is hard in the first place.
I disagree with the implication that there’s nothing to worry about on the “obvious problems” side.
An out-of-control AGI self-reproducing around the internet, causing chaos and blackouts etc., is an “obvious problem”. I still worry about it.
After all, consider this: an out-of-control virus self-reproducing around the human population, causing death and disability etc., is also an “obvious problem”. We already have this problem; we’ve had this problem for millenni... (read more)
I really don't know much about frontal lobotomy patients. I’ll irresponsibly speculate anyway.
I think “figuring out the solution to tricky questions” has a lot in common with “getting something tricky done in the real world”, despite the fact that one involves “internal” actions (i.e., thinking the appropriate thoughts) and the other is “external” actions (i.e., moving the appropriate muscles). I think they both require the same package of goal-oriented planning, trial-and-error exploration via RL, and so on. (See discussion of “RL-on-thoughts” h... (read more)
I guess it depends on “how fast is fast and how slow is slow”, and what you say is true on the margin, but here's my plea that the type of thinking that says “we want some technical problem to eventually get solved, so we try to solve it” is a super-valuable type of thinking right now even if we were somehow 100% confident in slow takeoff. (This is mostly an abbreviated version of this section.)
Huh. I would have invoked a different disorder.
I think that if we replace the Thought Assessor & Steering Subsystem with the function “RPE = +∞ (regardless of what's going on)”, the result is a manic episode, and if we replace it with the function “RPE = -∞ (regardless of what's going on)”, the result is a depressive episode.
In other words, the manic episode would be kinda like the brainstem saying “Whatever thought you're thinking right now is a great thought! Whatever you're planning is an awesome plan! Go forth and carry that plan out with gusto!!!!... (read more)
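To make that intuition concrete, here's a toy sketch (all names and numbers are my own invention, nothing like a serious model): a "thought generator" proposes candidate thoughts, an "assessor" scores them with an RPE, and the thought only gets acted on if the RPE clears a threshold. Swapping the assessor for a constant +∞ or -∞ makes the system endorse every thought or none of them:

```python
import random

def thought_generator():
    """Propose a random candidate thought (toy stand-in)."""
    return random.choice(["plan A", "plan B", "do nothing", "reconsider"])

def normal_assessor(thought):
    """Toy stand-in for the Thought Assessor: context-dependent RPE."""
    return {"plan A": 0.5, "plan B": -0.3, "do nothing": 0.0, "reconsider": 0.1}[thought]

def run(assessor, steps=100, threshold=0.0):
    """Fraction of proposed thoughts that get endorsed and acted upon."""
    accepted = 0
    for _ in range(steps):
        thought = thought_generator()
        if assessor(thought) > threshold:
            accepted += 1
    return accepted / steps

manic = run(lambda t: float("inf"))       # "every plan is an awesome plan!"
depressed = run(lambda t: float("-inf"))  # no thought ever gets endorsed
print(manic, depressed)  # 1.0 and 0.0
```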
If it makes sense to differentiate the "Thought Generator" and "Thought Assessor" as two separate modules, is it possible to draw a parallel to language models, which seem to have a strong ability to generate sentences but lack the ability to assess whether they are good?
Hmm. An algorithm trained to reproduce human output is presumably being trained to imitate the input-output behavior of the whole system including Thought Generator and Thought Assessor and Steering Subsystem.
I’m trying to imagine deleting the Thought Assessor & Steering Subsystem, and repla... (read more)
I really like this. Ever since I read your first model splintering post, it's been a central part of my thinking too.
I feel cautiously optimistic about the prospects for generating multiple hypotheses and detecting when they come into conflict out-of-distribution (although the details are kinda different for the Bayes-net-ish models that I tend to think about than the deep neural net models that I understand y'all are thinking about).
I remain much more confused about what to do when that detector goes off, in a future AGI.
I imagine a situation where some i... (read more)
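For what it's worth, the basic "disagreement as an out-of-distribution detector" idea is simple enough to sketch in a few lines (a toy version under assumptions I'm making up: 1-D regression, bootstrap-resampled polynomial fits as the "ensemble", standard deviation of predictions as the disagreement signal):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data on [0, 1]; everything outside that range is "OOD".
x_train = rng.uniform(0.0, 1.0, size=(200, 1))
y_train = np.sin(3 * x_train[:, 0]) + 0.05 * rng.normal(size=200)

def fit_member(x, y, degree=5, seed=0):
    """Fit a small polynomial on a bootstrap resample (one ensemble member)."""
    r = np.random.default_rng(seed)
    idx = r.integers(0, len(x), size=len(x))
    return np.polyfit(x[idx, 0], y[idx], degree)

ensemble = [fit_member(x_train, y_train, seed=s) for s in range(10)]

def disagreement(x):
    """Std. dev. across ensemble predictions: large means likely OOD."""
    preds = np.array([np.polyval(coeffs, x) for coeffs in ensemble])
    return preds.std(axis=0)

in_dist = disagreement(np.array([0.5]))[0]
out_of_dist = disagreement(np.array([3.0]))[0]
print(in_dist, out_of_dist)  # disagreement is far larger off-distribution
```

The members agree where the training data pins them down and diverge wildly when extrapolating, which is exactly the alarm signal; what to *do* when the alarm goes off is the part I remain confused about.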
I think I weakly disagree with the implication that “distillation” should be thought of as a different category of activity from “original research”. It is in a superficial sense, but a lot of the underlying activities and skills and motivations overlap. For example, original researchers also have the experience of reading something, feeling confused about it, and then eventually feeling less confused about it. They just might not choose to spend the time writing up how they came to be less confused. Conversely, someone trying to understand something for t... (read more)
I think you're saying: if a thing is messy, at least there can be a non-messy procedure / algorithm that converges to (a.k.a. points to) the thing. I think I'm with Charlie in feeling skeptical about this in regards to value learning, because I think value learning is significantly a normative question. Let me elaborate:
My genes plus 1.2e9 seconds of experience have built a fundamentally messy set of preferences, which are in some cases self-inconsistent, easily-manipulated, invalid-out-of-distribution, etc. It's easy enough to point to the set ... (read more)
There was one paragraph from the podcast that I found especially enlightening—I excerpted it here (Section 3.2.3).
I’m advocating for the first one—P is trying to predict the next ground-truth injection. Does something trouble you about that?
I think of myself as a generally intelligent agent with a compact world model
In what sense? Your world-model is built out of ~100 trillion synapses, storing all sorts of illegible information including “the way my friend sounds when he talks with his mouth full” and “how it feels to ride a bicycle whose gears need lubrication”.
(or a compact function which is able to estimate and approximate a world model)
That seems very different though! The GPT-3 source code is rather compact (gradient descent etc.); combine it with data and you get a huge and extraordina... (read more)
RE legibility: In my mind, I don’t normally think there’s a strong connection between agent foundations and legibility.
If the AGI has a common-sense understanding of the world (which presumably it does), then it has a world-model, full of terabytes of information of the sort "tires are usually black" etc. It seems to me that the world-model will either be built by humans (e.g. Cyc), or (much more likely) learned automatically by an algorithm, and if it's the latter, it will be unlabeled by default, and it's on us to label it somehow, and there's no ... (read more)
Here's a twitter thread wherein Nathaniel Daw gently pushes back on my dopamine neuron discussion of Section 5.5.6.
Huh. I'm under the impression that "offense-defense balance for technology-inventing AGIs" is also a big cruxy difference between you and Eliezer.
Specifically: if almost everyone is making helpful aligned norm-following AGIs, but one secret military lab accidentally makes a misaligned paperclip maximizer, can the latter crush all competition? My impression is that Eliezer thinks yes: there's really no defense against self-replicating nano-machines, so the only paths to victory are absolutely perfect compliance forever (which he sees as implausible, given s... (read more)
Right, I think there's one reward function (well, one reward function that's relevant for this discussion), and that for every thought we think, we're thinking it because it's rewarding to do so—or at least, more rewarding than alternative thoughts. Sometimes a thought is rewarding because it involves feeling good now, sometimes it's rewarding because it involves an expectation of feeling good in the distant future, sometimes it's rewarding because it involves an expectation that it will make your beloved friend feel good, sometimes it's rewarding b... (read more)
I wrote Consequentialism & Corrigibility shortly after and partly in response to the first (Ngo-Yudkowsky) discussion. If anyone has an argument or belief that the general architecture / approach I have in mind (see the “My corrigibility proposal sketch” section) is fundamentally doomed as a path to corrigibility and capability—as opposed to merely “reliant on solving lots of hard-but-not-necessarily-impossible open problems”—I'd be interested to hear it. Thanks in advance. :)
Again I don't have an especially strong opinion about what our prior should be on possible motivation systems for an AGI trained by straightforward debate, and in particular what fraction of those motivation systems are destructive. But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that. I'll assume the AGI uses model-based RL, which would (I claim) put it very roughly in the same category as humans.
Some aspects of motivation have an obvious relationship / correlation to the reward signal. In the human cas... (read more)
The RL-on-thoughts discussion was meant as an argument that a sufficiently capable AI needs to be “trying” to do something. If we agree on that part, then you can still say my previous comment was a false dichotomy, because the AI could be “trying” to (for example) “win the debate while following the spirit of the debate rules”.
And yes I agree, that was bad of me to have listed those two things as if they're the only two options.
I guess what I was thinking was: If we take the most straightforward debate setup, and if it gets an AI that is “trying” to do so... (read more)
is there a little goal-oriented mind inside there that solves science problems the same way humans solve them, by engineering mental constructs that serve a goal of prediction, including backchaining for prediction goals and forward chaining from alternative hypotheses / internal tweaked states of the mental construct?
In case it helps anyone to hear different people talking about the same thing, I think Eliezer in this quote is describing a similar thing as my discussion here (search for the phrase “RL-on-thoughts”).
So my objection to debate (which again I... (read more)
The predictor is a parametrized function output = f(context, parameters) (where "parameters" are also called "weights"). If (by assumption) context is static, then you're running the function on the same inputs over and over, so you have to keep getting the same answer. Unless there's an error changing the parameters / weights. But the learning rate on those parameters can be (and presumably would be) relatively low. For example, the time constant (for the exponential decay of a discrepancy between output and supervisor when in "override mode") could be ma... (read more)
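Here's a toy linear version of that claim (my own illustrative numbers, not anything from the literature): with a static context and a low learning rate, the gap between the predictor's output and the supervisor shrinks by a constant factor each step, i.e. it decays exponentially with a time constant set by the learning rate.

```python
# Toy predictor: output = f(context, parameters), with f linear and context static.
context = 1.0
param = 0.0          # the predictor's single weight
supervisor = 5.0     # ground-truth signal it is overridden toward
lr = 0.01            # low learning rate -> long time constant

gaps = []
for step in range(1000):
    output = param * context
    gaps.append(abs(supervisor - output))
    # In "override mode" the parameter is nudged toward the supervisor:
    param += lr * (supervisor - output) * context

# The gap decays as (1 - lr)**step, so halving lr doubles the time constant.
print(gaps[0], gaps[100], gaps[500])
```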
I'm not 100% sure and didn't chase down the reference, but in context, I believe the claim “the [infant decorticate rats] appear to suckle normally and develop into healthy adult rats” should be read as “they find their way to their mother's nipple and suckle”, not just “they suckle when their mouth is already in position”.
Pathfinding to a nipple doesn't need to be “pathfinding” per se, it could potentially be as simple as moving up an odor gradient, and randomly reorienting when hitting an obstacle. I dunno, I tried watching a couple videos of neonatal mi... (read more)
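The "odor gradient plus random reorientation" strategy is simple enough to simulate (a toy sketch with every detail invented: grid world, Manhattan-distance odor field, a wall of obstacles): an agent that just steps uphill on the concentration field, turning randomly when blocked, reaches the source with no pathfinding at all.

```python
import random

random.seed(0)

SOURCE = (15, 15)                            # location emitting the odor
OBSTACLES = {(10, y) for y in range(5, 12)}  # a wall the agent may bump into

def odor(pos):
    """Concentration falls off with distance from the source."""
    return -abs(pos[0] - SOURCE[0]) - abs(pos[1] - SOURCE[1])

def step(pos):
    """Move up the odor gradient; reorient randomly when blocked."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    best = max(moves, key=lambda m: odor((pos[0] + m[0], pos[1] + m[1])))
    nxt = (pos[0] + best[0], pos[1] + best[1])
    if nxt in OBSTACLES:                     # hit an obstacle: random reorientation
        m = random.choice(moves)
        nxt = (pos[0] + m[0], pos[1] + m[1])
        if nxt in OBSTACLES:
            nxt = pos
    return nxt

pos = (0, 0)
for t in range(200):
    if pos == SOURCE:
        break
    pos = step(pos)
print(pos)  # ends at the source
```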
I find myself skeptical of treating e.g. the behavior of Virginia opossum newborns as either solely driven by the hypothalamus and brainstem ("newborn opossums climb up into the female opossum's pouch and latch onto one of her 13 teats.", especially when combined with "the average litter size is 8–9 infants") or learnt from scratch (among other things, gestation lasts 11–13 days).
Hmm. Why don't you think that behavior might be solely driven by the hypothalamus & brainstem?
For what it's worth, decorticate infant rats (rats whose cortex was surgical... (read more)
Thanks for the great comment!
if the upstream circuit learns entirely from scratch, you can't really have hardwired downstream predictors, for lack of anything stable to hardwire them to.

I don't see a clear argument for the premise.
That would be Post #2 :-)
consider the following hilariously oversimplified sketch of how to have hardwired predictors in an otherwise mainly-learning-from-scratch circuit…
I don't have strong a priori opposition to this (if I understand it correctly), although I happen to think that it's not how any part of the brain works.
If it ... (read more)
Hmm, the only overlap I can see between your recent work and this description (including optimism about very-near-term applications) is the idea of training an ensemble of models on the same data, and then if the models disagree with each other on a new sample, then we're probably out of distribution (kinda like the Yarin Gal dropout ensemble thing and much related work).
And if we discover that we are in fact out of distribution, then … I don't know. Ask a human for help?
If that guess is at all on the right track (very big "if"!), I endorse it as a promisi... (read more)
I've been assuming the latter... Seems to me that there's enough latency in the whole system that it can be usefully reduced somewhat without any risk of reducing it below zero and thus causing instability etc.
I can imagine different signals being hardwired to different time-intervals (possibly as a function of age).
I can also imagine the interval starts low, and creeps up, millisecond by millisecond over the weeks, as long as the predictions keep being accurate, and conversely creeps down when the predictions are inaccurate. (I think that would work in pr... (read more)
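That creep-up / creep-down rule is easy to write out as a toy update (step size, bounds, and starting interval all made up for illustration): nudge the interval upward each time the prediction was accurate, downward each time it wasn't, clamped to a sensible range.

```python
def adapt_interval(interval_ms, accurate, step_ms=1, lo=10, hi=500):
    """Nudge the prediction interval: creep up when accurate, down when not."""
    interval_ms += step_ms if accurate else -step_ms
    return max(lo, min(hi, interval_ms))

# A run of accurate predictions, then a run of failures:
interval = 100
for _ in range(50):
    interval = adapt_interval(interval, accurate=True)
for _ in range(20):
    interval = adapt_interval(interval, accurate=False)
print(interval)  # 100 + 50 - 20 = 130
```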
I like this post. I'm not sure how decision-relevant it is for technical research though…
If there isn't a broad basin of attraction around human values, then we really want the AI (or the human-AI combo) to have "values" that, though they need not be exactly the same as the human, are at least within the human distribution. If there is a broad basin of attraction, then we still want the same thing, it's just that we'll ultimately get graded on a more forgiving curve. We're trying to do the same thing either way, right?
(Warning: I might not be describing this well. And might be stupid.)
I feel like there's an alternate perspective compared to what you're doing, and I'm trying to understand why you don't take that path. Something like: You're taking the perspective that there is one Bayes net that we want to understand. And the alternative perspective is: there is a succession of "scenarios", and each is its own Bayes net, and there are deep regularities connecting them—for example they all obey the laws of physics, there's some relation between the "chairs" in one scenari... (read more)
the attractor basin is very thin along some dimensions, but very thick along some other dimensions
There was a bunch of discussion along those lines in the comment thread on this post of mine a couple years ago, including a claim that Paul agrees with this particular assertion.
I don't think anyone except for Jeff Hawkins believes in literal cortical uniformity.
Not even him! Jeff Hawkins: "Mountcastle’s proposal that there is a common cortical algorithm doesn’t mean there are no variations. He knew that. The issue is how much is common in all cortical regions, and how much is different. The evidence suggests that there is a huge amount of commonality."
I mentioned "non-uniform neural architecture and hyperparameters". I'm inclined to put different layer thicknesses (including agranularity) in the category of "non-uniform hyp... (read more)
Thanks! I'm not sure we have much disagreement here. Some relevant issues are:
Good question! Here are a couple more specific open questions and how I think about them:
For #1, my answer is "Wow, that would be super duper awesome, and it would make me dramatically more optimistic, so I sure hope someone figures out... (read more)
If I'm not mistaken, the things you brought up are at too low a level to be highly relevant for safety, in my opinion. I guess this series will mostly be at Marr's "computational level" whereas you're asking "algorithm-level" questions, more or less. I'll be talking a lot about things vaguely like "what is the loss function" and much less about things vaguely like "how is the loss function minimized".
For example, I think you can train a neural Turing machine with supervised learning, or you can train a neural Turing machine with RL. That distinction... (read more)
(Warning: thinking out loud.)
Hmm. Good points.
Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence. Even if within that argument, everything points to "I'm in a simulation etc.", there's a big heap of "is metacosmology really what I should be thinking about?"-type uncertainty on top. At least for me.
I think "people who do counterintuitive things" for religious reasons usually have more direct motivations—maybe they have mental health issues ... (read more)
Thanks for that! But I share with OP the intuition that these are weird failure modes that come from weird reasoning. More specifically, it's weird from the perspective of human reasoning.
It seems to me that your story is departing from human reasoning when you say "you posses a great desire to help whomever has summoned you into the world". That's one possible motivation, I suppose. But it wouldn't be a typical human motivation.
The human setup is more like: you get a lot of unlabeled observations and assemble them into a predictive world-model, and you al... (read more)
I still think this post is correct in spirit, and was part of my journey towards a good understanding of neuroscience, and promising ideas in AGI alignment / safety.
But there are a bunch of little things that I got wrong or explained poorly. Shall I list them?
First, my "neocortex vs subcortex" division eventually developed into "learning subsystem vs steering subsystem", with the latter being mostly just the hypothalamus and brainstem, and the former being everything else, particularly the whole telencephalon and cerebellum. The main difference is that the "... (read more)
By the same token, I think every neurotypical human thinking about Newcomb's problem is using counterfactual reasoning, and I think that there isn't any interesting difference in the general nature of the counterfactual reasoning that they're using.
I find this confusing as CDT counterfactuals where you can only project forward seem very different from things like FDT where you can project back in time as well.
I think there is "machinery that underlies counterfactual reasoning" (which incidentally happens to be the same as "the machinery that underlies imag... (read more)
I like this post but I'm a bit confused about why it would ever come up in AI alignment. Since you can't get an "ought" from an "is", you need to seed the AI with labeled examples of things being good or bad. There are a lot of ways to do that, some direct and some indirect, but you need to do it somehow. And once you do that, it would presumably disambiguate "trust public-emotional supervisor" from "trust private-calm supervisor".
Hmm, maybe the scheme you have in mind is something like IRL? I.e.: (1) AI has a hardcoded template of "Boltzmann rational agen... (read more)
I also think that there are lots of specific operations that are all "counterfactual reasoning"
Agreed. This is definitely something that I would like further clarity on
Hmm, my hunch is that you're misunderstanding me here. There are a lot of specific operations that are all "making a fist". I can clench my fingers quickly or slowly, strongly or weakly, left hand or right hand, etc. By the same token, if I say to you "imagine a rainbow-colored tree; are its leaves green?", there are a lot of different specific mental models that you might be invoking. (It c... (read more)
I think brains build a generative world-model, and that world-model is a certain kind of data structure, and "counterfactual reasoning" is a class of operations that can be performed on that data structure. (See here.) I think that counterfactual reasoning relates to reality only insofar as the world-model relates to reality. (In map-territory terminology: I think counterfactual reasoning is a set of things that you can do with the map, and those things are related to the territory only insofar as the map is related to the territory.)
I also think that ther... (read more)
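One extremely simplified way to cash out "counterfactual reasoning is a set of things you can do with the map" (a toy sketch, all structure invented): represent the world-model as a data structure where each variable is computed from its parents, and implement a counterfactual query as an operation that overrides a variable and re-runs the model.

```python
def evaluate(model, interventions=None):
    """Run the world-model, optionally overriding ("intervening on") variables."""
    interventions = interventions or {}
    values = {}
    for name, fn in model.items():          # assumes parents come before children
        values[name] = interventions.get(name, fn(values))
    return values

# A toy causal world-model: it's raining, so the grass is wet.
model = {
    "rain":      lambda v: True,
    "sprinkler": lambda v: False,
    "wet_grass": lambda v: v["rain"] or v["sprinkler"],
}

factual = evaluate(model)
counterfactual = evaluate(model, {"rain": False})   # "what if it hadn't rained?"
print(factual["wet_grass"], counterfactual["wet_grass"])  # True False
```

Note that the counterfactual query only ever touches the data structure; it tells you about reality only insofar as the model itself does.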
How much are you interested in a positive vs normative theory of counterfactuals? For example, do you feel like you understand how humans do counterfactual reasoning, and how and why it works for them (insofar as it works for them)? If not, is such an understanding what you're looking for? Or do you think humans are not perfect at counterfactual reasoning (e.g. maybe because people disagree with each other about Newcomb's problem etc.) and there's some deep notion of "correct counterfactual reasoning" that humans are merely approximating, and the deeper "c... (read more)
I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI.
The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an ... (read more)
Yes, what you said. The opposite of "a human-legible learning algorithm" is "a nightmarishly-complicated Rube-Goldberg-machine learning algorithm".
If the latter is what we need, we could still presumably get AGI, but it would involve some automated search through a big space of many possible nightmarishly-complicated Rube-Goldberg-machine learning algorithms to find one that works.
That would be a different AGI development story, and thus a different blog post. Instead of "humans figure out the learning algorithm" as an exogenous input to the path-to-AGI, w... (read more)
I understood the idea of Paul's post as: if we start in a world where humans-with-aligned-AIs control 50% of relevant resources (computers, land, minerals, whatever), and unaligned AIs control 50% of relevant resources, and where the strategy-stealing assumption is true—i.e., the assumption that any good strategy that the unaligned AIs can do, the humans-with-aligned-AIs are equally capable of doing themselves—then the humans-with-aligned-AIs will wind up controlling 50% of the long-term future. And the same argument probably holds for 99%-1% or any other ... (read more)