Working on AGI safety via a deep-dive into brain algorithms, see https://sjbyrnes.com/agi.html
I agree about the general principle, even if I don't think this particular thing is an example because of the "not maximizing sum of future rewards" thing.
Hmm, I guess I mostly disagree because:
The way I'm thinking about AGI algorithms (based on how I think the neocortex works) is, there would be discrete "features" but they all come in shades of applicability from 0 to 1, not just present or absent. And by the same token, the reward wouldn't perfectly align with any "features" (since features are extracted from patterns in the environment), and instead you would wind up with "features" being "desirable" (correlated with reward) or "undesirable" (anti-correlated with reward) on a continuous scale from -∞ to +∞. And the agent would try to bring about "desirable" things rather than maximize reward per se, since the reward may not perfectly line up with anything in its ontology / predictive world-model. (Related.)
So then you sometimes have "a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y".
That kinda has some spiritual similarity to model splintering I think, but I don't think it's exactly the same ... for example I don't think it even requires a distributional shift. (Or let me know if you disagree.) I don't see how to import your model splintering ideas into this kind of algorithm more faithfully than that.
Anyway, I agree with "conservatism & asking for advice". I guess I was thinking of conservatism as something like balancing good and bad aspects but weighing the bad aspects more. So maybe "a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y" is actually net undesirable, because the Y outweighs the X, after getting boosted up by the conservatism correction curve.
And as for asking for advice, I was thinking, if you get human feedback about this specific thing, then after you get the advice it would pattern-match 100% to desirable feature Z, and that outweighs everything else.
As for "when advice fails", I do think you ultimately need some kind of corrigibility, but earlier on there could be something like "the algorithm that chooses when to ask questions and what questions to ask does not share the same desires as the algorithm that makes other types of decisions", maybe.
One thing is, I'm skeptical that a deceptive non-in-universe-processing model would be simpler for the same performance. Or at any rate, there's a positive case for the simplicity of deceptive alignment, and I find that case very plausible for RL robots, but I don't think it applies to this situation. The positive case for simplicity of deceptive models for RL robots is something like (IIUC):
The robot is supposed to be really good at manufacturing widgets (for example), and that task requires real-world foresighted planning, because sometimes it needs to substitute different materials, negotiate with suppliers and customers, repair itself, etc. Given that the model definitely needs to have capability of real-world foresighted planning and self-awareness and so on, the simplest high-performing model is plausibly one that applies those capabilities towards a maximally simple goal, like "making its camera pixels all white" or whatever, and then that preserves performance because of instrumental convergence.
(Correct me if I'm misunderstanding!)
If that's the argument, it seems not to apply here, because this task doesn't require real-world foresighted planning.
I expect that a model that can't do any real-world planning at all would be simpler than a model that can. In the RL robot example, it doesn't matter, because a model that can't do any real-world planning at all would do terribly on the objective, so who cares if it's simpler. But here, it would be equally good at the objective, I think, and simpler.
(A possible objection would be: "real-world foresighted planning" isn't a separate thing that adds to model complexity, instead it naturally falls out of other capabilities that are necessary for postdiction like "building predictive models" and "searching over strategies" and whatnot. I think I would disagree with that objection, but I don't have great certainty here.)
I think it can be simultaneously true that, say:
The second one doesn't stop being true because the first one is also true. They can both be true, right?
In other words, "the model weights are what they are because it's the simplest way to solve the problem" doesn't eliminate other "why" questions about all the details of the model. There's still some story about why the weights (and the resulting processing steps) are what they are—it may be a very complicated story, but there should (I think) still be a fact of the matter about whether that story involves "the algorithm itself having downstream impacts on the future in non-random ways that can't be explained away by the algorithm logic itself or the real-world things upstream of the algorithm". Or something like that, I think.
I think you're misunderstanding (or I am).
I'm trying to make a two step argument:
(1) SGD under such-and-such conditions will lead to a trained model that does exclusively within-universe processing [this step is really just a low-confidence hunch but I'm still happy to discuss and defend it]
(2) trained models that do exclusively within-universe processing are not scary [this step I have much higher confidence in]
If you're going to disagree with (2), then SGD / "what the model was selected" for is not relevant.
"Doing exclusively within-universe processing" is a property of the internals of the trained model, not just the input-output behavior. If running the trained model involves a billion low-level GPU instructions, this property would correspond to the claim that each and every one of those billion GPU instructions is being executed for reasons that are unrelated to any anticipated downstream real-world consequences of that GPU instruction. (where "real world" = everything except the future processing steps inside the algorithm itself.)
The kind of incentive argument I'm trying to make here is "If the model isn't doing X, then by doing X a little bit it will score better on the objective, and by doing X more it will score even better on the objective, etc. etc." That's what I mean by "X is incentivized". (Or more generally, that gradient descent systematically tends to lead to trained models that do X.) I guess my description in the article was not great.
So in general, I think deceptive alignment is "incentivized" in this sense. I think that, in the RL scenarios you talked about in your paper, it's often the case that building a better and better deceptively-aligned mesa-optimizer will progressively increase the score on the objective function.
Then my argument here is that 4th-wall-breaking processing is not incentivized in that sense: if the trained model isn't doing 4th-wall-breaking processing at all right now, I think it does not do any better on the objective by starting to do a little bit of 4th-wall-breaking processing. (At least that's my hunch.)
(I do agree that if a deceptively-aligned mesa-optimizer with a 4th-wall-breaking objective magically appeared as the trained model, it would do well on the objective. I'm arguing instead that SGD is unlikely to create such a thing.)
Oh, I guess you're saying something different: that even a deceptive mesa-optimizer which is entirely doing within-universe processing is nevertheless scary. So that would by definition be an algorithm with the property "no operation in the algorithm is likelier to happen vs not happen specifically because of anticipated downstream chains of causation that pass through things in the real world". So I can say categorically: such an algorithm won't hurt anyone (except by freak accident), won't steal processing resources, won't intervene when I go for the off-switch, etc., right? So I don't see "arbitrarily scary", or scary at all, right? Sorry if I'm confused…
Interesting. Is it fair to say that Mollick's system is relatively more "serial" with fewer parallelisms at the subcortical level, whereas you're proposing a system that's much more "parallel" because there are separate systems doing analogous things at each level? …
Hmm, I guess I'm not really sure what you're referring to.
Apropos of nothing, is there any role for the visual cortex within your system?
If I recall, V1 isn't involved in basal ganglia loops, and some higher-level visual areas might project to striatum as "context" but not as part of basal ganglia loops. (I'm not 100% clear on the anatomy here though; I think the literature is confusing to me partly because it took me a while to realize that rat visual cortex is a lot simpler than primate, I've heard it's kinda like "just V1"). So that's the message of "Is RL Involved in Sensory Processing?": there's no RL in the visual cortex AFAICT. Instead I think there's predictive learning, see for example Randall O'Reilly's model.
I talk in the main article about "proposal selection". I think the cortex is just full of little models that make predictions about other little models, and/or predictions about sensory inputs, and/or (self-fulfilling) "predictions" about motor outputs. And if a model is making wrong predictions, it gets thrown out, and over time it gets outright deleted from the system. (The proposals are models too.) So if you're staring at a dog, you just can't seriously entertain the proposal "I'm going to milk this cow". That model involves a prediction that the thing you're looking at is a cow, and that model in turn is making lower-level predictions about the sensory inputs, and those predictions are being falsified by the actual sensory input, which is a dog not a cow. So the model gets thrown out. It doesn't matter how high reward you would get for milking a cow, it's not on the table as a possible proposal.
I believe I noted that the within-cortex proposal-selection / predictive learning algorithms are important things, but declared them out of scope for this particular post.
The last time I wrote anything about the within-cortex algorithm was I guess last year here. These days I'm more excited by the question of "how might we control neocortex-like algorithms?" rather than "how exactly would a neocortex-like algorithm work?"
I too am puzzled about why some people talk about "mPFC" and others talk about "vmPFC"…
Thanks, that was helpful
I think you're interpreting "prediction" and "postdiction" differently than me.
Like, let's say GPT-3 is being trained to guess the next word of a text. You mask (hide) the next word, have GPT-3 guess it, and then compare the masked word to the guess and make an update.
I think you want to call the guess a "prediction" because from GPT-3's perspective, the revelation of the masked data is something that hasn't happened yet. But I want to call the guess a "postdiction" because the masked data is already "locked in" at the time that the guess is formed. The latter is relevant when we're thinking about incentives to form self-fulfilling prophecies.
Incidentally, to be clear, people absolutely do make real predictions constantly. I'm just saying we don't train on those predictions. I'm saying that by the time the model update occurs, the predictions have already been transmuted into postdictions, because the thing-that-was-predicted has now already been "locked in".
(Sorry if I'm misunderstanding.)