For example, one might assess whether the current thought will lead to satisfying the AGI’s curiosity drive, another its altruism drive, etc. The Steering Subsystem combines these into an aggregate reward. But the function that it uses to do so is a hardcoded, human-legible function—e.g., it might be as simple as a weighted average.
Huh. I wonder how much human variation in personality is basically intraspecies variation in the weightings on the reward function in the steering subsystem.
It sure seems like the big five factors could correspond to higher or lower weightings on, or higher or lower sensitivities to, different kinds of hard-coded reward signals.
Openness - Insight or curiosity, maybe counterbalanced by disgust or confusion or uncertainty
Extraversion - Positive social reinforcement (maybe counterbalanced by negative social reinforcement?)
Neuroticism - Fear or anxiety
Agreeableness - Some different kind of social reinforcement?
Conscientiousness - ??
It sure seems like different mixes of priorities on similar baskets of reward signals would shape behavior in importantly different ways, which would manifest at a macro level as personality characteristics.
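To make the speculation concrete, here's a toy sketch (all drive names and numbers are mine, purely illustrative) of how the same basket of hard-coded reward signals, under different weightings, would yield different aggregate rewards in the same situation:

```python
# Toy sketch (all names and numbers hypothetical): personality as different
# weightings over the same basket of hard-coded reward signals.

def aggregate_reward(signals, weights):
    """Weighted average of per-drive reward signals, as in the quoted model."""
    total_weight = sum(weights.values())
    return sum(signals[d] * weights[d] for d in signals) / total_weight

# Same situation (same raw signals), two different "personalities":
signals = {"curiosity": 0.9, "social_approval": 0.2, "fear": 0.4}

high_openness = {"curiosity": 3.0, "social_approval": 1.0, "fear": 1.0}
high_neuroticism = {"curiosity": 1.0, "social_approval": 1.0, "fear": 3.0}

print(aggregate_reward(signals, high_openness))    # curiosity dominates the average
print(aggregate_reward(signals, high_neuroticism)) # fear drags the average down
```

The point is just that nothing about the individual signals has to differ between "personalities"; the weighting function alone shifts which situations feel rewarding overall.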
When you’re driving a car, it is a critically important safety requirement that when you turn the steering wheel, the wheels respond instantaneously. By the same token, I expect that it will be a critically important safety requirement for humans to be able to change an AGI’s deepest desires instantaneously when we press the appropriate button. So I think this is an awesome feature, and I’m happy to have it, even if I’m not 100% sure exactly what to do with it. (In a car, you can see where you’re going, whereas understanding what the AGI is trying to do at any given moment is much more fraught.)
Is there an additional claim that the AI will not exhibit the standard problems of corrigibility (i.e., it won't stop you from changing its reward function) because it's not natively an expected utility maximizer?
That is, even though the AI's generative world model knows that it will fail to accomplish its current goals if the reward function is changed, the thought assessors haven't had the opportunity to learn that, and don't know it? The AI understands this "intellectually" but not "viscerally"? It's like the common human relationship with cocaine?
...
That doesn't seem right to me. If the AI is playing a video game, and it is about to be attacked by a powerful adversary that will bring its hit points to 0, it will correctly identify that this upcoming event will harm its goals, and take action to prevent it.
The AI doesn't need to have experienced "losing all its hit points from this particular enemy" or even "losing all its hit points" for the thought assessors to give high valence to thoughts/plans to prevent that bad outcome. It just has to assign high valence to the concept of winning the game, and it will assign high valence to harm-preventing actions, right?
This is disanalogous to the case of humans deciding not to wirehead with drugs, because reward is not the optimization target. But "concepts that the thought assessors have awarded high valence" are the optimization target (more or less). Finding out that something will directly impact your reward is not motivating. Finding out that something will impact one of your high-valence goals is motivating.
This makes it seem like even if realtime steering is mechanistically afforded by the AGI design, the AGI will take steps to prevent (most) alterations to its reward function in its steering subsystem, by default.
It's only "most" alterations to its reward function, because some alterations to the reward function will increase the AGI's effectiveness at accomplishing its high-valence-concept goals. And noticing that, thoughts/plans to modify the reward function accordingly will be awarded high valence. The AGI will want to self-modify in ways that support its existing high valence goals, but not to be modified (by anyone) in ways that don't support its existing high valence goals.
I think it’s literally true, although I could be wrong, happy to discuss.
I know a lot less neuroscience than you do, and so I'm happy to take your word for it.
It sounds like you're saying there are two different memory systems and they effectively store duplicates of ~ all the memories that the second system stores (since the first system stores additional memories that aren't stored by the second system)?
I would have guessed that Rohypnol works by preventing the whole cortex (including HC) from editing synapses. But that’s just a guess. Do you have a reason to think otherwise?
Just that people's short term memory is not impaired by Rohypnol. There are other side effects, like impaired judgement, but I believe (and Claude confirms to me) that you'll still be able to remember someone's name from a couple of minutes ago—long enough that you couldn't have been holding it in working memory.
(I had guessed that that was because they were being represented in the hippocampus but were blocked from being encoded in the (iso)cortex. But it sounds like you don't think that's what's going on, and I have no reason to doubt you.)
However, if the mechanism of action of Rohypnol is preventing updates to the synapses, as you suggest, that implies that short term memories are not represented by synaptic weights.
The basic point that it's possible to disrupt long term memory formation without disrupting short term memory function seems to suggest that they're separate processes. That at least makes it less definitive that remembering something from 60 seconds ago is an example of online learning in the sense of "permanently and rapidly changing your world model on the fly". It might be that, or it might be more like writing to some kind of temporary memory medium and then later training on that memory.
If you introduce yourself to me as “Fred”, and then 60 seconds later I refer to you as “Fred”, then I can thank online learning for putting that bit of knowledge into my brain.
I don't think this is literally true? Do short term memories literally change the synaptic weights in the brain?
I thought that memory formation happens in the hippocampi, and then the memory traces gradually "migrate" to the cortex. And if you give people drugs like Rohypnol you'll block that process, and so they won't form long term memories. But they'll still be able to form and use short term memories, like the name of a person they were introduced to 60 seconds ago.
It seems like, with humans, short term memory is more like in context learning (though maybe that's also a bad analogy for what the hippocampi are doing) and then there's some other process (occurring mostly during sleep?) that consolidates those short term memories: encoding them as synaptic weights in the cortex.
Which is to say that the training is not entirely online. It's semi-episode-like, but with an added mechanism that uses short term learning as training data for another episode of training (or something).
I get that this post is skipping over a lot of the mechanistic details, but I'll observe that there are important behavioral differences between the two hypotheses...
1. The thought assessors award valence to a thought in proportion to how strongly it activates a high-yumminess concept.
2. The thought assessors award valence to a thought in proportion to their estimate (or the cortex's own estimate?) of how likely the thought is to lead to actually experiencing the high-yumminess experience.
...because those two things come apart. There are some thoughts that will strongly activate a concept of an experience, but have low probability of leading to that experience, and vice versa.
As an example, I would bet that vividly imagining eating Prinsesstårta cake activates the concept "eating Prinsesstårta cake" much more than going to Google Maps, so that you can find an ATM, so that you can get cash, so that you can go to the store and buy eggs and milk, so that you can bake a Prinsesstårta cake, so you can eat it. But the second thought seems more likely to result in you actually eating the cake!
Though, in practice, it seems like both kinds of behavior (vivid fantasizing and multi-step planning) occur, some of the time, so it might be both, or something similar enough to both.
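To make the contrast concrete, here's a toy sketch (all activation and probability numbers are made up by me) showing how the two hypotheses can rank the same pair of thoughts oppositely:

```python
# Toy contrast (all numbers hypothetical) between the two hypotheses:
# valence proportional to concept activation, vs. valence proportional to
# the estimated probability of the experience actually occurring.

thoughts = {
    # thought: (activation of "eating cake" concept, estimated P(actually eating cake))
    "vividly imagine eating the cake": (0.9, 0.05),
    "plan the ATM/store/bake sequence": (0.2, 0.60),
}

def valence_by_activation(t):
    return thoughts[t][0]

def valence_by_probability(t):
    return thoughts[t][1]

# The two rules pick different winners from the same pair of thoughts:
winner_activation = max(thoughts, key=valence_by_activation)
winner_probability = max(thoughts, key=valence_by_probability)
print(winner_activation)   # the vivid fantasy wins under hypothesis 1
print(winner_probability)  # the multi-step plan wins under hypothesis 2
```

So which hypothesis holds should be behaviorally observable: hypothesis 1 rewards fantasizing, hypothesis 2 rewards planning.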
The thing that was most confusing for me was step 5.
I first needed to clarify for myself that because the STP's output was feeding back in as input, it could adopt any value at that stage.
Then I needed to realize that, given that it could take any value, the optimal predictive strategy is to adopt the value of the next predicted override "just to be safe".
(I don't know if that helps you at all.)
...
Actually, this begs the question a bit.
Suppose that an STP is in the self-looping mode at t_0. The next override will arrive at t_10. Also, suppose that the context clues at t_8 are very strongly informative of the timing of the next ground truth injection: it will occur sometime within the next three timesteps, t_9 to t_11.
There's definitely pressure to start predicting the override at t_8, but is there a pressure pushing the STP to start predicting the override all the way back at t_0? Why not have random outputs until t_8, and then switch to predicting the override?
Is it just that there is rarely definitive contextual evidence that allows you to time the ground truth injection, even with a wide interval?
If the STP can precisely bound the timing of the overrides does that break this whole system?
Here's my understanding of the model presented in this post. Please let me know if I'm missing or misunderstanding anything.
The learning subsystem of the brain includes a bunch of short term predictors. These predictors take in information from all over the brain to predict values in the steering subsystem, in particular.
These short term predictors get feedback by sending their predictions down to the steering subsystem, and in return get supervision signals that they can use to learn their policy.
The steering subsystem will often just turn around and send the prediction back as the supervision signal. When this happens, it creates a feedback loop of the predictor predicting itself.
When the short term predictor is in this self-loop mode, it has degrees of freedom about what state it's in. Any prediction that it offers is a "correct answer".
However, the steering subsystem will send a ground truth signal at some point. So the best strategy for a short term predictor in a self-loop mode is to jump to predicting the next override as soon as it can guess what the next override will be.
For instance, as soon as that short term predictor notices a change in context like "you had the thought to go put on a sweater", it immediately starts predicting the positive reward of "feel warm and cozy" from having put on a sweater.
It could self-stably give any prediction. But immediately predicting the reward of putting on a sweater is the best policy. It won't get penalized while it's in the self-loop mode (since whatever the short term predictor outputs is marked correct), but when the ground truth injection does arrive, the short term predictor will already be correctly predicting that ground truth injection.
That is, jumping to predicting the next ground truth injection as soon as the predictor can guess what it will be does no worse than any other possible prediction during the self-loop mode, and it does better than other possible predictions during the steering override mode.
This feedback makes the short-term predictors into long term predictors, because they're effectively learning to predict the next "steering override", based on context. This allows the short term predictors to learn predictive patterns that occur across many different timescales.
The timescale that they learn to predict over depends on the frequency with which the steering subsystem provides ground truth injections: if it's around once a minute, the predictor learns to predict a minute into the future; if it's around once an hour, the predictor learns to predict an hour into the future; and so on. (And if the frequency of ground truth injections is variable, but predictable from brain context, the predictors will learn a policy that predicts over different time horizons depending on that context?)
Those long term predictors can be used as control signals: they can steer behavior based on outcomes that will only occur minutes or hours into the future.
In the sweater example, the brain can use the expectation of feeling warm and cozy, or not, as an indicator that the current behavior is on track for producing the warm and cozy reward.
This setup constructs a long term reward-seeker out of an immediate reward-seeker and a pile of predictors.
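A minimal toy sketch of my understanding of that scheme (the cue names, learning rate, and timings are mine, not from the post): in defer-to-predictor mode the predictor's output is fed back as its own supervisory signal, so it gets no corrective gradient; only the occasional ground-truth injection trains it, which drives it toward outputting, for each context, whatever override that context foreshadows.

```python
# Toy sketch (hypothetical, heavily simplified): a short-term predictor whose
# supervisory signal is its own output in "defer-to-predictor" mode and a
# ground-truth override otherwise. The only learning signal comes from the
# override moments, so the predictor ends up emitting the upcoming override
# value as soon as the predictive context appears.

predictions = {"no_cue": 0.0, "sweater_cue": 0.0}  # tabular predictor, one entry per context
LR = 0.5

def step(context, override):
    pred = predictions[context]
    # Supervisory signal: ground truth if overriding, else the prediction itself.
    target = override if override is not None else pred
    predictions[context] += LR * (target - pred)  # update is exactly zero in defer mode

# Each episode: the sweater cue is present for 10 timesteps (defer mode),
# then a warm-and-cozy reward (+1.0) is injected by the steering subsystem.
for _ in range(20):
    for _ in range(10):
        step("sweater_cue", None)   # self-loop: any output counts as "correct"
    step("sweater_cue", 1.0)        # ground-truth injection

print(predictions["sweater_cue"])   # ~1.0: predicts the override as soon as the cue appears
```

The defer-mode steps are no-ops for learning, which is exactly the "any prediction is a correct answer" degree of freedom; the override steps alone pin down what the predictor converges to.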
But the more interesting question is: what was happening during the thirty seconds that it took me to walk upstairs? I evidently had motivation to continue walking, or I would have stopped and turned around. But my brainstem hadn’t gotten any ground truth yet that there were good things happening. That’s where “defer-to-predictor mode” comes in! The brainstem, lacking strong evidence about what’s happening, sees a positive valence guess coming out of the striatum and says, in effect, “OK, sure, whatever, I’ll take your word for it.”
It seems like there's some implication here that motivation and positive valence are the same thing?
Is the claim that evolutionarily early versions of behavioral circuits had approximately the form...

If positive reward:
    continue current behavior
else:
    try something else

...but that adding in long-term predictors instead allows for the following algorithm?

If predicted future reward is positive:
    continue current behavior
else:
    try something else
Perhaps I'm just being dense, but I'm confused why this toy model of a long-term predictor is long-term instead of short term. I'm trying to think through it aloud in this comment.
A “long-term predictor” is ultimately nothing more than a short-term predictor whose output signal helps determine its own supervisory signal. Here’s a toy model of what that can look like:
At first, I thought that the idea was that the latency of the supervisory/error signal was longer than average, and that that latency made the short term predictor function as a long-term predictor, without being any different functionally. But then why is it labeled "short-term predictor"?
It seems like the short-term predictor should learn to predict (based on context cues) the behavior triggered by the hardwired circuitry. But it should predict that behavior only 0.3 seconds early?
...
Oh! Is the key point that there's a kind of resonance, where this system maintains the behavior of the genetically hardwired components? When the switch switches back to defer-to-predictor mode, the short term predictor is still predicting the override hard-wired behavior, which is now trivially "correct", because whatever the predictor outputs is correct. (It was also correct a moment before, when the switch was in override mode, but not trivially correct.)
This still doesn't answer my confusion. It seems like the whole circuit is going to maintain the state from the last "ground truth infusion" and learn to predict the timings and magnitudes of the "ground truth infusions". But it still shouldn't predict them more than 0.3 seconds in advance?
Is the idea that the lookahead propagates earlier and earlier with each cycle? You start with a 0.3 second prediction. But that means that supervisory signal (when in the "defer-to-predictor mode") is 0.3 seconds earlier, which means that the predictor learns to predict the change in output 0.6 seconds ahead of when the override "would have happened", and then 0.9 seconds ahead, and then 1.2 seconds ahead, and so on, until it backs all the way up to when the "prior" ground truth infusion sent a different signal?
Like, the thing that this circuit is doing is simulating time travel, so that it can activate (on average) the next behavior that the genetically hardwired circuitry will output, as soon as "override mode" is turned off?
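If I'm understanding my own guess correctly, it can be sketched as a TD-style toy (timesteps and values are mine, purely illustrative): each timestep's target is the next timestep's prediction, except at the final timestep, where the ground-truth override arrives, so the prediction propagates backward by one step per cycle.

```python
# Toy sketch (hypothetical) of the "lookahead propagates earlier each cycle"
# idea: in defer-to-predictor mode, the supervisory signal at each timestep is
# the prediction at the next timestep; a ground-truth override of 1.0 arrives
# only at the final timestep. Bootstrapping drags the prediction one step
# earlier per sweep, TD-learning style.

T = 10                  # timesteps between the cue and the ground-truth injection
pred = [0.0] * (T + 1)  # the predictor's output at each timestep

def sweep():
    # One pass through the episode: earlier steps copy the next step's prediction.
    for t in range(T):
        pred[t] += 1.0 * (pred[t + 1] - pred[t])  # learning rate 1.0 for clarity
    pred[T] = 1.0  # ground-truth override at the end of the episode

for _ in range(T + 1):  # each sweep moves the prediction one step earlier
    sweep()

print(pred[0])  # 1.0: the override is now predicted all the way back at t=0
```

After the first sweep only the final timestep knows about the override; each later sweep copies it one step earlier, which matches the 0.3 s, 0.6 s, 0.9 s progression I described above.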
As we argued for at the time, training on a purely predictive loss should, even in the limit, give you a predictor, not an agent—and we’ve now seen this stay true even through substantial scaling (though there is still some chance this will break at some point).
Is there anyone who significantly disputes this?
I'm not trying to ask a rhetorical question à la "everyone already thinks this, this isn't an update". I'm trying to ascertain if there's a consensus on this point.
I've understood Eliezer to sometimes assert something like "if you optimize a system for sufficiently good predictive power, a consequentialist agent will fall out, because an agent is actually the best solution to a broad range of prediction tasks."
[Though I want to emphasize that that's my summary, which he might not endorse.]
Does anyone still think that or something like that?