For example, one might assess whether the current thought will lead to satisfying the AGI’s curiosity drive, another its altruism drive, etc. The Steering Subsystem combines these into an aggregate reward. But the function that it uses to do so is a hardcoded, human-legible function—e.g., it might be as simple as a weighted average.
Huh. I wonder how much human variation in personality is basically intraspecies variation in the weightings on the reward function in the steering subsystem.
It sure seems like the Big Five factors could correspond to higher or lower weightings on, or higher or lower sensitivities to, different kinds of hard-coded reward signals.
Openness - Insight or curiosity, maybe counterbalanced by disgust or confusion or uncertainty
Extraversion - Positive social reinforcement (maybe counterbalanced by negative social reinforcement?)
Neuroticism - Fear or anxiety
Agreeableness - Some different kind of social reinforcement?
Conscientiousness - ??
It sure seems like different mixes of priorities on similar baskets of reward signals would shape behavior in importantly different ways, which would manifest at a macro level as personality characteristics.
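To make that speculation concrete, here's a toy sketch (entirely my own invention; the signal names, weights, and numbers are made up, not from the post) of how different weightings over the same hard-coded reward signals could cash out as different personalities:

```python
# Toy illustration of the speculation above: the same hard-coded reward
# signals, combined with different weightings, produce different aggregate
# rewards -- i.e. different "personalities" pushing on the same machinery.

REWARD_SIGNALS = ["curiosity", "social_approval", "fear", "disgust"]

def aggregate_reward(signal_values, weights):
    """Hardcoded, human-legible combination rule: a simple weighted average."""
    total_weight = sum(weights[s] for s in REWARD_SIGNALS)
    return sum(weights[s] * signal_values[s] for s in REWARD_SIGNALS) / total_weight

# Two hypothetical weightings over the same signals.
high_openness    = {"curiosity": 2.0, "social_approval": 1.0, "fear": 1.0, "disgust": 0.5}
high_neuroticism = {"curiosity": 1.0, "social_approval": 1.0, "fear": 2.5, "disgust": 1.0}

# The same momentary signal values (exploring somewhere new and a bit scary)...
situation = {"curiosity": 0.8, "social_approval": 0.0, "fear": -0.6, "disgust": -0.1}

# ...yield different aggregate rewards, and hence different behavior.
print(aggregate_reward(situation, high_openness))     # positive-ish: keep exploring
print(aggregate_reward(situation, high_neuroticism))  # negative-ish: back off
```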
When you’re driving a car, it is a critically important safety requirement that when you turn the steering wheel, the wheels respond instantaneously. By the same token, I expect that it will be a critically important safety requirement for humans to be able to change an AGI’s deepest desires instantaneously when we press the appropriate button. So I think this is an awesome feature, and I’m happy to have it, even if I’m not 100% sure exactly what to do with it. (In a car, you can see where you’re going, whereas understanding what the AGI is trying to do at any given moment is much more fraught.)
Is there an additional claim that the AI will not exhibit the standard problems of corrigibility (ie it won't stop you from changing its reward function) because it's not natively an expected utility maximizer?
That is, even though the AI's generative world model knows that it will fail to accomplish its current goals if the reward function is changed, the thought assessors haven't had the opportunity to learn that, and don't know it? The AI understands this "intellectually" but not "viscerally"? It's like the common human relationship with cocaine?
...
That doesn't seem right to me. If the AI is playing a video game, and it is about to be attacked by a powerful adversary that will bring its hit points to 0, it will correctly identify that this upcoming event will harm its goals, and take action to prevent it.
The AI doesn't need to have experienced "losing all its hit points from this particular enemy" or even "losing all its hit points" for the thought assessors to give high valence to thoughts/plans to prevent that bad outcome. It just has to assign high valence to the concept of winning the game, and it will assign high valence to harm-preventing actions, right?
This is disanalogous to the case of humans deciding not to wirehead with drugs, because reward is not the optimization target. But "concepts that the thought assessors have awarded high valence" are the optimization target (more or less). Finding out that something will directly impact your reward is not motivating. Finding out that something will impact one of your high-valence goals is motivating.
This makes it seem like even if real-time steering is mechanistically afforded by the AGI design, the AGI will take steps to prevent (most) alterations to its reward function in its steering subsystem, by default.
It's only "most" alterations to its reward function, because some alterations to the reward function will increase the AGI's effectiveness at accomplishing its high-valence-concept goals. And noticing that, thoughts/plans to modify the reward function accordingly will be awarded high valence. The AGI will want to self-modify in ways that support its existing high-valence goals, but not to be modified (by anyone) in ways that don't support its existing high-valence goals.
I think it’s literally true, although I could be wrong, happy to discuss.
I know a lot less neuroscience than you do, and so I'm happy to take your word for it.
It sounds like you're saying there are two different memory systems, and they effectively store duplicates of ~all the memories that the second system stores (since the first system stores additional memories that aren't stored by the second system)?
I would have guessed that Rohypnol works by preventing the whole cortex (including HC) from editing synapses. But that’s just a guess. Do you have a reason to think otherwise?
Just that people's short term memory is not impaired by Rohypnol. There are other side effects, like impaired judgement, but I believe (and Claude confirms to me) that you'll still be able to remember someone's name from a couple of minutes ago—long enough that you couldn't have been holding it in working memory.
(I had guessed that that was because they were being represented in the hippocampus but were blocked from being encoded in the (iso)cortex. But it sounds like you don't think that's what's going on, and I have no reason to doubt you.)
However, if the mechanism of action of Rohypnol is preventing updates to the synapses, as you suggest, that implies that short term memories are not represented by synaptic weights.
The basic point that it's possible to disrupt long term memory formation without disrupting short term memory function seems to suggest that they're separate processes. That at least makes it less definitive that remembering something from 60 seconds ago is an example of online learning in the sense of "permanently and rapidly changing your world model on the fly". It might be that, or it might be more like writing to some kind of temporary memory medium and then later training on that memory.
If you introduce yourself to me as “Fred”, and then 60 seconds later I refer to you as “Fred”, then I can thank online learning for putting that bit of knowledge into my brain.
I don't think this is literally true? Do short term memories literally change the synaptic weights in the brain?
I thought that memory formation happens in the hippocampi, and then the memory traces gradually "migrate" to the cortex. And if you give people drugs like Rohypnol you'll block that process, and so they won't form long term memories. But they'll still be able to form and use short term memories, like the name of a person they were introduced to 60 seconds ago.
It seems like, with humans, short term memory is more like in context learning (though maybe that's also a bad analogy for what the hippocampi are doing) and then there's some other process (occurring mostly during sleep?) that consolidates those short term memories: encoding them as synaptic weights in the cortex.
Which is to say that the training is not entirely online. It's semi-episode-like, but with an added mechanism that uses short term learning as training data for another episode of training (or something).
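For what it's worth, here's the cartoon I have in mind for that two-process picture (a sketch of my analogy, not a claim about the actual neural mechanisms; all names here are mine):

```python
# Toy sketch: new facts go into a fast temporary store, and a separate, slower
# "consolidation" step later writes them into the long-term store. Blocking
# consolidation (the Rohypnol case) would leave the temporary store working,
# but nothing would persist.

class TwoStageMemory:
    def __init__(self, consolidation_blocked=False):
        self.short_term = []      # fast, temporary (hippocampus-like?)
        self.long_term = {}       # slow, persistent (cortex-like?)
        self.consolidation_blocked = consolidation_blocked

    def experience(self, key, value):
        self.short_term.append((key, value))   # available immediately

    def recall(self, key):
        for k, v in reversed(self.short_term):
            if k == key:
                return v                       # "Fred", 60 seconds later
        return self.long_term.get(key)

    def consolidate(self):                     # e.g. during sleep
        if not self.consolidation_blocked:
            self.long_term.update(dict(self.short_term))
        self.short_term.clear()

m = TwoStageMemory(consolidation_blocked=True)
m.experience("name_of_new_acquaintance", "Fred")
print(m.recall("name_of_new_acquaintance"))    # works: short-term recall intact
m.consolidate()
print(m.recall("name_of_new_acquaintance"))    # None: never made it to long-term
```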
I get that this post is skipping over a lot of the mechanistic details, but I'll observe that there are important behavioral differences between the two hypotheses...
1. The thought assessors award valence to a thought in proportion to how strongly it activates a high-yumminess concept.
2. The thought assessors award valence to a thought in proportion to their estimate (or the cortex's own estimate?) of how likely the thought is to lead to actually experiencing the high-yumminess experience.
...because those two things come apart. There are some thoughts that will strongly activate a concept of an experience, but have low probability of leading to that experience, and vice versa.
As an example, I would bet that vividly imagining eating Prinsesstårta cake activates the concept "eating Prinsesstårta cake" much more than going to Google Maps, so that you can find an ATM, so that you can get cash, so that you can go to the store and buy eggs and milk, so that you can bake a Prinsesstårta cake, so you can eat it. But the second thought seems more likely to result in you actually eating the cake!
Though, in practice, it seems like both kinds of behavior (vividly fantasizing and making multi-step plans) occur some of the time, so it might be both, or something similar enough to both.
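A toy sketch of how the two hypotheses come apart on the cake example (my own construction; the numbers are invented for illustration):

```python
# Hypothesis 1: valence tracks how strongly the thought activates the
# high-yumminess concept, regardless of whether it leads anywhere.
def valence_by_activation(thought):
    return thought["concept_activation"]

# Hypothesis 2: valence tracks the estimated probability that the thought
# actually leads to the high-yumminess experience.
def valence_by_expected_outcome(thought):
    return thought["p_outcome"]

# Thought A: vividly imagining eating the cake.
# Thought B: the boring multi-step plan (ATM -> store -> bake -> eat).
thoughts = {
    "vivid_fantasy":   {"concept_activation": 0.9, "p_outcome": 0.05},
    "multi_step_plan": {"concept_activation": 0.2, "p_outcome": 0.60},
}

for name, t in thoughts.items():
    print(name, valence_by_activation(t), valence_by_expected_outcome(t))
# Hypothesis 1 favors the fantasy; hypothesis 2 favors the plan.
```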
I want to get more clarity on what you mean by a "thought", which processes you're claiming are thoughts, and which are non-thought operations that select the next thought to think. And (if this question carves reality at the joints) which you think is conscious processing and which you think isn't.
(I want to be clear that I'm referring to your concept of "thought" here. We might replace "thought" with "thought_Byrne".)
Sequential "thoughts"?
It sounds like you're using the word "thought" to refer specifically to the cognitive content that is occupying the limited capacity of the "global workspace" (or something like a global workspace). Thoughts utilize the whole cortex, so you can't have more than one at once, so they're necessarily sequential. They're like CPU operations.
In my view, thoughts are complicated. To think the thought “I will go to the café”, you’re not just activating some tiny cluster of dedicated go-to-the-café neurons. Instead, it’s a distributed pattern involving practically every part of the cortex. You can’t simultaneously think “I will go to the café” and “I will go to the gym”, because they would involve different activity patterns of the same pools of neurons. They would cross-talk. Thus, the only possibility is thinking the thoughts in sequence.
Are "thoughts" always conscious (ie available to verbal reports)?
Is everything that's available to verbal report a thought?
Or does it become a "thought" in the process of attending to it to produce a verbal report? (eg I'm conscious of various elements of my visual field, like the black border around my MacBook screen, or the open door behind my screen, but they only become "thoughts" when I start to pay attention to them, which is necessary for making verbal reports about them?)
Parallel proto-thoughts?
But it also sounds like you're saying that there can be a plethora of competing urges, or proto-thoughts, or something, all vying to be anointed as the next thought. (Feel free to suggest better terminology for these, if you have it.)
Each region of the pallium [= lamprey equivalent of cortex] sends a connection to a particular region of the striatum, which (via other parts of the basal ganglia) returns a connection back to the same starting location in the pallium. This means that each region of the pallium is reciprocally connected with the striatum via a specific loop that regulates a particular action…. For example, there’s a loop for tracking prey, a loop for fleeing predators, a loop for anchoring to a rock, and so on. Each region of the pallium is constantly whispering to the striatum to let it trigger its behavior, and the striatum always says “no!” by default. In the appropriate situation, the region’s whisper becomes a shout, and the striatum allows it to use the muscles to execute its action.
I endorse this as part of my model of decision-making, but only part of it. Specifically, this is one of the things that’s happening when the Thought Generator generates a thought. Different simultaneous possibilities are being compared.
Am I correct in calling these examples of "proto-thoughts" and not of "thoughts" (at least until they graduate to "thoughts")?
Is there a reason why these "proto-thoughts" don't have the problem cited above that forces "thoughts" to be sequential? Are proto-thoughts of a different type than "thoughts", such that they don't draw on neurons all over the brain? If not, why can proto-thoughts all be active simultaneously, but "thoughts" can't be?
Additionally, is the implication that these proto-thoughts are necessarily subconscious / pre-conscious? (Because if you were conscious of them, they would (definitionally?) have graduated to being "thoughts"?)
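For concreteness, here's roughly how I'm picturing the loop-gating story from the quoted passage (a toy sketch with made-up action names, numbers, and thresholds, not anything from the post):

```python
# Each pallium region constantly proposes its action at some intensity, all in
# parallel; the striatum says "no" by default, and only a sufficiently loud
# "shout" gets released to the muscles.

GATE_THRESHOLD = 0.8   # made-up number: the striatum's default veto level

def striatum_gate(proposals):
    """proposals: {action_name: intensity}. At most the loudest
    above-threshold proposal is released."""
    action, intensity = max(proposals.items(), key=lambda kv: kv[1])
    return action if intensity > GATE_THRESHOLD else None

# Most of the time every loop is just whispering, and nothing happens.
print(striatum_gate({"track_prey": 0.2, "flee_predator": 0.1, "anchor_to_rock": 0.3}))   # None

# A predator appears: one loop's whisper becomes a shout and gets released.
print(striatum_gate({"track_prey": 0.2, "flee_predator": 0.95, "anchor_to_rock": 0.3}))  # flee_predator
```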
Thoughts or proto-thoughts?
In this example...
Imagine a simple, ancient, little fish swimming along, navigating to the cave where it lives. It gets to a fork in the road, ummm, “fork in the kelp forest”? Its current navigation plan involves continuing left to its cave, but it also has the option of turning right to go to the reef, where it often forages.
Seeing this path to the right, I claim that its navigation algorithm reflexively loads up a plan: “I’m gonna turn right and go to the reef.” Immediately, this new plan is evaluated and compared to the old plan. If the new plan seems worse than the old plan, then the new thought gets shut down, and the old thought (“I’m going to my cave”) promptly reestablishes itself. The fish continues to its cave, as originally planned, without skipping a beat. Whereas if instead the new plan seems better than the old plan, then the new plan gets strengthened, sticks around, and orchestrates motor commands. And thus the fish turns to the right and goes to the reef instead.
...do I understand correctly that all of the activity described is proto-thought activity? You're not talking about what the fish is thinking, you're talking about what's happening under the hood that causes the fish to think the thoughts that it thinks.
Like, I could tell a story:
A fish was swimming towards the cave where it lives. On the way it had the thought "maybe I should go to the reef instead". It considered that plan, and then decided against it, continuing on its way to its cave.
If I understand correctly, that isn't the story you're telling. Rather you're trying to explain the mechanisms that causes the fish to "promote 'the possibility of going to the reef' to attention" in the first place.
Is that right?
(This might turn out to be a silly question in the case of the fish, because maybe fish don't do the mental operation of "considering", the fish brain just selects motor plans directly. But humans definitely do an operation of "considering", which entails some non-thinking mechanism that selects the next thought to think.)
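And here's the cartoon I have for the fish example's under-the-hood compare-and-replace step (again my own sketch, with made-up values):

```python
def evaluate(plan):
    # Stand-in for whatever assessment actually scores the plan; values invented.
    made_up_values = {"go_to_cave": 0.6, "go_to_reef": 0.4}
    return made_up_values[plan]

def resolve(current_plan, candidate_plan):
    # The candidate plan is reflexively loaded and compared; whichever scores
    # higher "sticks around and orchestrates motor commands".
    if evaluate(candidate_plan) > evaluate(current_plan):
        return candidate_plan
    return current_plan

print(resolve("go_to_cave", "go_to_reef"))  # go_to_cave: the fish continues as planned
```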
Two types of “valence” in my model—“real” and “guessed”
The blue-circled signal is the valence guess from the corresponding Thought Assessor in the striatum. The red-circled signal (again, it’s one signal drawn twice) is the corresponding “ground truth” for what the valence guess should have been.
Just like the other “long-term predictors” discussed in the previous post, the Steering Subsystem can choose between “defer-to-predictor mode” and “override mode”. In the former, it sets the red equal to the blue, as if to say “OK, Thought Assessor, sure, I’ll take your word for it”. In the latter, it ignores the Thought Assessor’s proposal, and its own internal circuitry outputs some different value.[3]
Just to be clear, these paragraphs mean that the arrows labeled "Actual valence" are often just a duplicate of the "Valence guess", specifically when the steering system is in "defer-to-predictor" mode. When in that mode, the Steering Subsystem doesn't add any informational content that directs the thought generator, right?
In "defer-to-predictor" mode, all of the informational content that directs thought rerolls is coming from the thought assessors in the Learned-from-Scratch part of the brain, even if if that information is neurologically routed through the steering subsystem?
[Edit: Or is "Actual valence" never just deferring to the predictor, because it's calculated based on the whole scorecard?]
[Edit2: In this post, you say "And I’ll also assume the Steering Subsystem is in defer-to-predictor mode for the valence signal, rather than override mode (see Post #6, §6.4.2).", so I guess not.]
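In pseudocode, my reading of the two modes is something like this (my own sketch, not the post's notation):

```python
def actual_valence(valence_guess, mode, override_value=None):
    if mode == "defer-to-predictor":
        # The Steering Subsystem just echoes the Thought Assessor's guess back
        # as "ground truth", adding no informational content of its own.
        return valence_guess
    else:  # "override" mode
        # The Steering Subsystem's own internal circuitry supplies a different value.
        return override_value
```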
The thing that was most confusing for me was step 5.
I first needed to clarify for myself that because the STP's output was feeding back in as input, it could adopt any value in that stage.
Then I needed to realize that, given that it could take any value, the optimal predictive strategy is to adopt the value of the next predicted override "just to be safe".
(I don't know if that helps you at all.)
...
Actually, this begs the question a bit.
Suppose that an STP is in the self-looping mode at t_0. The next override will arrive at t_10. Also, suppose that the context clues at t_8 are very strongly informative of the timing of the next ground truth injection: it will occur sometime within the next three timesteps, t_9 to t_11.
There's definitely pressure to start predicting the override at t_8, but is there a pressure pushing the STP to start predicting the override all the way back at t_0? Why not have random outputs until t_8, and then switch to predicting the override?
Is it just that there is rarely definitive contextual evidence that allows you to time the ground truth injection, even with a wide interval?
If the STP can precisely bound the timing of the overrides does that break this whole system?
Here's my understanding of the model presented in this post. Please let me know if I'm missing or misunderstanding anything.
The learning subsystem of the brain includes a bunch of short term predictors. These predictors take in information from all over the brain to predict values in the steering subsystem, in particular.
These short term predictors get feedback by sending their predictions down to the steering subsystem, and in return get supervision signals that they can use to learn their policy.
The steering subsystem will often just turn around and send the prediction back as the supervision signal. When this happens, it creates a feedback loop of the predictor predicting itself.
When the short term predictor is in this self-loop mode, it has degrees of freedom about what state it's in. Any prediction that it offers is a "correct answer".
However, the steering subsystem will send a ground truth signal at some point. So the best strategy for a short term predictor in a self-loop mode is to jump to predicting the next override as soon as it can guess what the next override will be.
For instance, as soon as that short term predictor notices a change in context like "you had the thought to go put on a sweater", it immediately starts predicting the positive reward of "feel warm and cozy" from having put on a sweater.
It could self-stably give any prediction. But immediately predicting the reward of putting on a sweater is the best policy. It won't get penalized while it's in the self-loop mode (since whatever the short term predictor outputs is marked correct), but when the ground truth injection does arrive, the short term predictor will already be correctly predicting that ground truth injection.
That is, jumping to predicting the next ground truth injection as soon as the predictor can guess what it will be does no worse than any other possible prediction during the self-loop mode, and it does better than other possible predictions during the steering override mode.
This feedback makes the short-term predictors into long term predictors, because they're effectively learning to predict the next "steering override", based on context. This allows the short term predictors to learn predictive patterns that occur across many different timescales.
The timescale that they learn to predict over depends on the frequency with which the steering subsystem provides ground truth injections: if it's around once a minute, the predictor learns to predict a minute into the future; if it's around once an hour, the predictor learns to predict an hour into the future; and so on. (And if the frequency of ground truth injections is variable, but predictable from brain context, the predictors will learn a policy that predicts over different time horizons depending on that context?)
Those long term predictors can be used as control signals: they can steer behavior based on outcomes that will only occur minutes or hours into the future.
In the sweater example, the brain can use the expectation of feeling warm and cozy, or not, as an indicator that the current behavior is on track for producing the warm and cozy reward.
This setup constructs a long term reward-seeker out of an immediate reward-seeker and a pile of predictors.
This strikes me as the kind of thing that could actually, really, help the situation, if it was excellently executed.
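To check the mechanics of my summary, here's a minimal sketch (my own code and numbers) of why "jump to predicting the next override as soon as context allows" is the winning policy for a predictor whose supervision signal usually just echoes its own output:

```python
# The only timesteps where the predictor can actually be wrong are the
# override timesteps, so predicting the next override early costs nothing
# and pays off when the ground truth injection lands.

def supervision_signal(prediction, override):
    # override is None in defer-to-predictor mode
    return prediction if override is None else override

def error(prediction, override):
    target = supervision_signal(prediction, override)
    return abs(target - prediction)   # zero whenever the loop just echoes

# A stretch of time where the "warm and cozy" ground truth (+1.0) only arrives
# at the last step, after putting on the sweater:
overrides = [None, None, None, None, 1.0]

# Policy A: predict the upcoming override as soon as the "put on a sweater"
# context appears (step 0 here).
policy_a = [1.0, 1.0, 1.0, 1.0, 1.0]
# Policy B: output something arbitrary until the override actually lands.
policy_b = [0.0, 0.0, 0.0, 0.0, 0.0]

print(sum(error(p, o) for p, o in zip(policy_a, overrides)))  # 0.0
print(sum(error(p, o) for p, o in zip(policy_b, overrides)))  # 1.0, penalized only at the override
```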