[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

[-]Joe Kwon4y30

Really appreciated this post and I'm especially excited for post 13 now! In the past month or two, I've been thinking about stuff like "I crave chocolate" and "I should abstain from eating chocolate" as being a result of two independent value systems (one whose policy was shaped by evolutionary pressure and one whose policy is... idk vaguely "higher order" stuff where you will endure higher states of cortisol to contribute to society or something).

I'm starting to lean away from this a little bit, and I think reading this post gave me a good idea of what your thoughts are, but it'd be really nice to get confirmation (and maybe clarification). Let me know if I should just wait for post 13. My prediction is that you believe there is a single (not dual) generator of human values, which are essentially moderated at the neurochemical level, like "level of dopamine/serotonin/cortisol". And yet, this same generator, due to our sufficiently complex "thought generator", can produce plans and thoughts such as "I should abstain from eating chocolate" even though it would be a dopamine hit in the short-term, because it can simulate forward much further down the timeline, and believes that the overall neurochemical feedback will be better than caving into eating chocolate, on a longer time horizon. Is this correct?

If so, do you believe that because social/multi-agent navigation was essential to human evolution, the policy was heavily shaped by social world related pressures, which means that even when you abstain from the chocolate, or endure pain and suffering for a "heroic" act, in the end, this can all still be attributed to the same system/generator that also sometimes has you eat sugary but unhealthy foods?

Given my angle on attempting to contribute to AI Alignment is doing stuff to better elucidate what "human values" even is, I feel like I should try to resolve the competing ideas I've absorbed from LessWrong: 2 distinct value systems vs. singular generator of values. This post was a big step for me in understanding how the latter idea can be coherent with the apparent contradictions between hedonistic and higher-level values.

[-]Steven Byrnes4y30

Thanks!

Right, I think there's one reward function (well, one reward function that's relevant for this discussion), and that for every thought we think, we're thinking it because it's rewarding to do so—or at least, more rewarding than alternative thoughts. Sometimes a thought is rewarding because it involves feeling good now, sometimes it's rewarding because it involves an expectation of feeling good in the distant future, sometimes it's rewarding because it involves an expectation that it will make your beloved friend feel good, sometimes it's rewarding because it involves an expectation that it will make your admired in-group members very impressed with you, etc.

I think that the thing that gets rewarded is thoughts / plans, not just actions / states. So we don't have to assume that the Thought Generator is proposing an action that's unrewarding now (going to the gym) in order to get into a more-rewarding state later on (being ripped). Instead, the Thought Generator can generate one thought right now, “I'm gonna go to the gym so that I can get ripped”. That one thought can be rewarding right now, because the “…so that I can get ripped” is right there in the thought, providing evidence to the brainstem that the thought should be rewarded, and that evidence can plausibly outweigh the countervailing evidence from the “I'm gonna go to the gym…” part of the thought.

I do think there's still an adjustable parameter in the brain related to time-discounting, even if the details are kinda different than in normal RL. But I don't see a strong connection between that and social instincts. For example, if you abstain from ice cream to avoid a stomach ache, that's a time-discounting thing, but it's not a social-instincts thing. It's possible that social animals in general are genetically wired to time-discount less than non-social animals, but I don't have any particular reason to expect that to be the case. Or, maybe humans in particular are genetically wired to time-discount less than other animals, I don't know, but if that's true, I still wouldn't expect that it has to do with humans being social; rather I would assume that it evolved because humans are smarter, and therefore human plans are unusually likely to work out as predicted, compared to other animals.

I think social instincts come from having things in the innate reward function that track “having high status in my in-group” and “being treated fairly” and “getting revenge” and so on. (To a first approximation.) Post #13(ish) will be a (hopefully) improved and updated version of this discussion of how such things might get actually incorporated into the reward function, given the difficulties related to symbol-grounding. You might also be interested in my post (Brainstem, Neocortex) ≠ (Base Motivations, Honorable Motivations).

Hope this helps, happy to talk more, either here or by phone if you think that would help. :)

[-]Nathan Helm-Burger2y10

I recommend this paper on the subject for additional reading:

The basal ganglia select the expected sensory input used for predictive coding

[-]Rafael Harth4y00

(Extremely speculative comment, please tell me if this is nonsense.)

If it makes sense to differentiate the "Thought Generator" and "Thought Assessor" as two separate modules, is it possible to draw a parallel to language models, which seem to have strong ability to generate sentences, but lack the ability to assess if they are good?

My first reaction to this is "obviously not since the architecture is completely different, so why would they map onto each other?", but a possible answer could be "well if the brain has them as separate modules, it could mean that the two tasks require different solutions, and if one is much harder than the other, and the harder one is the assess module, that could mean language models would naturally solve just the generation first".

The related thing that I find interesting is that, a priori, it's not at all obvious that you'd have these two different modules at all (since the thought generator already receives ground truth feedback). Does this mean the distinction is deeply meaningful? Well, that depends on close to optimal the [design of the human brain] is.

[-]Steven Byrnes4y*10

If it makes sense to differentiate the "Thought Generator" and "Thought Assessor" as two separate modules, is it possible to draw a parallel to language models, which seem to have strong ability to generate sentences, but lack the ability to assess if they are good?

Hmm. An algorithm trained to reproduce human output is presumably being trained to imitate the input-output behavior of the whole system including Thought Generator and Thought Assessor and Steering Subsystem.

I’m trying to imagine deleting the Thought Assessor & Steering Subsystem, and replacing them with a constant positive RPE (i.e., “this is good, keep going”) signal. I think what you get is a person talking with no inhibitions whatsoever. Language models don’t match that.

a priori, it's not at all obvious that you'd have these two different modules at all (since the thought generator already receives ground truth feedback)

I think it’s a necessary design feature for good performance. Think of it this way. I’m evolution. Here are two tasks I want to solve:

(A) estimate how much a particular course-of-action advances inclusive genetic fitness,
(B) find courses of action that get a good score according to (A).

(B) obviously benefits from incorporating a learning algorithm. And if you think about it, (A) benefits from incorporating a learning algorithm as well. But the learning algorithms involved in (A) and (B) are fundamentally working at cross-purposes. (A) is the critic, (B) is the actor. They need separate training signals and separate update rules. If you try to blend them together in an end-to-end way, you just get wireheading. (I.e., if the (B) learning algorithm had free reign to update the parameters in (A), using the same update rule as (B), then it would immediately set (A) to return infinity all the time, i.e. wirehead.) (Humans can wirehead to some extent (see Post #9) but we need to explain why it doesn't happen universally and permanently within five minutes of birth.)

[-]Rafael Harth4y10

I think what you get is a person talking with no inhibitions whatsoever. Language models don’t match that.

What do you picture a language model with no inhibitions to look like? Because if I try to imagine it, then "something that outputs reasonable sounding text until sooner or later it fails hard" seems to be a decent fit. Of course haven't thought much about the generator/assessor distinction.

I mean, surely "inhibitions" of the language model don't map onto human inhibitions, right? Like, a language model without the assessor module (or a much worse assessor module) is just as likely to be imitate someone who sounds unrealistically careful as someone who has no restraints.

I find your last paragraph convincing, but that of course makes me put more credence into the theory rather than less.

[-]gwern4y30

An analogy that comes to mind is sociopathy. Closely linked to fear/reward insensitivity and impulsivity. Something you see a lot in case studies of diagnosed or accounts of people who look obviously like sociopaths is that they will be going along just fine, very competent and intelligent seeming, getting away with everything, until they suddenly do something which is just reckless, pointless, useless and no sane person could possibly think they'd get away with it. Why did they do X, which caused the whole house of cards to come tumbling down and is why you are now reading this book or longform investigative piece about them? No reason. They just sorta felt like it. The impulse just came to them. Like jumping off a bridge.

[-]Steven Byrnes4y20

Huh. I would have invoked a different disorder.

I think that if we replace the Thought Assessor & Steering Subsystem with the function “RPE = +∞ (regardless of what's going on)”, the result is a manic episode, and if we replace it with the function “RPE = -∞ (regardless of what's going on)”, the result is a depressive episode.

In other words, the manic episode would be kinda like the brainstem saying “Whatever thought you're thinking right now is a great thought! Whatever you're planning is an awesome plan! Go forth and carry that plan out with gusto!!!!” And the depressive episode would be kinda like the brainstem saying “Whatever thought you're thinking right now is a terrible thought. Stop thinking that thought! Think about anything else! Heck, think about nothing whatsoever! Please, anything but that thought!”

My thoughts about sociopathy are here. Sociopaths can be impulsive (like everyone), but it doesn't strike me as a central characteristic, as it is in mania. I think there might sometimes be situations where a sociopath does X, and onlookers characterize it as impulsive, but in fact it's just what the sociopath wanted to do, all things considered, stemming from different preferences / different reward function. For example, my impression is that sociopaths get very bored very easily, and will do something that seems crazy and inexplicable from a neurotypical perspective, but seems a good way to alleviate boredom from their own perspective.

(Epistemic status: Very much not an expert on mania or depression, I've just read a couple papers. I've read a larger number of books and papers on sociopathy / psychopathy (which think are synonyms?), plus there were two sociopaths in my life that I got to know reasonably well, unfortunately. More of my comments about depression here.)

[-]Rafael Harth4y10

Do you think this describes language models?

[-]gwern4y20

'Insensitivity to reward or punishment' does sound relevant...

[-]Rafael Harth4y10

Yes, but I didn't mean to ask whether it's relevant, I meant to ask whether it's accurate. Does the output of language models, in fact, feel like this? Seemed like something relevant to ask you since you've seen lots of text completions.

And if it does, what is the reason for not having long timelines? If neural networks only solved the easy part of the problem, that implies that they're a much smaller step toward AGI than many argued recently.

[-]gwern4y20

I said it was an analogy. You were discussing what intelligent human-level entities with inhibition control problems would hypothetically look like; well, as it happens, we do have such entities, in the form of sociopaths, and as it happens, they do not simply explode in every direction due to lacking inhibitions but often perform at high levels manipulating other humans until suddenly then they explode. This is proof of concept that you can naturally get such streaky performance without any kind of exotic setup or design. Seems relevant to mention.

^{^}

As in the previous post, the term “ground truth” here is a bit misleading, because sometimes the Steering Subsystem will just defer to the Thought Assessors.

^{^}

“Re-roll” was originally a board game term; for example, if you previously rolled the dice to randomly generate a character, then “re-roll for a new character” would mean “you roll the dice again to randomly generate a different character”.

^{^}

As in the previous post, I don’t really believe there is a pure dichotomy between “defer-to-predictor mode” and “override mode”. In reality, I’d bet that the Steering Subsystem can partly-but-not-entirely defer to the Thought Assessor, e.g. by taking a weighted average between the Thought Assessor and some other independent calculation.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

26

[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL

26

6.1 Post summary / Table of contents

6.2 Big picture

6.2.1 Relation to “two subsystems”

6.2.2 Quick run-through

6.3 The “Thought Generator”

6.3.1 Overview

6.3.2 Thought Generator inputs

6.3.3 Thought Generator outputs

6.4 Valence, value, and rewards

6.4.1 Teaser for my “Valence” series

6.4.2 The striatum (“learned value function” / “critic”) guesses the valence, but the Steering Subsystem may choose to override

6.5 Decisions involve not only simultaneous but also sequential comparisons of value

6.5.1 Made-up example of what comparison-of-sequential-thoughts might look like in a simpler animal

6.5.2 Comparison-of-sequential-thoughts: why it’s necessary

6.5.3 Comparison-of-sequential-thoughts: how it might have evolved

6.6 Common misconceptions

6.6.1 The distinction between internalized ego-syntonic desires and externalized ego-dystonic urges is unrelated to Learning Subsystem vs. Steering Subsystem

6.6.1.1 The explanation I like

6.6.1.2 The explanation I don’t like

6.6.2 The Learning Subsystem and Steering Subsystem are not two agents

Changelog