In other words, the very essence of intelligence is coming up with new ideas, and that’s exactly where the value function is most out on a limb and prone to error.
But what exactly are new ideas? It could be the case that intelligence is pattern-matching at its most granular level, even for "novelties". What could come in handy here is a good flagging mechanism for detecting when the model is out-of-distribution. However, this could come at its own cost.
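To make the flagging idea a bit more concrete, here is a toy sketch of my own (not something from the post): it flags an input as out-of-distribution when it lies more than a few standard deviations from the training data. The names `fit_stats` and `is_ood` and the `threshold` of 3.0 are assumptions chosen for illustration; a real system would need something far richer than a one-dimensional z-score.

```python
import math

def fit_stats(samples):
    """Return the mean and (population) standard deviation of the training samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(var)

def is_ood(x, mean, std, threshold=3.0):
    """Flag x when it is more than `threshold` standard deviations from the mean."""
    if std == 0:
        # Degenerate case: all training samples were identical.
        return x != mean
    return abs(x - mean) / std > threshold

# Hypothetical training data, for illustration only.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std = fit_stats(train)

flag_far = is_ood(10.0, mean, std)   # far from anything seen in training
flag_near = is_ood(3.5, mean, std)   # well inside the training range
```

The "cost" shows up in the threshold: set it low and the flag fires on many harmless inputs, set it high and genuinely novel inputs slip through.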
It gets even worse if a self-reflective AGI is motivated to deliberately cause credit assignment failures.
Is the use of "deliberately" here trying to account for the *thinking about its own thoughts*-part of going back and forth between thought generator and thought assessor?
“A year before you first met your current boyfriend (or first saw him, or first become aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.
Okay, now I get the point of "neither like nor dislike" in your original statement.
I was originally thinking of something like the following: "A year before you met your current boyfriend, would you have thought he was cute, if he was your type?". But "your type" requires seeing them first, to get a reference point for whether they belong in that class or not. So that was a circular statement of my own, now straightened out; you had a good point here.
That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.
I would say the strategic behavior AlphaZero exhibits is weak (still incredible, specifically with the kind of weird h4 luft lines that the latest supercomputers show). I was thinking of a stronger version dealing with multi-agent environments, continuous state/action spaces, and/or multi-objective reward functions. That said, it seems to me that a different problem has to be solved to get the solution to this.
I liked the painting metaphor, and the diagram of brain-like AGI motivation!
Got a couple of questions below.
It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that nameless pattern.
I agree that if you haven't seen something, then it's not exactly a part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your "like" region? (It was somewhere close to things you liked.)
For example, say someone enjoys the affection of cat species A and B. Say they haven't experienced a cat of species C, which is similar in some way to species A and B. Then they would probably get a positive reward (affection) from meeting cat C, even though their world-model didn't include it beforehand. Therefore, they should tell us afterwards that in their previous world-model, cat C should have been in the "like cat" region.
Similarly, you can conceptualize a single future state of the world in many different ways, e.g. by attending to different aspects of it, and it will thereby become more or less appealing. This can lead to circular preferences; I put an example in this footnote.
Could it be that intelligent machines have circular preferences? I understand that is the case for humans, but I'm curious how nuanced the answer for machines is.
Imperfect data/architecture/training algorithms could lead to weird types of thinking when employed OOD. Do you think it would be helpful to try to measure the coherence of the system's actions/thoughts? E.g. make datasets that inspect the agent's theory of mind (I think Beth Barnes suggested something like this). I am unsure about what these metrics would imply for AGI safety.
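As a rough illustration of what one very simple coherence check might look like (my own sketch, not a method from the post): if we elicit pairwise preferences from an agent and store them as a directed graph, a circular preference is just a cycle in that graph, which a depth-first search can find. The function name `find_preference_cycle` and the dictionary format are assumptions made for this example.

```python
def find_preference_cycle(prefers):
    """Return one preference cycle as a list of options, or None if there is none.

    `prefers` maps each option to the options it is preferred over,
    e.g. {"A": ["B"]} means "A is preferred over B".
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in prefers.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(prefers):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# Hypothetical elicited preferences, for illustration only.
circular = {"A": ["B"], "B": ["C"], "C": ["A"]}
consistent = {"A": ["B", "C"], "B": ["C"]}

cycle_found = find_preference_cycle(circular)
no_cycle = find_preference_cycle(consistent)
```

This only catches inconsistency among preferences that were actually elicited, and only under one fixed framing of each option; the post's point that the same future state can be conceptualized in different ways is exactly what a check like this would miss.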
Namely: It seems to me that there is not a distinction between instrumental and final preferences baked deeply into brain algorithms. If you think a thought, and your Steering Subsystem endorses it as a high-value thought, I think the computation looks the same if it’s a high-value thought for instrumental reasons, versus a high-value thought for final reasons.
The answer to this should depend on the size of the space that the optimization algorithm searches over.
It could be the case that the space of possible outcomes for final preferences is smaller than that of instrumental ones, and thus we could afford a different optimization algorithm (or variant thereof).
Also, if instrumental/final preferences were to be mixed together, should we not have been able to encode e.g. strategic behavior (final preference) in RL agents by now?