The verdict that knowledge is purely a property of configurations cannot be naively generalized from real life to GPT simulations, because "physics" and "configurations" play different roles in the two (as I'll address in the next post). The parable of the two tests, however, literally pertains to GPT. People have a tendency to draw erroneous global conclusions about GPT from behaviors which are in fact prompt-contingent, and consequently there is a pattern of constant discoveries that GPT-3 exceeds previously measured capabilities given alternate conditio

... (read more)

This kind of comment ("this precise part had this precise effect on me") is a really valuable form of feedback that I'd love to get (and will try to give) more often. Thanks! It's particularly interesting because someone gave feedback on a draft that the business about simulated test-takers seemed unnecessary and made things more confusing.

Since you mentioned it, I'm going to ramble on about some additional nuance on this point.

Here's an intuition pump which strongly discourages "fundamental attribution error" to the simulator:

Imagine a machine where you feed... (read more)

I don't know, I think of the brain as doing credit assignment pretty well, but we may have quite different definitions of good and bad. Is there an example you were thinking of?

Say that the triggers for pleasure are hardwired. After a pleasurable event, how do only the computations that led to the pleasure get strengthened, and not whatever else happened to be running in the brain at the time? After all, the pleasure circuit is hardwired, and can't reason causally about what thoughts led to what outcomes.

(I'm not currently confident that pleasure is exactly the ... (read more)

There's a further question, "How do people behave when they're given more power over and understanding of their internal cognitive structures?", which could actually resolve to "People collapse onto one part of their values." I just think it won't resolve that way.

And if it looks like this comes in hindsight by carefully reflecting on the situation, that's not without reinforcement. Your thoughts are scored against whatever it is that the brainstem is evaluating. And same as above, sooner or later, you stumble into some thoughts where the pattern is more clearly attributable, and then the weights change.

Maybe. But your subcortical reinforcement circuitry cannot (easily) score your thoughts. What it can score are the mystery computations that led to hardcoded reinforcement triggers, like sugar molecules interfacing

I mean scoring thoughts in the sense of "[Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering", with what Steven calls "Thought Assessors". Thoughts totally get scored in that sense.