If a mind comes to understand a bunch of stuff, there are probably some compact reasons that it came to understand a bunch of stuff. What could such reasons be? The mind might copy a bunch of understanding from other minds. But if the mind becomes much more capable than the surrounding minds, that's not the reason, assuming that much greater capability requires much more understanding. So it's some other reason. I'm describing this situation as the mind being on a trajectory of creativity.
(Sorry, I didn't get this on two readings. I may or may not try again. Some places I got stuck:
Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever? This still has the basic alignment problem: I don't know how to make the AI very intently try to do X, including where X = pretending really hard in this way.
Or are you rather saying (or maybe this is the same as / a subset of the above?) that the Mask is preventing potential agencies from coalescing / differentiating and empowering themselves with the AI system's capability-pieces, by literally hiding from the potential agencies and therefore blocking their ability to empower themselves?
Anyway, thanks for your thoughts.)
That was one of the examples I had in mind with this post, yeah. (More precisely, I had in mind defenses of HCH being aligned that I heard from people who aren't Paul. I couldn't pass Paul's ITT about HCH or similar.)
Yeah, I think that roughly lines up with my example of "generator of large effects". The reason I'd rather say "generator of large effects" than "trying" is that "large effects" sounds slightly more like something that ought to have a sort of conservation law, compared to "trying". But both our examples are incomplete in that the supposed conservation law (which provides the inquisitive force of "where exactly does your proposal deal with X, which it must deal with somewhere by conservation") isn't made clear.
I don't recall seeing that theory in the first quarter of the book, but I'll look for it later. I somewhat agree with your description of the difference between the theories (at least, as I imagine a predictive processing flavored version). Except, the theories are more similar than you say, in that FIAT would also allow very partial coherentifying, so that it doesn't have to be "follow these goals, but allow these overrides", but can rather be, "make these corrections towards coherence; fill in the free parameters with FIAT goals; leave all the other incoherent behavior the way it is". A difference between the theories (though I don't feel I can pass the PP ITT) is that FIAT allows, you know, agency, as in, non-myopic goal pursuit based on coherent-world-model-building, whereas PP maybe strongly hints against that?
It seems like the thing to do is to look for cases where people pursue their own goals, rather than the goals they would predict they have based on past actions.
I'm confused by this; are these supposed to be mutually exclusive? What's "their own goals"? [After thinking more: Oh like you're saying, here's what it would look like to have a goal that can't be explained as a FIAT goal? I'll assume that in the rest of this comment.]
It needs to be complex enough to not plausibly be a reflex/instinct.
Agreed.
A sort of plausible example is courtship. It's complex, it can't easily be inferred from previous things you did (not the first time you do it, that is), and it agentically orients toward a goal.
I'm not sure I buy that it can't be inferred, even the first time. Maybe you have fairly built-in instincts that aren't about the whole courtship thing, but cause you to feel good when you're around someone. So you seek being around them, and pay attention to them. You try to get them interested in being around you. This builds up the picture of a goal of being together for a long time. (This is a pretty poor explanation as stated; if it works, why wouldn't you just randomly fall in love with anyone you do a favor for? Still, this is why it's at least plausible to me that the behavior could come from a FIAT-like thing. And maybe that's actually the case with homosexual intercourse in the 1800s.)
The problem is, I think it's well-explained as imitation - "I'm a person; the people around me do this and seem really into it; so I infer that I'm really into it too".
Maybe courtship is especially like this, but in general, things sort-of-well-explainable as imitation seem like admissible falsifications of FIAT, e.g. if there are also pressures against the behavior.
Thanks. Your comments make sense to me, I think. But these essays are more like research notes than attempts at good exposition, so I'm not necessarily trying to consistently make them accessible. I'll add a note to that effect in future.
Yeah, that could produce an example of Doppelgängers. E.g. if an autist (in your theory) later starts using that machinery more heavily. Then there's the models coming from the general-purpose analysis, and the models coming from the intuitive machinery, and they're about the same thing.
An interesting question I don't know the answer to is whether you get more cognitive empathy past the point where human psychological development seems to stop.
Why isn't the answer obviously "yes"? What would it look like for this not to be the case? (I'm generally somewhat skeptical of descriptions like "just faster" if the faster is like multiple orders of magnitude and sure seems to result from new ideas rather than just a bigger computer.)
Yes, I think there's stuff that humans do that's crucial for what makes us smart, that we have to do in order to perform some language tasks, and that the LLM doesn't do when you ask it to do those tasks, even when it performs well in the local-behavior sense.