Director of Research at PAISRI
Interesting. I can't recall if I commented on the "alignment as translation" post about this, but I think this is in fact the key thing standing in the way of addressing alignment. I put together a formal model that identified this as the problem, i.e. how do you ensure that two minds agree about a preference ordering, or really even about the statements being ordered?
This post I wrote a while back has some references you might find useful: "A developmentally-situated approach to teaching normative behavior to AI".
Also I think some of the references in this paper I wrote might be useful: "Robustness to fundamental uncertainty in AGI alignment".
Topics that seem important to me to cover include not only AI's impact on humans but also questions surrounding the subjective experience of AI, which largely revolve around whether AIs have subjective experience, or are otherwise moral patients, at all.
Branch predictors for sure, but modern CPUs do other things too: they manage multiple layers of cache using relatively simple algorithms that nonetheless get high hit rates in practice; they convert instructions into micro-operations, because it turns out small, simple instructions execute faster, so the CPU interprets instructions in real time into simpler operations dispatched to specialized execution units inside the CPU; and they do speculative execution, where instructions in the pipeline are provisionally executed ahead of a branch being confirmed. All of these seem like tricks of the sort I wouldn't be surprised to find parallels to in the brain.
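To make the cache point concrete, here's a toy sketch (all numbers and names are illustrative, not drawn from any real CPU) of how a simple eviction policy like LRU gets a high hit rate whenever the workload has temporal locality, i.e. most accesses revisit a small hot set of addresses:

```python
from collections import OrderedDict
import random

class LRUCache:
    """Toy cache: evicts the least-recently-used line when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)      # mark as most recently used
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the LRU line
            self.lines[addr] = True

    def hit_rate(self):
        return self.hits / self.accesses

# Workload with temporal locality: 90% of accesses revisit a small
# hot set that fits in the cache; 10% are scattered cold accesses.
random.seed(0)
cache = LRUCache(capacity=64)
for _ in range(10_000):
    if random.random() < 0.9:
        addr = random.randrange(64)       # hot set
    else:
        addr = random.randrange(100_000)  # cold, rarely repeated
    cache.access(addr)

print(f"hit rate: {cache.hit_rate():.2f}")
```

Despite the policy being trivial, the hit rate lands far above what a cache this small could achieve on uniformly random accesses, which is the same bet real cache hierarchies make: locality does the heavy lifting, not cleverness.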
Your toy models drew a parallel for me to modern CPU architectures. That is, doing computation the "complete" way involves loading values from memory, doing math on them, and writing results back to memory, where they might affect later instructions. CPUs have all kinds of tricks to get around this to go faster, and that's sort of like your models of brain parts, only with a reversed etiology, since the ALU came first whereas the neocortex came last, as I understand it.
Hmm, I see some problems here.
By looking for manipulation on the basis of counterfactuals, you're at the mercy of your ability to find such counterfactuals, and that ability can itself be manipulated, such that you notice neither the object-level counterfactuals that would make you suspect manipulation nor the counterfactuals about your counterfactual reasoning that would make you suspect your reasoning had been manipulated. This seems like an insufficiently robust way to detect manipulation, or even to define it, since the mechanism for detecting it can itself be manipulated into not noticing what it would otherwise have considered manipulation.
Perhaps my point is to express general doubt that we can cleanly detect manipulation outside the context of human behavioral norms. I suspect the cognitive machinery that implements those norms is malleable enough that it can be manipulated into not noticing what it would previously have considered manipulation. Nor is it clear this is always bad, since in some cases we might be mistaken, in some sense, about what is really manipulative, though that raises the further problem that it's not clear what it means to be mistaken about normative claims.
So "no manipulation" or "maintaining human free will" seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.
One, this seems likely to have some overlap with notions of impact and impact measures.
Two, it seems like there's no real way to eliminate manipulation in a very broad sense, because we'd expect our AI to be causally entangled with the human: there's no action the AI could take that would not influence the human in some way. So whether there is manipulation seems to require a choice about which kinds of changes in the human's behavior matter, similar to the problems we face in specifying values or defining concepts.
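The indifference idea above can be sketched as a toy decision rule. Everything here (the actions, payoffs, and the counterfactual "deliberate" response) is made up for illustration: the AI has an accurate model of how each action would shift the human's choice, but when scoring actions it substitutes a fixed counterfactual response instead of the predicted, influenced one.

```python
def predicted_human_choice(action):
    # The AI's (accurate) model of how its action influences the human.
    influence = {"honest_report": "deliberate", "nudge": "comply"}
    return influence[action]

def utility(action, human_choice):
    # Payoffs the AI is optimizing; "nudge" only pays off via its
    # influence on the human's choice.
    payoffs = {
        ("honest_report", "deliberate"): 1.0,
        ("honest_report", "comply"): 1.0,
        ("nudge", "deliberate"): 0.5,
        ("nudge", "comply"): 2.0,
    }
    return payoffs[(action, human_choice)]

def indifferent_action(actions, counterfactual_choice="deliberate"):
    # Indifference: score every action as if the human's choice were
    # held fixed at the counterfactual, ignoring the AI's real influence.
    return max(actions, key=lambda a: utility(a, counterfactual_choice))

actions = ["honest_report", "nudge"]
naive = max(actions, key=lambda a: utility(a, predicted_human_choice(a)))
indifferent = indifferent_action(actions)
print(naive, indifferent)  # naive picks "nudge"; indifferent picks "honest_report"
```

The naive optimizer exploits its influence because that's where the payoff is; the indifferent one can't gain anything by changing the human's choice, so the manipulative action loses its appeal. Of course, this just relocates the hard problem into choosing the counterfactual, which is the specification worry raised above.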
I recently watched all 6 seasons of HBO's "Silicon Valley", and the final episode (or really the final 4 episodes leading up to it) did a really great job of hitting on some important ideas we talk about in AI safety.
Now, the show in earlier seasons has played with the idea of AI: an obvious parody of Ben Goertzel and Sophia, a discussion of Roko's Basilisk, and of course an AI that Goodharts. In fact, Goodharting is a pivotal plot point in how the show ends, along with a Petrov-esque ending where hard choices have to be made under uncertainty to protect humanity, and it all has to be kept secret due to an information hazard.
Goodhart, Petrov, and information hazards are never mentioned by name in the show, but the topics are clearly present. Given that the show was/is popular with folks in the SF Bay Area tech scene, because it does such a good job of mirroring back what it's like to live in that scene (even if the portrayal is hyperbolic), I wonder if, and hope that, it will helpfully nudge folks toward normalizing taking AI safety seriously, and toward seeing it as virtuous to forgo personal gain in exchange for safeguarding humanity.
I don't expect things to change dramatically because of the show, but on the margin it might work to make us a little bit safer. For that reason I think it's likely a good idea to encourage folks not already dedicated to AI safety to watch the show, so long as the effort involved is minimal.
One thing I like about this series is that it puts all this online in a fairly condensed form; I'm often not quite sure what to link to in order to present these kinds of arguments. That you do it better than we have perhaps done in the past makes it all the better!
Any model is going to be in the head of some onlooker. This is the tough part about the white box approach: it's always an inference about what's "really" going on. Of course, this is true even of the boundaries of black boxes, so it's a fully general problem. And I think that suggests it's not a problem except insofar as we have normal problems setting up correspondence between map and territory.
I'm excited to see this cross over into AI safety discussions. I work on what we often call "reliability engineering" in software, and I think there are a lot of lessons there that apply here, especially the systems-based, highly contextualized approach, since it acknowledges the same kind of failure that, say, The Design of Everyday Things points out: just because you build something to spec doesn't mean it works, if humans make mistakes using it.
I've not done a lot to bring that over to LW or AF, other than a half-assed post about normalization of deviance. I'm not a great explainer, so I often feel like it's not the most valuable thing for me to do, but pointing people to this field seems neglected and valuable: it gets them thinking more about how real systems fail today, rather than theorizing about how AI might fail in the future, or about the relatively constrained ways AI fails today.