If you are going to read just one thing I wrote, read The Problem of the Criterion.
More AI-related stuff is collected over at PAISRI.
Small insight while reading this: I'm starting to suspect that most (all???) of the unintuitive things that happen with Oracles result from them violating our intuitions about causality, because they actually deliver no information. Nothing can be conditioned on what the Oracle says, because if it could, the Oracle would fail to actually be an Oracle. So we can only condition on the existence of the Oracle and how it functions, not on what it actually says. E.g. you should still 1-box, but it's mistaken to think anything an Oracle tells you allows you to do anything different.
There's no observer-independent fact of the matter about whether a system "is" an agent
Worth saying, I think, that this is true fully generally: there's no observer-independent fact of the matter about whether X "is" Y. That this is true of agents is just particularly relevant to AI.
I'm not convinced there's an actual distinction to be made here.
Using your mass comparison example, arguably the only meaningful difference between the two is where information is stored. In search-in-map it's stored in an auxiliary system; in search-in-territory it's embedded in the system. The same information is still there, though. All that's changed is the mechanism, and I'm not sure map and territory is the right way to talk about this, since both are embedded/embodied in actual systems.
My guess is that search-in-map looks like a thing apart from search-in-territory because of perceived dualism. You give the example of counterfactuals being in the map rather than the territory, but the map is itself still in the territory (as I'm sure you know), so there's no clear sense in which counterfactuals and the models that enable them are not physical processes. Yes, we can apply an abstraction to temporarily ignore the physical process, which is maybe what you mean to get at, but it's still a physical process all the same.
It seems to me maybe the interesting thing is whether you can talk about a search algorithm in terms of particular kinds of abstractions rather than anything else, which if you go far enough around comes back to your position, but with more explained.
For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.
+1 to this and excited and happy to hear about this update in your view!
Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.
It's a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left with regret that I put in too much effort.
Most examples I can think of that look like potential anti-Goodharting seem the same: I don't mind that I overshot the target, but I do mind that I wasn't as efficient as I could have been.
Maybe I'm missing something, but this seems already captured by the normal notion of what Goodharting is in that it's about deviation from the objective, not the direction of that deviation.
Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).
In the spirit then of caring about stories about how algorithms lead to bad consequences, a story about how I see not making a clear distinction between instrumental and intended models might come to bite you.
Let's use your example of a model that reports "no one entered the data center". I might think the right answer is that "no one entered the data center" when I in fact know that physically someone was in the datacenter but they were an authorized person. If I'm reporting this in the context of asking about a security breach, saying "no one entered the data center" when I more precisely mean "no unauthorized person entered the data center" might be totally reasonable.
In this case there's some ambiguity about what reasonably counts as "no one". This is perhaps somewhat contrived, but category ambiguity is a cornerstone of linguistic confusion, and it's where I see the division between instrumental and intended models breaking down. I think there is probably some chunk of things we could screen off by making this distinction that are obviously wrong (e.g. the model that tries to tell me "no one entered the data center" when in fact, even given my context of a security breach, some unauthorized person did enter the data center), and that seems useful. So I'm mainly pushing on the idea that your approach here seems insufficient for addressing alignment concerns on its own.
Not that you necessarily thought it was, but this seems like the relevant kind of issue to want to consider here.
I want to consider models that learn to predict both “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”
This seems massively underspecified in that it's really unclear to me what's actually different between the instrumental and intended models.
I say this because you posit the intended model gives "the real answer", but I don't see a means offered by which to tell "real" answers from "fake" ones. Further, for somewhat deep philosophical reasons, I also don't expect there is any such thing as a "real" answer anyway, only one that is more or less useful to some purpose, and since ultimately it's humans setting this all up, any "real" answer is ultimately a human answer.
The only difference I can find seems to be a subtle one about whether you're directly or indirectly imitating human answers. That is probably relevant for dealing with a class of failure modes like overindexing on what humans actually do vs. what we would do if we were smarter, knew more, etc., but it also still leaves you with human imitation, since there's still imitation of human concerns taking place.
Now, that actually sounds kinda good to me, but it's not what you seem to be explicitly saying when you talk about the instrumental and intended model.
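To make the quoted setup concrete, here is a minimal toy sketch of a shared "world model" feeding an "instrumental head" and an "intended head". Everything in it is a hypothetical illustration and not anything from the original post: the names `WorldState`, `world_model`, `instrumental_head`, and `intended_head`, the observation keys, and the tampered-camera scenario are all my assumptions. Note that the sketch only gets a distinct "intended" answer by assuming the world state contains a field labeled as what really happened, which is exactly the step I'm saying is underspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldState:
    """Hypothetical latent state computed once and shared by both heads."""
    someone_entered: bool      # assumed ground truth (the contested part)
    camera_shows_entry: bool   # what a human reviewing footage would see


def world_model(observations: dict) -> WorldState:
    """Shared computation: infer what happened from raw observations (toy logic)."""
    entered = bool(observations["door_sensor_tripped"])
    visible = entered and not observations["camera_tampered"]
    return WorldState(someone_entered=entered, camera_shows_entry=visible)


def instrumental_head(state: WorldState) -> str:
    """Predict how a human would answer, given only what the human can see."""
    return "someone entered" if state.camera_shows_entry else "no one entered"


def intended_head(state: WorldState) -> str:
    """Report the answer according to the model's own latent state."""
    return "someone entered" if state.someone_entered else "no one entered"


# The heads agree when the human's evidence matches the latent state...
honest = world_model({"door_sensor_tripped": True, "camera_tampered": False})
# ...and come apart only when the two diverge.
tampered = world_model({"door_sensor_tripped": True, "camera_tampered": True})
```

In this toy, the two heads differ only in which field of the shared state they read out. Nothing in the code says why `someone_entered` deserves the label "real" rather than being one more human-chosen category, such as whether an authorized person counts as "someone".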
I wrote a research agenda that suggests additional work to be done and that I'm not doing.
Firstly, we don't understand where this logical time might come from, or how to learn it
Okay, you can't write a sentence like that and expect me not to say that it's another manifestation of the problem of the criterion.
Yes, I realize this is not the problem you're interested in, but it's one I'm interested in, so this seems like a good opportunity to think about it anyway.
The issue seems to be that we don't have a good way to ground the order on world states (or, subjectively speaking if we want to be maximally cautious here, experience moments), since we only ever experience one moment at a time and any evidence we have about previous (or future) moments is something encoded within the present moment, say as a thought. So we don't have a fully justifiable notion of what it means for one moment to come before or after another, since any evidence I try to collect about it is at some level indistinguishable from the situation where I'm a Boltzmann brain that exists for only one moment and then vanishes.
Of course we can be pragmatic about it, since that's really the only option if we want to do stuff, and we certainly are, hence why we have theories of time or causality at all. So ultimately I guess I agree with you there's not much to say here about this first problem, since at some point it becomes an unresolvable question of metaphysics, and if we build a robust enough model of time then the metaphysical question is of no practical importance anyway for the level of abstraction at which we are operating.