These all seem to be pointing to different aspects of the same problem.

  • Cross-ontology goal translation: given a utility function over a latent variable in one model, find an equivalent utility function over latent variables in another model with a different ontology. One subquestion here is how the first model’s input data channels and action variables correspond to the other model’s input data channels and action variables - after all, the two may not be “in” the same universe at all, or they may represent entirely separate agents in the same universe who may or may not know of each other's existence.
  • Correspondence theorems: quantum mechanics should reduce to classical mechanics in places where classical worked well, special relativity should reduce to Galilean relativity in places where Galilean worked well, etc. As we move to new models with new ontologies, when and how should the structure of the old models be reproduced?
  • The indexing problem: I have some system containing three key variables A, B, and C. I hire someone to study these variables, and after considerable effort they report that X is 2.438. Apparently they are using different naming conventions! What is this variable X? Is it A? B? C? Something else entirely? Where does their X fit in my model? (A toy sketch of this matching problem appears just after this list.)
  • How do different people ever manage to point to the same thing with the same word in the first place? Clearly the word “tree” is not a data structure representing the concept of a tree; it’s just a pointer. What’s the data structure? What’s its type signature? Similarly, when I point to a particular tree, what’s the data structure for the concept of that particular tree? How does the “pointer” aspect of these data structures work?
  • When two people are using different words for the same thing, how do they figure that out? What about the same word for different things?
  • I see a photograph of a distinctive building, and wonder “Where is this?”. I have some data - i.e. I see the distinctive building - but I don’t know where in the world the data came from, so I don’t know where in my world-model to perform an update. Presumably I need to start building a little side-model of “wherever this picture was taken”, and then patch that side-model into my main world model once I figure out “where it goes”.
  • Distributed models and learning: a bunch of different agents study different (but partially overlapping) subsystems of a system - e.g. biologists study different subsystems of a bacterium. Sometimes the agents end up using different names or even entirely different ontologies - e.g. some parts of a biological cell require thinking about spatial diffusion, while others only require overall chemical concentrations. How do we combine submodels from different agents, different ontologies, and different data? How can we write algorithms which learn large model structures by stitching together small structures, each learned independently from different subsystems/data?
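
To make the indexing problem concrete, here’s a minimal numerical sketch (the variables, the shared observable, and the matching heuristic are all invented for illustration): we guess which of our variables the reported X corresponds to by comparing how each candidate behaves relative to an observable that both parties can see. The trick only works because we’ve assumed some common ground to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy low-level world: a shared latent state both parties can partially observe.
state = rng.normal(size=10_000)

# My model's variables A, B, C (different functions of the shared state).
A = state + 0.1 * rng.normal(size=state.shape)
B = state**2 + 0.1 * rng.normal(size=state.shape)
C = np.sin(state) + 0.1 * rng.normal(size=state.shape)

# The other researcher reports samples of some variable they call "X".
# (Unbeknownst to us, it is their name for B, measured with their own noise.)
X = state**2 + 0.2 * rng.normal(size=state.shape)

# An observable both of us can see, used as common ground for the comparison.
observable = state + 0.05 * rng.normal(size=state.shape)

def profile(var, obs, bins=20):
    """Summarize a variable by its mean within quantile bins of the shared observable."""
    edges = np.quantile(obs, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.digitize(obs, edges[1:-1]), 0, bins - 1)
    return np.array([var[idx == b].mean() for b in range(bins)])

x_profile = profile(X, observable)
scores = {name: np.mean((profile(v, observable) - x_profile) ** 2)
          for name, v in {"A": A, "B": B, "C": C}.items()}
print(scores)                       # B should have the smallest mismatch
print(min(scores, key=scores.get))  # -> "B"
```

Take away that shared observable, and it’s no longer obvious what the comparison should even be - that’s the harder version of the problem below.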

Abstraction plays a role in these, but it’s not the whole story. It tells us how high-level concepts relate to low-level structure, and why very different cognitive architectures would lead to surprisingly similar abstractions (e.g. neural nets learning similar concepts to humans). If we can ground two sets of high-level abstractions in the same low-level world, then abstraction can help us map from one set of high-level concepts down to the low level and back up to the other. But if two neural networks are trained on different data, and possibly even different kinds of data (like infrared vs visual spectrum photos), then we need a pretty detailed outside model of the shared low-level world in order to map between them.

Humans do not seem to need a shared low-level world model in order to pass concepts around from human to human. Things should ultimately be groundable in abstraction from the low level, but it seems like we shouldn’t need a detailed low-level model in order to translate between ontologies.

In some sense, this looks like Ye Olde Symbol Grounding Problem. I do not know of any existing work on that subject which would be useful for something like “given a utility function over a latent variable in one model, find an equivalent utility function over latent variables in another model”, but if anybody knows of anything promising then let me know.

Not Just Easy Mode

After poking at these problems a bit, they usually seem to have an “easy version” in which we fix a particular Cartesian boundary.

In the utility function translation problem, it’s much easier if we declare that both models use the same Cartesian boundary - i.e. same input/output channels. Then it’s just a matter of looking for functional isomorphism between latent variable distributions.
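
As a sketch of what that easy-mode matching could look like - with hand-written functions standing in for the two models’ latents, and rank correlation as a crude stand-in for “functional isomorphism up to reparameterization”:

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared Cartesian boundary: both models see exactly the same input channel.
inputs = rng.normal(size=(5_000, 3))

# Hand-written stand-ins for two learned models' latent variables.
latents_A = {
    "a1": inputs[:, 0] + inputs[:, 1],
    "a2": np.tanh(inputs[:, 2]),
}
latents_B = {
    "b1": np.exp(inputs[:, 2]),                 # same information as a2, different parameterization
    "b2": 3.0 * (inputs[:, 0] + inputs[:, 1]),  # rescaled version of a1
}

def rank_corr(x, y):
    """Spearman-style rank correlation: invariant to monotone reparameterizations."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Match each of A's latents to the B latent it is (approximately) a monotone function of.
for name_a, za in latents_A.items():
    best = max(latents_B, key=lambda name_b: abs(rank_corr(za, latents_B[name_b])))
    print(name_a, "->", best)   # expected: a1 -> b2, a2 -> b1
```

Real models would call for comparing full conditional distributions rather than one correlation statistic, but the shape of the easy-mode problem is the same: given a shared input channel, matching latents is “just” a statistics problem.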

For correspondence theorems, it’s much easier if we declare that all models predict exactly the same data, or at least the same observable distribution. Again, the problem roughly reduces to functional isomorphism.
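
As a reminder of what “the structure of the old model is reproduced” means in the physics examples above, here’s a tiny symbolic check (sympy, purely illustrative): relativistic kinetic energy reduces to the classical (1/2)mv^2 in the low-velocity regime where the old model worked.

```python
import sympy as sp

m, v, c = sp.symbols("m v c", positive=True)
gamma = 1 / sp.sqrt(1 - v**2 / c**2)

# Special relativity's kinetic energy...
relativistic_KE = (gamma - 1) * m * c**2

# ...expanded in the regime where the old (Galilean) model worked, i.e. v << c.
# Leading term: m*v**2/2, the classical kinetic energy; corrections are O(v**4/c**2).
print(sp.series(relativistic_KE, v, 0, 6))
```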

Similarly with distributed models/learning: if a bunch of agents build their own models of the same data, then there are obvious (if sometimes hacky) ways to stitch them together. But what happens when they’re looking at different data on different variables, and one agent’s inferred latent variable may be another agent’s observable?
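
Here’s what that “obvious (if sometimes hacky)” stitching can look like in a toy linear-Gaussian setting (the chain structure, variable names, and estimators are all invented for illustration): two agents each fit a submodel on their own slice of the data, and the submodels get glued together at the shared variable.

```python
import numpy as np

rng = np.random.default_rng(2)

# Ground-truth system (a chain A -> B -> C), used only to generate the data.
A = rng.normal(size=20_000)
B = 2.0 * A + rng.normal(size=A.shape)
C = -1.0 * B + rng.normal(size=B.shape)

def fit_linear_gaussian(x, y):
    """Fit y = slope*x + intercept + Gaussian noise by least squares."""
    slope, intercept = np.polyfit(x, y, 1)
    noise_std = (y - (slope * x + intercept)).std()
    return slope, intercept, noise_std

# Agent 1 only ever records (A, B); agent 2 only ever records (B, C).
model_ab = fit_linear_gaussian(A, B)   # agent 1's submodel: B given A
model_bc = fit_linear_gaussian(B, C)   # agent 2's submodel: C given B

# Easy-mode stitching: glue the submodels at the shared variable B, assuming A and C
# are conditionally independent given B.
def sample_stitched(n):
    s_ab, i_ab, n_ab = model_ab
    s_bc, i_bc, n_bc = model_bc
    a = rng.normal(size=n)
    b = s_ab * a + i_ab + n_ab * rng.normal(size=n)
    c = s_bc * b + i_bc + n_bc * rng.normal(size=n)
    return a, b, c

a_s, b_s, c_s = sample_stitched(20_000)
print("true corr(A, C):    ", np.corrcoef(A, C)[0, 1])
print("stitched corr(A, C):", np.corrcoef(a_s, c_s)[0, 1])
```

The glue step leans entirely on both agents calling the shared variable “B” and measuring it the same way; drop that assumption and we’re back to the hard version of the problem.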

The point here is that I don’t just want to solve these on easy mode, although I do think some insights into the Cartesian version of the problem might help in the more general version.

Once we open the door to models with different Cartesian boundaries in the same underlying world, things get a lot messier. To translate a variable from model A into the space of model B, we need to “locate” model B’s boundary in model A, or locate model A’s boundary in model B, or locate both in some outside model. That’s the really interesting part of the problem: how do we tell when two separate agents are pointing to the same thing? And how does this whole “pointing” thing work to begin with?

Motivation

I’ve been poking around the edges of this problem for about a month, with things like correspondence theorems and seeing how some simple approaches to cross-ontology translation break. Something in this cluster is likely to be my next large project.

Why this problem?

From an Alignment as Translation viewpoint, this seems like exactly the right problem to make progress on alignment specifically (as opposed to embedded agency in general, or AI in general). To the extent that the “hard part” of alignment is translating from human concept-space to some AI’s concept-space, this problem directly tackles the bottleneck. Also closely related is the problem of an AI building a goal into a successor AI - though that’s probably somewhat easier, since the internal structure of an AI is easier to directly probe than a human brain.

Work on cross-ontology transport is also likely to yield key tools for agency theory more generally. I can already do some neat things with embedded world models using the tools of abstraction, but it feels like I’m missing data structures to properly represent certain pieces - in particular, data structures for the “interface” where a model touches the world (or where a self-embedded model touches itself). The indexing problem is one example of this. I think those interface-data-structures are the main key to solving this whole cluster of problems.

Finally, this problem has a lot of potential for relatively-short-term applications, which makes it easier to build a feedback cycle. I could imagine identifying concept-embeddings by hand or by ad-hoc tricks in one neural network or probabilistic model, then using ontology translation tools to transport those concept-embeddings into new networks or models. I could even imagine whole “concept libraries”, able to import pre-identified concepts into newly trained models. This would give us a lot of data on how robust identified abstract concepts are in practice. We could even run stress tests, transporting concepts from model to model to model in a game of telephone, to see how well they hold up.
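
Here’s a hedged sketch of what such a concept-transport tool might look like, with random-feature encoders standing in for trained networks (every encoder, name, and design choice below is hypothetical): identify a concept as a linear probe direction in one model, fit a linear “stitching” map between the two models’ activations on shared inputs, then push the probe through that map.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two "networks", here just random-feature encoders over the same input space.
# (In practice these would be separately trained models; everything here is a stand-in.)
# Small weights keep tanh near its linear regime, so a linear stitch works well in this toy.
W1 = 0.3 * rng.normal(size=(16, 8))
W2 = 0.3 * rng.normal(size=(24, 8))
act1 = lambda x: np.tanh(x @ W1.T)   # model 1's activations
act2 = lambda x: np.tanh(x @ W2.T)   # model 2's activations

# Step 1: identify a concept in model 1 as a linear probe direction, using labeled
# examples (the "concept" here is just sign(x[0]), purely for illustration).
x_labeled = rng.normal(size=(2_000, 8))
labels = (x_labeled[:, 0] > 0).astype(float)
probe1, *_ = np.linalg.lstsq(act1(x_labeled), labels, rcond=None)

# Step 2: fit a linear "stitching" map from model 2's activations to model 1's,
# using a batch of shared unlabeled inputs as common ground.
x_shared = rng.normal(size=(2_000, 8))
stitch, *_ = np.linalg.lstsq(act2(x_shared), act1(x_shared), rcond=None)

# Step 3: transport the concept by pushing the probe through the stitching map.
probe2 = stitch @ probe1

# Check how well the transported concept tracks the original on fresh data.
x_test = rng.normal(size=(2_000, 8))
pred1 = act1(x_test) @ probe1
pred2 = act2(x_test) @ probe2
print("agreement between original and transported concept:",
      np.corrcoef(pred1, pred2)[0, 1])
```

The “concept library” version would just be a collection of such probes plus a recipe for fitting the stitching map against each new model.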

Anyway, that’s one potential vision. For now, I’m still figuring out the problem framing. Really, the reason I’m looking at this problem is that I keep running into it as a bottleneck to other, not-obviously-similar problems, which makes me think that this is the limiting constraint on a broad class of problems I want to solve. So, over time I expect to notice additional possibilities which a solution would unblock.

Comments

Interesting. I can't recall if I commented on the alignment-as-translation post about this, but I think this is in fact the key thing standing in the way of addressing alignment, and I put together a formal model that identified this as the problem: how do you ensure that two minds agree about preference ordering, or really even about the statements being ordered?

Clearly the word “tree” is not a data structure representing the concept of a tree; it’s just a pointer. What’s the data structure?

I have some thoughts, but if they're right, then this would be getting into the domain of "detailed AGI algorithm design", which I don't think is productive to share given the state of the world vis-à-vis AGI prep; and if they're wrong (more likely anyway), there's similarly no point in sharing them.

I was not thinking about it before reading this comment, but even partial solutions to the problem in this post would probably both advance capabilities and safety. My first impression is that it helps build capability in a way that ensures more alignment, so it might be a net positive for alignment and safety. But that wouldn't necessarily hold if we also care about the misuse of aligned AI (which we probably should).

As always, nice post. The problem does indeed seem central to many applications of abstraction, especially assuming, as you do, that alignment reduces to translation between our ontology and the AI's ontology.

I especially like this summary/main takeaway:

Things should ultimately be groundable in abstraction from the low level, but it seems like we shouldn’t need a detailed low-level model in order to translate between ontologies.

Also, reading this, it seems like you consider that you have solved abstraction (you write about this being your next project). Is that the case, or are you just changing problems for a while to keep things fresh?

At this point, I think that I personally have enough evidence to be reasonably sure that I understand abstraction well enough that it's not a conceptual bottleneck. There are still many angles to pursue - I still don't have efficient abstraction learning algorithms, there are probably good ways to generalize it, and of course there's empirical work. I also do not think that other people have enough evidence that they should believe me at this point, when I claim to understand it well enough. (In general, if someone makes a claim and backs it up by citing X, then I should assign the claim lower credence than if I stumbled on X organically, because the claimant may have found X via motivated search. This leads to an asymmetry: sometimes I believe a thing, but I do not think that my claim of the thing should be sufficient to convince others, because others do not have visibility into my search process. Also, I just haven't clearly written up every little piece of evidence.)

Anyway, when I consider what barriers are left assuming my current model of abstraction and how it plays with the world are (close enough to) correct, the problems in the OP are the biggest. One of the main qualitative takeaways from the abstraction project is that clean cross-model correspondences probably do exist surprisingly often (a prediction which neural network interpretability work has confirmed to some degree). But that's an answer to a question I don't know how to properly set up yet, and the details of the question itself seem important. What criteria do we want these correspondences to satisfy? What criteria does the abstraction picture predict they satisfy in practice? What criteria do they actually satisfy in practice? I don't know yet.