Transparency for Generalizing Alignment from Toy Models

Johannes C. Mayer

Status: Some rough thoughts and intuitions.

🪧 indicates signposting

TL;DR If we make our optimization procedures transparent, we might be able to analyze them in toy environments to build understanding that generalizes to the real world and to more powerful, scaled-up versions of the systems.

🪧 Let's remind ourselves why we need to align powerful cognition.

One path toward executing a pivotal act is to build a powerful cognitive system. For example, if a team of humans together with the system can quickly do the necessary research to figure out how to do brain emulation correctly, and then build a functioning simulator and upload a human into it, we would win.

I expect that the level of capability the system needs to perform these tasks will make it dangerous if misaligned. For this path to work, we need to be able to guide the system's optimization.

🪧 Now I'm going to introduce a particular kind of model that I think would be more transparent than modern DL, though the overall argument should apply to any system that makes its internal workings highly transparent to us. For example, if we got really good interpretability tools for neural networks, the argument would apply there too.

Let's assume we have figured out how to create an algorithm that builds a predictive model of the world, and that both the algorithm and the resulting world model are transparent to us. To me, it seems likely that once we have such an algorithm, it would be relatively easy to put another algorithm on top that uses the world model to determine action sequences that result in particular outcomes in the world. If the world model is transparent, this will also make this decision procedure more transparent.

I'm imagining here that we know the explicit line-by-line source code that builds the world model. The same goes for the algorithm that uses the world model to determine what actions to perform. This is in contrast to only knowing an algorithm that optimizes a computational structure, such as a neural network, until the structure becomes good at performing some task, without us understanding what is going on internally. However, I am not presuming that we precisely understand all of the internal workings of these algorithms. That is what we want to end up with eventually. And importantly, I'm not assuming that these algorithms already have the relevant alignment properties that we would want.

One way that the world model could be transparent is if it decomposes the world in ways similar to how humans do. For example, there might be a particular concept corresponding to a chair and a particular concept corresponding to a table in the world model. These concepts might be thought of as objects containing one or more predictive models and other data. If we get the world modeling algorithm right, I think we would be able to inspect and understand these objects, provided the things they are modeling are simple enough.
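As a purely illustrative sketch of what such an inspectable object could look like, the following defines a hypothetical `Concept` container. The class, its fields, and the `chair_dynamics` function are assumptions made up for this example, not a proposal for how a real world model would represent concepts.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Concept:
    """Hypothetical container for one human-recognizable concept in a world model."""

    name: str
    # Predictive model: maps an observed state of the object to a predicted next state.
    predict: Callable[[Dict[str, Any]], Dict[str, Any]]
    # "Other data", e.g. typical properties, kept in a plain inspectable form.
    properties: Dict[str, Any] = field(default_factory=dict)


def chair_dynamics(state: Dict[str, Any]) -> Dict[str, Any]:
    """Toy prediction: a chair keeps its position unless it has a velocity."""
    velocity = state.get("velocity", 0)
    return {**state, "position": state["position"] + velocity}


# An example concept whose parts can each be read and tested in isolation.
chair = Concept("chair", chair_dynamics, {"legs": 4, "supports_sitting": True})
```

The point of the sketch is only that each part (the name, the prediction function, the stored properties) can be read and tested on its own, which is the kind of transparency the paragraph above has in mind.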

I do not expect that a human can look at the world model of the real world, of a full-fledged AGI, and understand it in the relevant time frame. It will simply be too big to be comprehended. Possibly even most concepts on their own would be too complicated to be comprehended in full, even when optimized for being understandable to humans.

🪧 So, let us now consider how toy models can be helpful here.

I expect that we could look at simple environments and study the world modeling algorithm there, together with the decision procedure that is layered on top of the world model. For example, we could try to elicit misalignment on purpose. We might give it an objective function that is slightly different from what we want the system to do eventually. Then we can study the system and understand what sorts of mechanisms we would need to prevent goodharting.
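A minimal sketch of eliciting this kind of misalignment on purpose, under made-up assumptions: the "environment" is just a number `x`, `true_utility` stands in for what we actually want, and `proxy_utility` is a slightly wrong objective that agrees with it only up to a point. All names here are hypothetical.

```python
def true_utility(x: float) -> float:
    """What we actually want in this toy: x as close to 5 as possible."""
    return -(x - 5.0) ** 2


def proxy_utility(x: float) -> float:
    """Slightly wrong objective: matches the true one below 5, but keeps rewarding larger x."""
    return x


def optimize(objective, candidates):
    """Brute-force optimizer: return the candidate that scores highest on the objective."""
    return max(candidates, key=objective)


def proxy_outcome(search_limit: int) -> float:
    """True utility obtained when optimizing the proxy over 0..search_limit."""
    return true_utility(optimize(proxy_utility, range(search_limit + 1)))
```

In this toy, `proxy_outcome(5)` is better than `proxy_outcome(100)`: the more search power the proxy optimizer gets, the worse the result by the true measure. A transparent system would let us look for where in the algorithm this divergence arises, rather than only observing it from outside.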

You could of course apply techniques such as quantilizers by default, but I would expect that if the world model is transparent and the optimization procedure uses the world model in a very direct way, we would be able to find structural properties that correspond to the system not optimizing too hard. Quantilization works on the level of the objective function. But if we have a white box system, we can start to work with the internal structure of the algorithms. It might be easier to understand how one specific algorithm needs to be changed so that it has various desirable alignment properties, compared to figuring out an objective function that works in general for any powerful optimizer.
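For reference, here is a rough sketch of the quantilizer idea over a finite set of actions, assuming the given list of actions stands in for samples from some base distribution. The function name and parameters are illustrative, not taken from any library.

```python
import random


def quantilize(actions, utility, q, rng=random):
    """Pick uniformly at random from the top q fraction of actions by utility.

    The actions are treated as a sample from a base distribution. With q = 1.0
    utility is ignored entirely; as q approaches 0 this approaches plain
    maximization.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))  # always keep at least one action
    return rng.choice(ranked[:cutoff])
```

Note that this treats the optimizer as a black box: it only uses the utility ranking and needs no knowledge of how actions are evaluated internally. That is the contrast with the structural, white box approach discussed above.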

For example, suppose we want to create a myopic agent. We could try to figure out a general objective function that would make any general optimizer myopic. However, it might be easier to understand how a concrete system would need to be changed to make it myopic. Perhaps you could identify where exactly the program uses the world model to predict what the future will look like. Then you could limit the number of times this predictive model can be called on outputs of itself.^[1]
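A minimal sketch of this idea, under heavy assumptions: the world is a position on a line, the world model is a hypothetical deterministic `step` function, and the planner exhaustively searches action sequences. The parameter `max_depth` caps how many times the model is applied to its own predictions. All names are made up for illustration.

```python
from itertools import product

ACTIONS = (-1, 1)  # hypothetical toy actions: move left or right on a line


def step(state: int, action: int) -> int:
    """Toy world model: predicts the next state from a state and an action."""
    return state + action


def reward(state: int) -> float:
    """Toy objective: small reward at position -1, large reward at position 3."""
    return {-1: 1.0, 3: 10.0}.get(state, 0.0)


def plan(state: int, max_depth: int) -> tuple:
    """Exhaustively search action sequences of length max_depth.

    max_depth bounds how often the world model is applied to its own
    predictions, which is the structural "myopia knob" in this sketch.
    """
    best_seq, best_value = (), float("-inf")
    for seq in product(ACTIONS, repeat=max_depth):
        s, value = state, 0.0
        for a in seq:
            s = step(s, a)  # model called on its own previous output
            value += reward(s)
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq
```

Starting from position 0, `plan(0, 1)` moves toward the nearby small reward, while `plan(0, 3)` heads for the distant large one. The limit lives in the structure of the planner, not in the objective function, which stays the same in both cases.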

The idea is to use these toy environments to analyze the algorithms such that this understanding generalizes to when we run the system in the real world.

🪧 Let's make this more concrete with an analogy: how we might do this kind of analysis on a much simpler algorithm.

Consider the following example: If you program a list sorting algorithm, you can study it by looking at what happens for small lists. Either you look at the algorithm and think about what is going on, or you step through it using a debugger. You can then use your observations to understand how the list sorting algorithm works and how it will behave for larger lists. You can build intuitions that generalize. I am not thinking of the case where you just look at a bunch of examples, notice that all of them behave as you want, and are then satisfied. Instead, I am thinking of the scenario where you carefully study the program and build up your understanding to such an extent that you know that the list sorting algorithm will work for any possible input you could give it.
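As a concrete instance, take insertion sort (chosen here purely for illustration). The sketch below optionally prints the list after each outer iteration and checks the loop invariant that the prefix processed so far is sorted. Watching this on a three-element list is enough to see why the same invariant holds for a list of any length.

```python
def insertion_sort(items, trace=False):
    """Return a sorted copy of items; optionally print each intermediate state."""
    xs = list(items)
    for i in range(1, len(xs)):
        key = xs[i]
        j = i - 1
        # Shift larger elements of the sorted prefix xs[:i] one slot to the right.
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = key
        # Invariant: xs[:i + 1] is now sorted, regardless of the list's length.
        assert all(xs[k] <= xs[k + 1] for k in range(i))
        if trace:
            print(i, xs)
    return xs
```

Calling `insertion_sort([3, 1, 2], trace=True)` shows the sorted prefix growing by one element per iteration. The understanding gained is about the invariant, not about the particular examples, and that is what transfers to larger inputs.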

Note that I am intentionally talking about intuition, and not about mathematical proof. Sure, you can formally verify that a particular list sorting algorithm will always produce a sorted list. But doing this kind of analysis is often tedious. When I'm writing a list sorting algorithm, and want to be confident that the output will always be sorted, I don't need to write the algorithm in Coq. I can work with my intuitions and carefully look at the program that I am writing. I'm not saying don't use mathematical proof. Instead, I'm saying that it would be very good if we can put our programs into a form that is conducive to building correct, deep-reaching intuitions that generalize widely.

Similarly, we might be able to study the property of being inner aligned (or any other property, really) in the toy setup and understand what properties of the internals of the system would correspond to being inner aligned. And not just understand them superficially, but grok them deeply in a way that makes us grasp how to build the system such that it would keep being inner aligned even if we scale up the system and put it in the real world.^[2]

Edit: Also see this comment for further clarification.

^[1] This is more of a rough illustrative example. It probably doesn't correspond to getting the notion of myopia we would want, but it seems to roughly go in the right direction.

^[2] Inner alignment is probably not a property we would verify directly, but rather we would decompose it and verify various sub-properties. So inner alignment here is just an illustrative example.
