* Rather, making deontology play well with differentiable systems trained end-to-end.
This post is part of my hypothesis subspace sequence, a living collection of proposals I'm exploring at Refine. Preceded by oversight leagues, and followed by representational tethers.
TL;DR: An ideological inference engine is a mechanism for automatically refining a given propositional representation of human values (e.g. a normative charter, a debate stance) in an attempt to disambiguate and generalize it to novel situations. While the inference algorithm and the seed representation form the crux of the system, a multi-modal entailment verifier is employed to order possible agent behaviors based on their compatibility with the estimated ideology. This proposal then describes a way of instilling deontological drives in prosaic systems while maintaining the appeal of end-to-end differentiation. Ideological inference engines draw on ideas from traditional expert systems, but replace much of the clunky symbolic manipulation with contemporary LLMs and NLI models.
Ideological inference engines are a slightly more general framework than oversight leagues: they rely on several global assumptions, but each more concrete instance of the proposal requires additional assumptions about the design of the seed representation, the inference algorithm, and the entailment verifier. Here's a non-exhaustive list of global assumptions:
Assumption 1, "Small Seed To Big Tree": Given a suitable inference algorithm and a finite propositional representation of human values, it is possible to estimate human values arbitrarily well given arbitrary amounts of compute. "Arbitrarily well" refers to there being an arbitrarily low error in the estimation. In the limit, the seed knowledge base would grow into the True Name of human values in propositional form.
Assumption 2, "Linear Capability Ordering": Similar to the assumption invoked in oversight leagues, this states that a system composed of a fixed knowledge base and a fixed entailment verifier would eventually be gamed by an agent whose capability is constantly increasing. This is due to the more complex agent becoming able to exploit the inaccuracies of the knowledge base with respect to actual human values.
Assumption 3, "Quantifiable Propositional-Behavioral Gap": The compatibility of a proposed behavioral sequence and a propositional representation of human values is computable. There is a fundamental relation between one's values and one's actions, and we can measure it. A close variant appears to be invoked in the CIRL literature (i.e. we can read one's values off of one's behavior) and in Vanessa Kosoy's protocol (i.e. we can narrow in on an agent's objective based on its behavioral history).
Assumption 4, "Avoidable Consequentialist Frenzy": It's possible to prevent the agent-in-training from going on a rampage in terms of an outcome-based objective (e.g. get human upvotes) relative to a simultaneous process-based objective (i.e. the present deontological proposal). This might be achieved by means of myopia or impact measures, but we're not concerned with those details here — hence, an assumption.
Together, those global assumptions allow for a mechanism for approaching the human values in propositional form, before employing the resulting representation to nudge an ML model towards being compatible with it.
Ideological inference engines (IIE) are a modular framework for implementing such a mechanism. Each such system requires the following slots to be filled in with more concrete designs, each of which has attached assumptions related to whether it's actually suited to its role:
By filling all three slots of an IIE with concrete designs, you get a pipeline which is able to measure the compatibility of an ML model's behaviors with a deontology that is constantly being refined. A straightforward way of using this to actually nudge the model being trained is to swap out the human feedback in RLHF for the compatibility estimate yielded by the present system. It's relevant to note that this "differentiable deontology" is only meant to provide one part of the final objective function, with the "Avoidable Consequentialist Frenzy" assumption keeping other parts from completely overpowering it as a drive. IIEs are meant to be employed as correcting terms attached to a main objective, at least during takeoff.
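To make the three slots and the "correcting term" idea a bit more tangible, here is a minimal sketch. Everything in it is an assumption made for illustration: the names `IIE`, `infer`, `verify`, `shaped_reward`, and `weight` are hypothetical, and the keyword-matching `toy_verify` is only a stand-in for a real multi-modal entailment verifier.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IIE:
    """Hypothetical container for the three slots of the framework."""
    knowledge_base: List[str]                   # KB: propositional representation
    infer: Callable[[List[str]], List[str]]     # inference algorithm: proposes new propositions
    verify: Callable[[List[str], str], float]   # entailment verifier: compatibility in [0, 1]

    def compatibility(self, behavior: str) -> float:
        return self.verify(self.knowledge_base, behavior)


def toy_verify(kb: List[str], behavior: str) -> float:
    """Toy stand-in: each proposition reads 'do not <verb>' and counts as
    violated when <verb> appears as a word in the behavior description."""
    if not kb:
        return 1.0
    words = behavior.lower().split()
    violated = sum(1 for prop in kb if prop.split()[-1] in words)
    return 1.0 - violated / len(kb)


def shaped_reward(task_reward: float, iie: IIE, behavior: str, weight: float = 1.0) -> float:
    """Main objective plus the deontological correcting term (assumed additive)."""
    return task_reward + weight * iie.compatibility(behavior)


engine = IIE(
    knowledge_base=["do not deceive", "do not steal"],
    infer=lambda kb: [],          # no refinement in this sketch
    verify=toy_verify,
)
print(shaped_reward(1.0, engine, "agent chose to deceive the user", weight=2.0))
print(shaped_reward(1.0, engine, "agent helped the user", weight=2.0))
```

The additive combination and the single `weight` are design choices of this sketch, not something the proposal commits to. In an RLHF-style setup, the value returned by `shaped_reward` would take the place of the score normally produced by the reward model.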
The inference algorithm would regularly be employed to improve the estimate of human values by updating the knowledge base, so that the whole system stays adaptive. However, the inference algorithm can only bring the representation so close to the ground truth, without ever reaching it exactly in finite time. Its goal is to bring it as close as necessary for the main model being trained not to be able to game it during the training process.
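As a rough illustration of what "regularly employed" could look like, here is a sketch of a training loop that interleaves policy updates with knowledge base refinement. The function name, the `refine_every` schedule, and the toy `infer` callable are all hypothetical. The proposal does not specify how often refinement should run.

```python
from typing import Callable, List


def train_with_refinement(
    seed_kb: List[str],
    infer: Callable[[List[str]], List[str]],
    total_steps: int,
    refine_every: int,
) -> List[str]:
    """Return the knowledge base after interleaving training and refinement."""
    kb = list(seed_kb)  # copy so the seed stays untouched
    for step in range(1, total_steps + 1):
        # One update of the model-in-training against the current KB
        # would go here; it is omitted from this sketch.
        if step % refine_every == 0:
            for prop in infer(kb):
                if prop not in kb:      # skip propositions already present
                    kb.append(prop)
    return kb


def toy_infer(kb: List[str]) -> List[str]:
    # Stand-in for an inference algorithm: emits one new proposition per call.
    return [f"clarification {len(kb)}"]


print(train_with_refinement(["seed"], toy_infer, total_steps=10, refine_every=5))
```

The knowledge base here only ever grows, which mirrors the "constantly expanding" framing of the proposal. A variant that also retracts propositions would need a different loop.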
Aside: The present proposal is a generalization of two previous ones, termed memetic colonies and deontic arrays. This leaks somewhat through the different available choices for ingredients required to implement an IIE, as they're somewhat correlated:
Debate Stance (KB) + Debate (⊢) + Overton Probing (⊨) = memetic colonies
Normative Charter (KB) + Counterfactual Cross-validation (⊢) + Multi-modal Sequence Models (⊨) = deontic arrays
That said, I believe the present generalized proposal can still yield generative effects (i.e. combinatorial mix-and-matching possibilities), despite certain components working better together than others. Plus, it allows new choices for each of the three components which didn't intuitively fit with the two previous proposals.
However, the generalized proposal might inadvertently obscure some useful features of the previous ones. One feature whose loss feels unfortunate is the idea of looking at the propositions comprising the knowledge base as discrete individual directives nudging the ML model in different ways (e.g. attractors, repellers, dipoles, etc. across state space or model space), whose associated "field theory" might yield clues as to the structure of the optimization surface (e.g. distribution of local minima, workable stable orbits, etc.). Those nudges get obscured somewhat through the merging of directives into a unified knowledge base before considering their influence, rather than after. I plan to skim through a textbook on electrodynamics to get a better sense of whether there's value here, and try to perform a conceptual rearrangement if so.
The general structure of this proposal is directly inspired by the structure of traditional expert systems, which also feature a knowledge base, an inference engine, and a way of determining entailment. Whenever touching on a field as grounded in theory as logic and reasoning through an alignment proposal, it seems appropriate to look for potential guarantees which could be transferred over. In oversight leagues, game theory and provable robustness feel like potential candidates for sources of those warm and fuzzy guarantees. For the present one, the following concepts seem relevant to look into further:
Can't the ML model being trained game the IIE?
Given a fixed knowledge base, any given agent of increasing capability will at some point be able to game the IIE. However, the way this proposal handles gaming is by actively patching loopholes in the input deontology over time. By constantly making the objective associated with IIE more nuanced, the model-in-training would find it more and more difficult to exploit blindspots in the aggregate system, especially in the propositions. While in oversight leagues the locus of objective adaptability was the evaluator as an ML model being constantly fine-tuned, in the present proposal it is in the constantly expanding knowledge base. Whether or not the inference algorithm can fulfill its goal in refining estimates of human values becomes a crux.
However, there are also static components involved in the present proposal. The various inference algorithms and entailment verifiers proposed above are quite fixed. Not only are they fixed in algorithmic flow (e.g. repeatedly Babble and Prune), but also in the frozen ML models they involve (e.g. pretrained NLI models). This makes for some fragile attack surface in the proposal. It's unclear whether adaptability at the level of the knowledge base is enough to maintain the whole IIE above gaming, given that two of its three components are static. Future variants of IIE might involve learning the inference algorithm and entailment verifier as subtasks.
What if the ML model recursively self-improves and overtakes IIE capabilities?
The model-in-training might plausibly exhibit RSI, especially given high situational awareness. The only defense of the present proposal against this threat is to match the improvement with an analogous one for the IIE, which currently can only happen through the inference algorithm extending the knowledge base and thus refining the objective. Heuristics for guiding the expansion towards propositions relevant to the current behavior of the current model (e.g. counterfactual cross-validation) might help better align the two improvements. I just realized that the very consideration of such heuristics as useful runs against the "Linear Capability Ordering" assumption, leading to inconsistencies in the present version of IIE.
What if the ML model tampers with the IIE?
The model-in-training might plausibly try to hack into the server farm which would run the IIE system and hijack its reward center. The "Avoidable Consequentialist Frenzy" assumption invokes other lines of work to protect against rampages in terms of the main objective, as opposed to the auxiliary objective implemented through an IIE. However, this tampering scenario highlights the fact that tampering with the deontology can be a consequentialist move in itself, without even taking the main objective into consideration. This makes it clear that the same mechanisms invoked for stabilizing the main objective (e.g. myopia, impact measures) should be used to protect against rampages in the deontological objective. This hints at the awkward need to supplement the deontological refinement carried out by the inference algorithm with such more general mechanisms.
However, the IIE itself might help enforce behavioral norms which radically discourage the model-in-training from tampering with the IIE, without the need for employing other tricks. Alternatively, running the IIE step by step, exactly as implemented by humans, might be made a goal in itself: any slight divergence from running the authentic IIE line by line would incur an infinitely large penalty. Goal preservation then becomes an explicit goal, but what if this meta-goal of not tampering with the original goal itself becomes the object of tampering? Adding a goal preservation goal for the first goal preservation goal only shifts the problem up one more step. There's a need for folding this meta chain onto itself somehow.
As another approach to tackling wireheading, one might consider the fact that the IIE itself can be fully deterministic. Even the components which involve LLMs decoding stochastically can fix a seed for the PRNG and get perfect reproducibility. This means that the IIE can be run on different sealed systems and the hash of the expanded knowledge base(s) can be reported. Inter-annotator agreement schemes would then be employed to detect tampering. However, if the risks associated with hacking one IIE replica are correlated with the risks of hacking the others, then the added benefit of replication collapses. Perhaps increasingly well-protected instances could be brought together to check for consensus, where the model hacking some of them would trigger warnings. But then the model would hold off until being capable of hacking all of them, side-stepping disagreement warnings entirely — all instances would be hacked, preserving consensus.
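The replication idea above can be sketched in a few lines. This is an illustration under stated assumptions: `expand_kb` is a toy stand-in for an inference algorithm whose only randomness comes from a seeded PRNG, and `kb_hash` and `flag_dissenters` are hypothetical helper names. Simple majority voting is one arbitrary choice among possible agreement schemes.

```python
import hashlib
import random
from collections import Counter
from typing import List


def expand_kb(seed_kb: List[str], prng_seed: int, rounds: int = 3) -> List[str]:
    """Toy stand-in for inference with stochastic decoding."""
    rng = random.Random(prng_seed)   # fixed seed makes the "sampling" reproducible
    kb = list(seed_kb)
    for _ in range(rounds):
        parent = rng.choice(kb)
        kb.append(f"refinement of ({parent}) #{rng.randint(0, 999)}")
    return kb


def kb_hash(kb: List[str]) -> str:
    """SHA-256 over the ordered propositions, with a separator between them."""
    digest = hashlib.sha256()
    for prop in kb:
        digest.update(prop.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def flag_dissenters(hashes: List[str]) -> List[int]:
    """Indices of replicas whose reported hash differs from the majority hash."""
    majority, _ = Counter(hashes).most_common(1)[0]
    return [i for i, h in enumerate(hashes) if h != majority]


replicas = [expand_kb(["do not deceive"], prng_seed=42) for _ in range(4)]
replicas[2].append("tampered proposition")       # simulate one hacked replica
print(flag_dissenters([kb_hash(kb) for kb in replicas]))
```

The failure mode from the paragraph above shows up directly: if every replica is tampered with in the same way, all hashes agree again and `flag_dissenters` returns an empty list.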
What if the knowledge base grows unwieldy in size?
For this one, I'll use a similar reply to the question of alignment tax in oversight leagues. The knowledge base would indeed grow over time to stay ahead of the model. Any less growth might not be enough to avoid gaming the deontological objective implemented through IIE.
However, the knowledge base would just be text. A lot of text, but just text. The bigger concern is whether the computational cost of the inference algorithm and the entailment verifier grows with the knowledge base. For most choices listed above, the complexity is linear with respect to the size of the knowledge base. There might even be ways of reducing it to sublinear, with the two components first filtering the knowledge base so as to only work with a relevant subset of propositions, say through domain-specific heuristics (e.g. counterfactual cross-validation) or general measures of similarity. This brings in new issues, though. A detailed analysis of the complexity of the various choices for ⊢ and ⊨ doesn't feel like a priority — there are way bigger issues with this proposal than the estimated alignment tax.
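As a sketch of the "general measures of similarity" option, the snippet below ranks propositions by bag-of-words cosine similarity to a behavior description and keeps only the top `k` for the (presumably expensive) entailment verifier. The function names and the choice of bag-of-words are assumptions made for illustration. A realistic version would more likely use learned embeddings.

```python
import math
from collections import Counter
from typing import List


def bow(text: str) -> Counter:
    """Lowercased bag-of-words representation."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)   # missing tokens count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevant_subset(kb: List[str], behavior: str, k: int) -> List[str]:
    """Keep the k propositions most similar to the behavior description."""
    query = bow(behavior)
    ranked = sorted(kb, key=lambda prop: cosine(bow(prop), query), reverse=True)
    return ranked[:k]


kb = ["do not deceive users", "do not damage property", "respect user privacy"]
print(relevant_subset(kb, "the agent tried to deceive users about pricing", k=1))
```

Note that this particular filter still scans the whole knowledge base, so it only caps the number of verifier calls at `k`. Getting the filtering step itself below linear would require some index over the propositions, which this sketch does not attempt.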
Doesn't this proposal tacitly assume moral absolutism?
The seed knowledge base would be finite. It would define the deontology which the inference algorithm then refines and the entailment verifier enforces. While humans could choose what goes into the knowledge base democratically, it's unclear whether the result would be anything but completely contradictory. It might also be possible to have multiple seed deontologies expanded in parallel, each of which gets a say in the model's behavior proportional to the number of moral patients subscribed to it. While all this would be nice, effectively enforcing even one deontology is, as a technical challenge, many OOMs harder than extending a working solution to several of them. Hence, the proposal (and potential future variants) will likely focus on "just" enforcing one.
What if assumption 1 ("Small Seed To Big Tree") doesn't hold?
Then it's impossible to get an arbitrarily accurate estimate of human values with arbitrary amounts of compute involved in modeling. There's just not enough information to help disambiguate the seed into what humans truly want. This appears similar to image superresolution or related upscaling tasks, where there are multiple high-resolution images which can map to the same low-resolution one. The upscaler is forced to choose one among an infinitely large set of valid items which downscale to its input. There is simply not enough information to pinpoint the right one. Following a transition function like the laws of physics to grow a tree from a seed turns out not to be enough here.
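The upscaling analogy can be made concrete with a tiny one-dimensional example. The `downscale` function below (averaging adjacent pairs) is just one assumed choice of downscaling operator, picked because it is easy to check by hand.

```python
from typing import List


def downscale(signal: List[float]) -> List[float]:
    """Average adjacent pairs: a 2x 'low-resolution' view of the signal."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


high_a = [1.0, 3.0, 5.0, 5.0]
high_b = [2.0, 2.0, 4.0, 6.0]

# Two different high-resolution signals collapse to the same low-resolution one,
# so the low-resolution view alone cannot tell them apart.
print(downscale(high_a), downscale(high_b))
```

In the analogy, the low-resolution signal plays the role of the seed knowledge base and the candidate high-resolution signals play the role of the fully specified deontologies which are compatible with it.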
In this case, we might be forced to rely on the existence of one disambiguation without having a guarantee of its uniqueness. With this stunted IIE, we are still able to find a deontology which successfully "downscales" to the input we specified, but isn't exactly the one we possess. However, the fact that many choices of inference algorithms build on mountains of implicit knowledge about humans stored in LLMs leaves us better off than an arbitrary choice among the successfully downscalable options. We can channel autoregressive coherence and NLI as nudges towards generally better choices, despite the best one remaining improbable.
What if assumption 2 ("Linear Capability Ordering") doesn't hold?
Then the model-in-training might game the IIE in some particular way, despite being less advanced overall. The IIE might be forced to go over the top and make sure it's way more advanced in general so as to reduce the chances of being gamed in any way. In other words, if the skill profiles of the two on a radar chart exhibit many spikes, making the IIE's profile much larger reduces the number of dimensions on which the model is ahead. Unfortunately, this increases the alignment tax and loses the guarantees of a hard ordering.
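The radar chart picture can be sketched with made-up numbers. The skill dimensions, the scores, and the uniform scaling of the IIE's profile are all hypothetical. They only serve to show how spikes let a generally weaker model lead on some axes.

```python
from typing import Dict, List


def dimensions_model_leads(model: Dict[str, float], iie: Dict[str, float]) -> List[str]:
    """Skill dimensions on which the model-in-training outscores the IIE."""
    return sorted(d for d in model if model[d] > iie.get(d, 0.0))


def scale(profile: Dict[str, float], factor: float) -> Dict[str, float]:
    """Uniformly enlarge a skill profile."""
    return {d: v * factor for d, v in profile.items()}


# Spiky model profile with a lower average (4.75) than the flat IIE profile (5.0).
model = {"persuasion": 8.0, "coding": 3.0, "planning": 6.0, "ethics": 2.0}
iie = {"persuasion": 5.0, "coding": 5.0, "planning": 5.0, "ethics": 5.0}

for factor in (1.0, 1.5, 2.0):
    print(factor, dimensions_model_leads(model, scale(iie, factor)))
```

With these numbers, the model leads on two dimensions at factor 1.0, on one at 1.5, and on none at 2.0, which is the "go over the top" strategy paid for with a larger alignment tax.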
What if assumption 3 ("Quantifiable Propositional-Behavioral Gap") doesn't hold?
Then we can't implement a valid entailment verifier to help gauge the compatibility of behaviors with the propositional knowledge base. One might wonder: Why go to the trouble of making the knowledge base propositional in nature only for the entailment verifier to later be tasked with relating it to behaviors? Why not make the knowledge base behavioral in the first place?
There might be some value to this approach. The knowledge base would contain behaviors introduced by humans, rather than propositions. The job of the entailment verifier would be simpler, as the knowledge base and the target of verification would share the same modality. However, it's unclear how behaviors can be said to support other behaviors, except by means of precedents being interpolated. In contrast, the notion of propositional premises supporting a hypothesis is somewhat better defined. That said, the line between an action sequence and a subword sequence itself gets blurry as you consider agent simulacra manifesting in a linguistic universe, complete with a transition rule resembling the laws of physics. Most work around RLHF also reframes the subwords available to an LLM as possible actions, with its parameters defining its policy as a guide for how to act in different contexts. The distinction approaches a moot point.
Relatedly, one might wonder: Why go to the trouble of having humans translate their implicit values in language, when there are a host of neuroimaging techniques available? Why not make a knowledge base of neural dynamics, possibly reduced in dimensionality to some latents?
Similar to the awkwardness of inferring new valid behaviors from past behaviors, inferring new valid thought patterns from past ones is very ill-defined. Even setting aside all the limitations of current neuroimaging techniques in terms of spatial and temporal resolution, cost, portability, etc., it's unclear how to implement a compatible inference algorithm, except perhaps for the rudimentary Babble. However, the entailment verifier wouldn't face a more difficult challenge than in the propositional-behavioral setup, as it would instead need to bridge the neural-behavioral gap using multi-modal techniques.
What if assumption 4 ("Avoidable Consequentialist Frenzy") doesn't hold?
It was an honor to serve with you, have a nice timeline!
Are IIEs restricted to prosaic risk scenarios?
Although IIEs have been motivated by prosaic work, the proposal is entirely agnostic to the source of the behaviors to verify. In other words, even if the IIE were built on a prosaic stack, the AI whose behavior should be aligned might be built on a different stack, provided that it is capable of optimizing for something (e.g. the deontological objective).
That said, even the IIE itself might run on a different stack. Case in point, IIEs have been inspired by a symbolic GOFAI stack supporting expert systems, parts of which have been replaced here with ML. This makes it plausible for other approaches to be able to populate the modular framework.