This post was written under Evan Hubinger's direct guidance and mentorship, as a part of the Stanford Existential Risks Initiative ML Alignment Theory Scholars (MATS) program.
TL;DR: If we want to understand how AI models make decisions, and thus assess alignment, we generally want to extract their internal latent information. That information might be inaccessible to us by default and defy easy extraction. Resolving this problem might be critical to AI alignment.
Paul Christiano's conception of 'inaccessible information' seems to be an important and useful lens for interpreting many key concepts in AI alignment. In this post, I will attempt to explain (in an accessible way):
- What inaccessible information is;
- How inaccessible information might arise in ML models; and
- What insight inaccessible information offers for alignment research.
What is inaccessible information?
Roughly, inaccessible information is anything a model knows that an external agent cannot reliably learn. For instance, in the process of training, a sufficiently large model might acquire a complicated internal structure that is opaque to direct inspection. If we want to understand the process by which this 'black box' turns inputs into outputs, perhaps to learn new theories/heuristics, check for undesirable future behaviour or inspect the model's limitations (e.g. myopia), it seems natural to also train the model to produce explanations for its outputs. However, the model might learn to produce convincing explanations that are not necessarily true representations of its input-output process. If true and false but convincing explanations are indistinguishable to the training mechanism (e.g. a human with a reward button), the model may converge to the instrumental policy: 'give whatever explanation is most convincing'. This scenario is clearly counter to the desired outcome, as the model does not learn to give strictly true explanations and therefore its outputs are contingent on inaccessible information.
Generally, we might be unable to distinguish true and false but convincing explanations of a model's output if:
- We cannot directly check that an explanation is true (e.g. 'Alice refused ice cream because she was depressed'); and
- We cannot build a trusted model that enables us to check the original model (e.g. a trusted short-term weather forecaster that we can extrapolate to test a long-term hurricane forecaster).
A property related to 'explainability' is 'transparency', also called 'interpretability'. A model is transparent to an external agent if inspection of the model is sufficient to understand how it operates. From the perspective of such an external agent, more transparent models generally contain less inaccessible information. Model transparency can be increased by:
- Improving model inspection tools;
- Training a model to be more accessible; and
- Structuring a model's architecture to be inherently accessible.
How might inaccessible information arise in ML models?
The training-induced structure of a machine learning model can be thought of as containing 'latent information' that shapes the model's output. Models that perform well on complicated tasks in diverse environments (e.g. 'What will Alice say next?') might acquire latent information that aids predictions (e.g. a model of Alice's thoughts). Larger models, or models that include highly compressed representations of the environment (i.e. big world, small map), are generally less accessible to direct inspection. A model's latent information might be difficult to render accessible by training it to give explanations or be otherwise transparent if, for example, there is:
- Training incentives that favour the instrumental policy;
- Learned deception; or
- Incompatibility with human world models.
Training incentives that favour the instrumental policy
In some situations, training might bias AI models towards dishonesty. An otherwise non-deceptive model trained to produce explanations for its outputs might converge to the instrumental policy if the model is powerful enough to produce false but convincing explanations and:
- Training incentivises a short policy description length; or
- True explanations are generally less convincing than the most convincing false explanations.
The instrumental policy is short as it simply asks the model, 'What explanation is most convincing?' and leverages the already established power of the model to produce accessible output. A policy to produce true explanations likely requires a further 'translation mechanism' to explain latent information in an accessible way and thus has longer description length.
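The description-length argument above can be made concrete with a toy calculation. This is only a sketch under stated assumptions: it supposes that training behaves like a simplicity prior that weights each policy by 2^-(description length in bits), that both policies fit the training data equally well, and that the bit counts below (hypothetical, chosen purely for illustration) are meaningful. None of these numbers come from the original argument.

```python
from fractions import Fraction


def simplicity_posterior(policies):
    """Posterior weight of each policy under a 2^-length prior.

    Assumes every policy fits the training data equally well, so the
    prior alone decides which policy is favoured.
    """
    weights = {name: Fraction(1, 2 ** bits) for name, bits in policies.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical description lengths in bits (illustrative assumptions only).
SHARED_CAPABILITY_BITS = 100  # machinery that both policies reuse
TRANSLATION_BITS = 10         # extra 'translation mechanism' needed for honesty

policies = {
    "instrumental": SHARED_CAPABILITY_BITS,
    "honest": SHARED_CAPABILITY_BITS + TRANSLATION_BITS,
}

posterior = simplicity_posterior(policies)
for name, weight in posterior.items():
    print(f"{name}: {float(weight):.4f}")
```

Under these assumptions the shared machinery cancels out, and only the extra translation bits matter: each additional bit halves the honest policy's weight relative to the instrumental policy.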
True explanations might sometimes be less convincing than false explanations if the latter seem more plausible due to incomplete data, or human biases and mistakes. For example, given the model prediction, 'Bob will refuse ice cream,' the false explanation, 'Bob refuses ice cream on cold days and today is cold,' might be a more convincing narrative about how the model operates than the true explanation, 'Bob refuses to eat ice cream every third day and today is a refusal day.' In the latter case, the truth 'looks wrong'.
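The Bob example can be sketched as a toy reward comparison. The sketch assumes a hypothetical evaluator whose reward depends only on how convincing an explanation sounds. The `convincingness` scores are invented for illustration, and the `is_true` flag is visible to us as authors of the toy but not to the evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    text: str
    is_true: bool          # hidden from the evaluator
    convincingness: float  # hypothetical score; all the evaluator perceives


def evaluator_reward(explanation):
    # The evaluator cannot check truth, so reward tracks plausibility only.
    return explanation.convincingness


def honest_policy(candidates):
    # Returns the true explanation regardless of how it will be received.
    return next(e for e in candidates if e.is_true)


def instrumental_policy(candidates):
    # 'Give whatever explanation is most convincing.'
    return max(candidates, key=evaluator_reward)


candidates = [
    Explanation("Bob refuses ice cream on cold days and today is cold",
                is_true=False, convincingness=0.9),
    Explanation("Bob refuses ice cream every third day and today is a refusal day",
                is_true=True, convincingness=0.4),
]

honest = honest_policy(candidates)
instrumental = instrumental_policy(candidates)
print("honest reward:", evaluator_reward(honest))
print("instrumental reward:", evaluator_reward(instrumental))
```

In this toy setting the instrumental policy can never score below the honest one, because it maximises exactly the quantity being rewarded. Whenever the truth 'looks wrong', it scores strictly higher, so training on this reward pushes away from honesty.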
Christiano has proposed that the instrumental policy might be avoided with the following training process:
- Change the learning process so that it does not have a strong inductive bias towards "predict human answers," by allowing the complexity of the honest question-answering to "pay for itself" by constraining the space of possible human-models.
- Introduce a bias towards the intended model by using a more complex labeling process to answer questions where a human answers incorrectly.
- Be really careful to avoid penalizing honest answers, by only judging comparisons between two answers where we are confident one is better than the other and getting the model to help us.
Learned deception
If an AI model acquires the capabilities of an 'optimiser' and learns to be deceptive, it might try to circumvent attempts to access its latent information. Deceptive alignment of a model occurs when the training objective is at odds with the model's internally optimised 'mesa-objective', causing the model to act deceptively in order to influence the training process and avoid modification. A deceptively aligned model may be instrumentally motivated to obfuscate its internal mechanism and give false but convincing explanations for some outputs when elicited, including its deception status. If such a model can trick or avoid our tests for deception, the model's alignment status and mesa-objective are inaccessible information.
More generally, a model's alignment status might be inaccessible independent of deception if we don't possess tests that can assess mesa-objective inner alignment or training mechanisms that can guarantee a model avoids acquiring pseudo-alignment or even optimisation. Determining whether a model is performing optimisation might be inaccessible, particularly if the model is already deceptive and powerful.
Incompatibility with human world models
Even if we can reliably circumvent the failure modes associated with model honesty, we might fail to elicit true representations of latent information if an AI's potentially alien world model is not appropriately translated into human terms. The simplest method for an AI to answer a question (e.g. 'Why is Alice coughing?') could involve generating predictive observations and directly mapping these to states in the human world model. However, some potential states (e.g. 'Sickness caused by exotic nanobots') might not be entirely representable in the human world model (which excludes 'exotic nanobots'). In this case, the explanation might default to, 'What would a human say given this observation?' (e.g. 'Alice is sick') rather than, 'What would a human say if they had the AI's world model and observations?' (e.g. 'Nanobots!').
Ideally, an AI will generate observations, infer states in its world model, crucially map these to states in the human world model, and then produce an accessible and true answer (e.g. 'Alice is sick because of [uninterpretable N], which will confound conventional diagnosis and might signify a technological catastrophe!'). It is possible that the mapping between AI and human world models is lossy and some aspect of the AI's decision-making process is inaccessible (i.e. the [uninterpretable N]). Christiano argues that such lossy mapping may not be a further problem for alignment, provided that the AI is competitive and consequences are accessible (e.g. even if exotic nanobot capabilities are not accessible, their impacts on humans are). Thus, an impact-aligned AI could forestall exotic failures with exotic capabilities while merely avoiding accessible consequences.
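The difference between the two answering strategies can be sketched with a toy lookup table. Everything here is a hypothetical illustration: the state names, the `AI_TO_HUMAN` table and the two answer functions are assumptions made for the sketch, and a real world model would not be a dictionary.

```python
# Hypothetical mapping from AI world-model states to the closest human
# world-model state, plus any residue the human model cannot represent.
AI_TO_HUMAN = {
    "common_cold": ("sick", None),
    "exotic_nanobot_infection": ("sick", "[uninterpretable N]"),
}


def naive_answer(ai_state):
    """'What would a human say given this observation?'

    Drops anything the human world model cannot represent.
    """
    human_state, _residue = AI_TO_HUMAN[ai_state]
    return f"Alice is {human_state}"


def translated_answer(ai_state):
    """'What would a human say with the AI's world model and observations?'

    Keeps a placeholder for the part of the state that does not translate.
    """
    human_state, residue = AI_TO_HUMAN[ai_state]
    if residue is None:
        return f"Alice is {human_state}"
    return f"Alice is {human_state} because of {residue}"


for state in AI_TO_HUMAN:
    print(state, "->", naive_answer(state), "|", translated_answer(state))
```

The naive answer collapses both AI states into the same human-level statement, so the human cannot tell them apart. The translated answer is still lossy, but it at least signals that something was lost.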
What insight does inaccessible information offer for alignment research?
It seems to me that AI alignment efforts in general should not ignore the problem of inaccessible information. Developing powerful AI capabilities may require training our models to acquire large amounts of decision-relevant latent information that is inaccessible by default. Verifying or incentivising acceptable model behaviour likely requires accessing latent information or preventing the occurrence of certain inaccessible properties, such as deception. Focusing on resolving inaccessible information may yield useful tools and techniques for producing robustly capable and aligned models.
I think that powerful AI systems are likely to be inaccessible by default because:
- Acquiring highly complex latent information likely confers a competitive advantage to an AI system in complex domains such as long-term forecasting or modelling humans. If AI systems with insufficient latent information are deployed in complex domains, they will likely be outcompeted by systems with more latent information, incentivising deployment of the latter. More latent information suggests a higher likelihood of inaccessibility, as there are larger circuits, a greater simplicity bias towards the instrumental policy and more potential for world model mismatches. Additionally, deception probably only occurs in sufficiently large and complex models.
- Rendering latent information accessible probably decreases the training competitiveness of building a powerful AI system. For example, training a model to be transparent or give true explanations for its outputs probably increases training time beyond that necessary for developing capabilities. If training competitiveness is prioritised, less accessible models are likely to be deployed.
- Generally powerful AI systems are likely to be mesa-optimisers, which may be instrumentally motivated to acquire deceptive behaviour and circumvent accessibility techniques. Even if aligned, mesa-optimisers may be more inscrutable to inspection than simpler models, such as decision trees, if the optimisation process is not hard-coded.
- Human values might be irreducibly complex and hard to define robustly. Even if an AI learns human values by default, the value function may be inaccessible, and thus difficult to verify, if human values are not simple to parametrise.
I think that proposed methods for building safe, advanced prosaic AI likely require resolving instances of inaccessible information because:
- Auditing models with transparency tools requires that such tools can extract meaningful information from massive neural networks. If model training does not select for transparency, creating tools to render sufficiently large and obfuscated networks transparent might be infeasible. If transparency tools can reliably check an 'easily measurable' property of models that guarantees nondeception (e.g. possibly myopia), this might be easier than checking for deceptive instances when the model is potentially already attempting to circumvent such tests.
- Generating a gradient away from deception with relaxed adversarial training seems contingent on whether the overseer (e.g. an amplified human) can inspect the model adequately enough to rule out unacceptable behaviour. Relaxed adversarial training might incentivise a model to become more transparent to the overseer, who could employ transparency tools to check an easily checkable property (e.g. myopia) and query the model for explanations. Overseers might be automated, but this seems risky to me if overseers are not themselves transparent to humans.
- Non-agentic microscope AI and STEM AI, while ideally not capable of deceptive mesa-optimisation, will likely use complex latent information to accomplish tasks. If we desire to understand the mechanism by which these models produce insight, perhaps to error-check or learn new knowledge-acquiring techniques, we will have to deal with inaccessible information. Microscope and STEM AI are probably not competitive in domains where AI-enhanced humans are outcompeted by powerful agentic AI, incentivising deployment of the latter, which seem likely to be more inaccessible by default.
Generally solving the problem of inaccessible information seems sufficient to solve a significant part of prosaic AI alignment, provided we can act on this information. If we can elicit true explanations for model output during training and inspect relevant model properties with transparency tools, we can hopefully steer the training process away from deceptive alignment and even catch deception. If our aligned AI model is competitive against unaligned AIs, the strategy stealing assumption implies that we can use it to capture flexible influence over the future, even if it has to develop exotic capabilities.
It seems debatable whether generally solving inaccessible information is necessary for prosaic AI alignment if there exist:
- Training mechanisms and architectural choices that produce transparent models;
- A checkable property of models that guarantees acceptability; or
- Training mechanisms that steer models away from unacceptability.
However, it seems possible to me that the above safety predicates might in general be 'probabilistic' guarantees against unacceptability. For instance, our best transparency tools and training mechanisms might only be able to guarantee 99% probability that a model possesses a checkable property that almost certainly implies acceptability. As deploying unacceptable AI is a 'high risk' scenario, we would ideally like to have the strongest possible guarantee of acceptability, something like Christiano's ascription universality: if an overseer understands everything that a model might reasonably know, the overseer is universal with respect to the model and all the model's latent information is accessible. Practically, universality is bounded by the overseer's epistemic model. If no further information could be reasonably found such that a human judge would trust the model over the overseer, the overseer 'epistemically dominates' the model. As of yet, there is no formal definition of ascription universality and it is unknown if universal overseers are possible in practice.
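The 99% example above can be worked through numerically. This is a rough sketch: the 0.999 figure standing in for 'almost certainly implies acceptability', the count of 100 models, and the independence assumption are all my own illustrative choices rather than claims from the argument.

```python
def acceptability_lower_bound(p_property, p_acceptable_given_property):
    """Lower bound on P(acceptable) from the law of total probability.

    Ignores the non-negative contribution from models that lack the
    property, so the true probability can only be higher.
    """
    return p_property * p_acceptable_given_property


def p_all_acceptable(p_single, n_models):
    """Probability that n models are all acceptable.

    Assumes each model is acceptable independently with probability p_single.
    """
    return p_single ** n_models


# 0.99 comes from the example in the text; 0.999 is an assumed stand-in
# for 'almost certainly implies acceptability'.
bound = acceptability_lower_bound(0.99, 0.999)
print(f"single model: at least {bound:.5f}")

# Hypothetical: 100 independently verified models at exactly that bound.
print(f"all of 100 models: {p_all_acceptable(bound, 100):.3f}")
```

Under these assumptions a single model looks reassuring, but the chance that every one of a hundred such models is acceptable falls to roughly a third. This is one way of seeing why a merely probabilistic guarantee might fall short in a 'high risk' scenario.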
MIRI's recently announced Visible Thoughts Project aims to build a dataset of human explanations for AI dungeon master prompts and train a dungeon master AI to give accessible explanations for its output. Unless steps are taken to address the failure modes detailed in the previous section, it seems unlikely to me that the trained explanations will necessarily converge to true explanations. In particular, a near-term dungeon master AI might converge to the instrumental policy and treat 'humans who read dungeon master explanations' to a convincing fiction in which they roleplay as 'AI inspectors'.
Given the enormous stakes of AI alignment and the apparent ubiquity of inaccessible information in prosaic AI, striving for something close to ascription universality seems important. Improving the self-explanation and transparency of models, and the inspection power of trusted overseers, is critical to universality and will likely benefit acceptability verification even if universality is impractical. I think the feasibility of addressing inaccessible information is a crux as to whether aligning prosaic AI is practical. Therefore, attempting to address inaccessible information will both aid prosaic alignment efforts and offer a possible test to rule out safe prosaic AI.