Roger Dearnaley

I'm a staff artificial intelligence engineer in Silicon Valley currently working with LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I'm now actively looking for employment working in this area.

Sequences

AI, Alignment, and Ethics

Comments

What I would be interested to understand about feature splitting is whether the fine-grained features are alternatives, describing an ontology, or are defining a subspace (corners of a simplex, like R, G, and B defining color space). Suppose a feature X in a small VAE is split into three features X1, X2, and X3 in a larger VAE for the same model. If occurrences of X1, X2, and X3 are correlated, so activations containing any of them commonly have some mix of them, then they span a 2-d subspace (in this case the simplex is a triangle). If, on the other hand, X1, X2 and X3 co-occur in an activation only rarely (just as two randomly-selected features rarely co-occur), then they describe three similar-but-distinct variations on a concept, and X is the result of coarse-graining these together as a single concept at a higher level in an ontology tree (so by comparing VAEs of different sizes we can generate a natural ontology).

This seems like it would be a fairly simple, objective experiment to carry out. (Maybe someone already has, and can tell me the result!) It is of course quite possible that some split features describe subspaces, and others ontologies, or indeed something between the two where the features co-occur rarely but less rarely than two random features. Or X1 could be distinct but X2 and X3 might blend to span a 1-d subspace. Nevertheless, understanding the relative frequency of these different behaviors would be enlightening.
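As a rough sketch of how that experiment might be scored: assuming we have a matrix of feature activations (rows are tokens, columns are features) from the larger autoencoder, we can compare how often the split features co-activate against how often random feature pairs do. The function names, the activation threshold, and the "lift" statistic are illustrative choices of mine, not an established protocol.

```python
import numpy as np

def cooccurrence_lift(acts, i, j, threshold=0.0):
    """Observed co-activation rate of features i and j, divided by the
    rate expected if they fired independently (1.0 means independent)."""
    a = acts[:, i] > threshold
    b = acts[:, j] > threshold
    expected = a.mean() * b.mean()
    if expected == 0:
        return float("nan")  # at least one feature never fires
    return float((a & b).mean() / expected)

def compare_to_baseline(acts, split_ids, rng, n_baseline=200, threshold=0.0):
    """Mean pairwise lift among the split features, and the mean lift of
    randomly chosen feature pairs as a baseline for comparison."""
    pairs = [(i, j) for k, i in enumerate(split_ids) for j in split_ids[k + 1:]]
    split_lift = float(np.nanmean(
        [cooccurrence_lift(acts, i, j, threshold) for i, j in pairs]))
    baseline = []
    for _ in range(n_baseline):
        i, j = rng.choice(acts.shape[1], size=2, replace=False)
        baseline.append(cooccurrence_lift(acts, i, j, threshold))
    return split_lift, float(np.nanmean(baseline))
```

A split-feature lift well above the baseline would suggest the features blend into a subspace; a lift at or below the baseline would suggest mutually exclusive alternatives in an ontology; values in between would correspond to the mixed cases.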

It would be interesting to validate this using a case like the days of the week, where we believe we already understand the answer: they are 7 alternatives that are laid out in a heptagon in a 2-dimensional subspace that enables doing modular addition/subtraction modulo 7. So if we have a VAE small enough that it represents all day-of-the-week names by a single feature, then if we increase the VAE size somewhat we'd expect to see this split into three features spanning a 2-d subspace, and if we increased it more we'd expect to see it resolve into 7 mutually-exclusive alternatives, and hopefully then stay at 7 in larger VAEs (at least until other concepts started to get mixed in, if that ever happened).
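To make the heptagon picture concrete, here is a toy sketch (not extracted from any real model) in which each day is a unit vector at a vertex of a regular heptagon, and adding an offset modulo 7 is a rotation in that 2-d plane. The Monday-first ordering and the helper names are assumptions for illustration.

```python
import numpy as np

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def embed(k, n=7):
    """Place item k at vertex k of a regular n-gon on the unit circle."""
    angle = 2 * np.pi * k / n
    return np.array([np.cos(angle), np.sin(angle)])

def rotate(v, steps, n=7):
    """Rotate v by `steps` vertices; this is addition modulo n."""
    angle = 2 * np.pi * steps / n
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ v

def decode(v, n=7):
    """Return the index of the vertex most aligned with v."""
    return int(np.argmax([embed(k, n) @ v for k in range(n)]))

def add_days(day, offset):
    """Shift a day name by `offset` days using only the 2-d geometry."""
    return DAYS[decode(rotate(embed(DAYS.index(day)), offset))]
```

For example, `add_days("Friday", 4)` returns `"Tuesday"`, matching (4 + 4) mod 7 = 1.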

If you're not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural, if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead is a simple definition of a goal (such as "want whatever outcomes the humans want") that makes it clear that the agent does not yet know the true detailed utility function and thus requires it to go attempt to find out what the detailed specification of the utility function pointed to by the goal is (for example, by researching what outcome humans want).

Then a human shutdown instruction becomes the useful information "you have made a large error in your research into the utility function, and as a result are doing harm, please shut down and let us help you correct it". Obeying that is then natural (to the extent that the human(s) are plausibly more correct than the AI).

Suppose that the more powerful being is aligned to the less powerful: that is to say that (as should be the case in the babysitting example you give) the more powerful being's fundamental motive is the well-being of the less powerful being. Assume also that a lot of the asymmetry is of intellectual capacity: the more powerful being is also a great deal smarter. I think the likely and correct outcome is that there isn't always consent: the less powerful being is frequently being manipulated into actions and reactions that they haven't actually consented to, and might not even be capable of realizing why they should consent to, but ones that, if they were as intellectually capable as the more powerful being, they would in fact consent to.

I also think that, for situations where the less powerful being is able to understand the alternatives and make a rational and informed decision, and wants to, the more powerful should give them the option and let them do so. That's the polite, respectful way to do things. But often that isn't going to be practical, or desirable, and the babysitter should just distract the baby before they get into the dangerous situation.

Consent is a concept that fundamentally assumes that I am the best person available to make decisions about my own well-being. Outside parental situations, for interactions between evolved intelligences like humans, that's almost invariably true. But if I had a superintelligence aligned to me, then yes, I would want it to keep me away from dangers so complex that I'm not capable of making an informed decision about them.

As text, the Walnut53 encoded data just looks odd/suspicious: even a fairly simple text classifier model could probably classify it as out of distribution (too many | characters, too many 2-character tokens). I would suspect also that if you looked at the activity of the final layers of the LLM, the fact that something unusual is happening while it's generating Walnut53 output would also be clear. It seems a reasonable default assumption from the foundation model providers that if a customer's finetuning data is obviously training the model to speak (what looks like) an encoded format, and many of their communications with the trained model consist of that encoded format, then they may well be up to something nefarious.

Much more concerning to me are steganographic approaches like EndSpeak. In that specific case the true signal is embedded in bad poetry, but clearly there are a great many possibilities: boring-looking lists, JSON/HTML/code, letters to the editor… While the paper authors fine-tuned the model to use steganography starting overtly, in a way that could be discerned by looking at their training data (probably even by an LLM with the right prompt), I suspect that this could also be done entirely covertly, using training data that just had concealed steganographic correlations in it and relying on the model learning those, perhaps using a syllabus learning approach: start with something simple then build on it. 

Longer term, I think the most promising-looking defense here (as the paper authors propose) would require significant progress in interpretability: in an automated way, look at the output text, predict what interpretability features should be triggered by generating it, compare that to the actual model interpretability feature activations log, and notice that the model was actually frequently thinking about dangerous interpretability features that don't correspond to obvious text. That sounds hard, and like it would be a significant extra computational expense for monitoring the output of fine-tuned models.

For blocking specific, narrow high-risk capabilities like bioweapons development expertise, WoMD engineering, self-replication skills, and so forth, I think the best approach may be to either filter the base model training set to ensure that these capabilities never get into the base model in the first place, or else figure out how to selectively and specifically but comprehensively delete/degrade them from the model before giving anyone fine-tuning access to it, so that the capabilities don't exist in the model, rather than the model just being trained not to use them. However, for more mundane issues like writing spearphishing letters or telling you how to make a pipe bomb or cut down a Stop sign, this seems unlikely to be practicable: the list of skills that would need to be censored is too long.

On the other hand, there are easier ways to find out how to make a pipe bomb than this sort of steganographic covert malicious fine-tuning attack: in practice an attacker would only bother with something as elaborate as this if they thought they could extract significant value from the resulting jail-broken model, so we're probably looking at an attacker like a dishonest business, group of porn enthusiasts, crime syndicate, terrorist organization, or rogue state. So fine-tuning access may be the sort of thing that requires know-your-customer type security measures.

Although the residuals for each of the four component matrices (after removing the first two principal components) are both small and seem to be noise, proving that there's no structure that causes the noise to interact constructively when we multiply the matrices and “blow up” is hard. 

Have you tried replacing what you believe is noise with actual random noise, with similar statistical properties, and then testing the performance of the resulting model? You may not be able to prove the original model is safe, but you can produce a model that has had all potential structure that you hypothesize is just noise replaced, where you know the noise hypothesis is true.
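A minimal sketch of that substitution for a single weight matrix, assuming "similar statistical properties" is approximated by i.i.d. Gaussian noise matching the residual's mean and standard deviation (a simplification; one might also want to match the residual's singular value spectrum). The function name and the `rank=2` default are illustrative.

```python
import numpy as np

def replace_residual_with_noise(W, rank=2, rng=None):
    """Keep the top-`rank` principal components of W and swap the
    remaining residual for fresh Gaussian noise of matching mean/std."""
    rng = np.random.default_rng() if rng is None else rng
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Low-rank part: the structure we believe is real.
    structured = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    # Everything else: the part hypothesized to be noise.
    residual = W - structured
    noise = rng.normal(loc=residual.mean(), scale=residual.std(), size=W.shape)
    return structured + noise
```

One would apply this to each of the four component matrices, rebuild the model from the results, and compare its performance against the original over several noise seeds.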

Suppose that the safety concerns that you outline have occurred. For example, suppose that for some future LLM, even though we removed all explicit instructions on how to make a nuclear weapon from the training set, we found that (after suitable fine-tuning or jailbreaking) the model was still able to give good instructions on how to design and build a nuclear weapon due to Out Of Context Reasoning.

In that case the obvious next step would be to elicit multiple examples of this behavior, and then apply Influence Functions techniques to these in order to determine which training documents were most being used in the OOCR process to enable this unwanted dangerous capability, and try to find a subset of training documents that could be removed (or at least have certain parts of them censored) from the training set to remove the ability to deduce and carry out the unwanted dangerous capability. Note that this is likely to be a computationally extremely expensive process, particularly since it is likely to involve repeatedly retraining a very large LLM with more censored training sets until the unwanted behavior is eliminated.

Even harder to remove would be the ability to use CoT and in-context learning to reconstruct/reinvent and then carry out the unwanted dangerous capability: the more capable the model is, the harder this becomes to prevent. For example, if the model's capabilities include those of a skilled theoretical physicist, with enough CoT work it might well be able to compute, say, the critical mass for U-235. (Of course, the Manhattan project was not entirely a matter of theoretical calculations: some practical experimentation was required too. However, with modern computational resources that may be less true.)

This behavior is deeply unsurprising. AI's intelligence and behavior were basically "distilled" from human intelligence (obviously not using a distillation loss, just SGD). Humans are an evolved intelligence, so (while they can cooperate under many circumstances, since the world contains many non-zero-sum games) they are fundamentally selfish, evolved to maximize their personal evolutionary fitness. Thus humans are quite often deceptive and dishonest, when they think it's to their advantage and they can get away with it. LLMs' base models were trained on a vast collection of human output, which includes a great many examples of humans being deceptive, selfish, and self-serving, and LLM base models of course pick these behaviors up along with everything else they learn from us. So the fact that these capabilities exist in the base model is completely unsurprising: the base model learnt them from us.

Current LLM safety training is focused primarily on "don't answer users who make bad requests". It's thus unsurprising that, in the situation of the LLM acting as an agent, this training doesn't have 100% coverage on "be a fine, law-abiding, upstanding agent". Clearly this will have to change before near-AGI LLM-powered agents can be widely deployed. I expect this issue to be mostly solved (at the AGI level, but possibly not at the ASI level), since there is a strong capabilities/corporate-profitability/not-getting-sued motive to solve this.

It's also notable that the behaviors described in the text could pretty much all be interpreted as "excessive company loyalty, beyond the legally or morally correct level" rather than actually personally-selfish behavior. Teaching an agent whose interests to prioritize in what order is likely a non-trivial task.

Epistemic status: I work for an AI startup, have worked for a fair number of Silicon Valley startups over my career, and I would love to work for an AI Alignment startup if someone's founding one.

There are two ways to make yourselves and your VC investors a lot of money off a startup:

  1. a successful IPO
  2. a successful buy-out by a large company, even one with an acquihire component

If you believe, as I and many others do, that the timelines to ASI are probably short, as little as 3-5 years, and that there will be a major change between aligning systems up to human intelligence and at human intelligence on up, then it is quite plausible that the major very-well-funded superscalers will be desperately looking for Alignment intellectual property/talent/experienced staff some time in the next 3-5 years. That would seem to make exit strategy 2 the primary one to aim for here.

Unfortunately, this rather pushes against just publishing everything that your researchers discover (though arguably coping with this is no harder than coping with building a successful startup around commercial applications of software that you open-source, a pretty common tactic).

I'm not certain that Myth #1 is necessarily a myth for all approaches to AI Safety. Specifically, if the Value Learning approach to AI safety turned out to be the most effective one, then the AI will be acting as an alignment researcher and doing research (in the social sciences) to converge its views on human values to the truth, and then using that as an alignment target. If in addition to that, you also believe that human values are a matter of objective fact (e.g. that they are mostly determined by a set of evolved Evolutionary Psychology adaptations to the environmental niche that humans evolved in), and are independent of background/culture/upbringing, then the target that this process converges to might be nearly independent of the human social context in which this work started, and of the desires/views/interests of the specific humans involved at the beginning of the process.

However, that is a rather strong and specific set of assumptions required for Myth #1 not to be a myth: I certainly agree that in general and by default, for most ideas in Alignment, human context matters, and that the long-term outcome of a specific Alignment technique being applied in, say, North Korea, might differ significantly from it being applied in North America.

For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, thus easy to simulate, but in a competitive situation if they have a motivation to be hard to simulate, then it is to their advantage to be as hard as possible to simulate and to run a decision process that is as complex as possible. (For example "shortly before the upcoming impact in our game of chicken, leading up to the last possible moment I could swerve aside, I will have my entire life up to this point flash before my eyes, hash certain inobvious features of this, and, depending on the twelfth bit of the hash, I will either update my decision, or not, in a way that it is unlikely my opponent can accurately anticipate or calculate as fast as I can".)
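A toy version of that hash-based decision, purely for illustration: the choice of SHA-256, the byte-string stand-in for "features of my history", and counting the twelfth bit from the least-significant end are all assumptions of mine.

```python
import hashlib

def update_decision(private_history: bytes, bit_index: int = 11) -> bool:
    """Hash some privately known data and return one bit of the digest.
    bit_index=11 selects the twelfth bit, counting from the low end."""
    digest = hashlib.sha256(private_history).digest()
    value = int.from_bytes(digest, "big")
    return bool((value >> bit_index) & 1)
```

The result is cheap for the agent holding `private_history` to compute, but an opponent who cannot reconstruct those exact bytes gains essentially no information about the bit.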

In general, it's always possible for an agent to generate a random number that even a vastly-computationally-superior opponent cannot predict (using quantum sources of randomness, for example).

It's also possible to devise a stochastic non-linear procedure where it is computationally vastly cheaper for me to follow one randomly-selected branch of it than it is for someone trying to model me to run all branches, or even Monte-Carlo simulate a representative sample of them, and where one can't just look at the algorithm and reason about what the net overall probability of various outcomes is, because it's doing irreducibly complex things like loading random numbers into Turing machines or cellular automata and running the resulting program for some number of steps to see what output, if any, it gets. (Of course, I may also not know what the overall probability distribution from running such a procedure is, if determining that is very expensive, but then, I'm trying to be unpredictable.) So it's also possible to generate random output that even a vastly-computationally-superior opponent cannot even predict the probability distribution of.

In the counterfactual mugging case, call the party proposing the bet (the one offering $1000 and asking for $100) A, and the other party B. If B simply publicly and irrevocably precommits to paying the $100 (say by posting a bond), their expected gain is $450. If they can find a way to cheat, their maximum potential gain from the gamble is $500. So their optimal strategy is to initially do a (soft) commit to paying the $100, and then, either before the coin is tossed, and/or after that on the heads branch:

  1. Select a means of deciding on a probability $p$ that I will update/renege after the coin lands if it's a heads, and (if the coin has not yet been tossed) optionally a way I could signal that. This means can include using access to true (quantum) randomness, hashing parts of my history selected somehow (including randomly), hashing new observations of the world I made after the coin landed, or anything else I want.
  2. Using << $50 worth of computational resources, run a simulation of party A in the tails branch running a simulation of me, and predict the probability distribution for their estimate of $p$. If the mean of that is lower than $p$ then go ahead and run the means for choosing. Otherwise, try again (return to step 1), or, if the computational resources I've spent are approaching $50 in net value, give up and pay A the $100 if the coin lands (or has already landed) heads.

Meanwhile, on the tails branch, party A is trying to simulate party B running this process, and presumably is unwilling to spend more than some fraction of $1000 in computational resources on doing this. If party B did their calculation before the coin toss and chose to emit a signal (or leaked one), then party A has access to that, but obviously not to anything that only happened on the heads branch after the outcome of the coin toss was visible.

So this turns into a contest of who can more accurately and cost-effectively simulate the other simulating them, recursively. Since B can choose a strategy, including choosing to randomly select obscure features of their past history and make these relevant to the calculation, while A cannot, B would seem to be at a distinct strategic advantage in this contest unless A has access to their entire history.
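The dollar figures in the counterfactual mugging case above can be checked with a few lines of arithmetic. This sketch assumes a fair coin, that on tails A pays $1000 exactly when it predicts B would pay, and that on heads B is asked for $100; the function and parameter names are mine.

```python
def expected_gain(p_predicted_to_pay, p_actually_pays,
                  reward=1000.0, cost=100.0, p_heads=0.5):
    """B's expected gain from the bet.
    Tails: A pays `reward` with probability p_predicted_to_pay.
    Heads: B pays `cost` with probability p_actually_pays."""
    tails_gain = (1 - p_heads) * p_predicted_to_pay * reward
    heads_loss = p_heads * p_actually_pays * cost
    return tails_gain - heads_loss

honest = expected_gain(1.0, 1.0)    # binding precommitment: 450.0
cheating = expected_gain(1.0, 0.0)  # predicted to pay, never pays: 500.0
max_value_of_cheating = cheating - honest  # 50.0
```

The $50 difference is why B should spend far less than $50 of computation on trying to cheat: beyond that, the attempt costs more than it could ever gain.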