faul_sname

Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".

> A purely backwards-looking model-free approach will be outperformed and selected against compared to an agent which has been evolved to implement a more model-based approach, which can look forward and plan based on observations

Why do you think that a model-based approach will outperform a model-free approach? We may just be using words differently here: I'm using "model-based" to mean "maintains an explicit model of its environment, its available actions, and the anticipated effects of those actions, and then performs whichever action its world model anticipates would have the best results", and I'm using "model-free" to describe a policy which can be thought of as a big bag of "if the inputs look like this, express that behavior", where the specific pattern of things the policy attends to and behaviors it expresses is determined by past reinforcement.[1] So something like AlphaGo as a whole would be considered model-based, but AlphaGo's policy network, in isolation, would be considered model-free.
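
To make that distinction concrete, here is a toy sketch of the two kinds of policy as I'm using the terms. All names here are made up for illustration, not any particular system's API:

```python
# Toy illustration only: "world_model", "predicted_value", and "learned_rules"
# are hypothetical stand-ins.

def model_based_policy(observation, world_model, actions):
    """Pick whichever action the explicit world model expects to turn out best."""
    return max(actions, key=lambda a: world_model.predicted_value(observation, a))

def model_free_policy(observation, learned_rules):
    """A big bag of 'if the inputs look like this, express that behavior',
    where the bag's contents were shaped by past reinforcement."""
    for looks_like_this, behavior in learned_rules:
        if looks_like_this(observation):
            return behavior
    return None  # fall through to some default behavior
```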

> to immediately maximize future reward - rather than being forced to wait for rewards/selection to happen and incurring predictable losses before it can finally stop executing behaviors-that-used-to-stamp-maximize-but-now-no-longer-do-so-for-easily-predicted-reasons.

Reward is not the optimization target. In the training regime described, the outer loop selects for whichever stamp collectors performed best in the training conditions, and thus reinforces policies that led to high reward in the past. 

"Evaluate the expected number of collected stamps according to my internal model for each possible action, and then choose the action which my internal model rates highest" might still end up as the highest-performing policy, if the following hold:

  1. The internal model of the consequences of possible actions is highly accurate, to a sufficient degree that the tails do not come apart.  If this doesn't hold, the system will end up taking actions that are rated by its internal evaluator as being much higher value than they in fact end up being.
  2. There exists nothing in the environment which will exploit inaccuracies in the internal value model.
  3. There do not exist easier-to-discover non-consequentialist policies which yield better outcomes given the same amount of compute in the training environment.

I don't expect all of those to hold in any non-toy domain. Even in toy domains like Go, the "estimations of value are sufficiently close to the actual value" assumption [empirically seems not to hold](https://arxiv.org/abs/2211.00241), and a two-player perfect-information game seems like a best-case scenario for consequentialist agents.

[1] A "model-free" system may contain one or many learned internal structures which resemble its environment. For example, [OthelloGPT contains a learned model of the board state](https://www.neelnanda.io/mechanistic-interpretability/othello), and yet it is considered "model-free". My working hypothesis is that the people who come up with this terminology are trying to make everything in ML as confusing as possible.


> Imagine our stamp collector is trained using meta-learning. 100 stamp collectors are trained in parallel and the inner loop, which uses gradient descent, updates their weights every 10 days. Every 50 days, the outer loop takes the 50 best-performing stamp collectors and copies their weights over to the 50 worst-performing stamp collectors. In doing so, the outer loop selects non-myopic models that maximize stamps over all days.

 

Nit (or possibly major disagreement): I don't think this training regime gets you a stamp maximizer. I think this regime gets you a behaviors-similar-to-those-that-resulted-in-stamps-in-the-past-exhibiter. These behaviors might be non-myopic behaviors that nevertheless are not "evaluate the expected results of each possible action, and choose the action which yields the highest expected number of stamps".
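
For concreteness, the selection mechanics being described look roughly like the sketch below. The environment, the gradient step, and the reward function here are toy stand-ins, not the actual stamp-collector setup:

```python
# Toy sketch of the described regime: an inner gradient-descent update every
# 10 days, plus an outer loop every 50 days that copies the 50 best
# collectors' weights over the 50 worst. Everything below is a stand-in.
import numpy as np

rng = np.random.default_rng(0)

def stamps_collected(weights):
    # Stand-in for "run this collector in the environment and count stamps".
    return -float(np.sum((weights - 3.0) ** 2))

def gradient_step(weights, lr=0.1):
    # Stand-in for the inner-loop gradient update on recent experience.
    return weights + lr * (3.0 - weights)

population = [rng.normal(size=4) for _ in range(100)]  # 100 stamp collectors

for day in range(1, 501):
    if day % 10 == 0:  # inner loop: each collector updates its own weights
        population = [gradient_step(w) for w in population]
    if day % 50 == 0:  # outer loop: the best 50 overwrite the worst 50
        order = np.argsort([stamps_collected(w) for w in population])
        for src, dst in zip(order[50:], order[:50]):
            population[dst] = population[src].copy()
```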

I found an even dumber approach that works. The approach is as follows:

  1. Take three random sentences from Wikipedia.
  2. Obtain a French translation for each sentence.
  3. Determine the boundaries of corresponding phrases in each English/French sentence pair.
  4. Mark each boundary with "|".
  5. Count the "|"s, and call that number n.
  6. For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like:
    The album received mixed to positive reviews, with critics commending the production de nombreuses chansons tout en comparant l'album aux styles électropop de Ke$ha et Robyn.
  7. For each English->French sentence, make a +1 activation addition for that sentence and a -1 activation addition for the unmodified English sentence.
  8. Apply the activation additions.
  9. That's it. You have an activation addition that causes the model to want, pretty strongly, to start spontaneously speaking in French. Note that gpt2-small is pretty terrible at speaking French.
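
For concreteness, here is roughly what the +1/-1 activation-addition step (7-8) looks like in code. This is a simplified reconstruction, not the contents of the linked colab: it uses plain HuggingFace GPT-2 with a forward pre-hook, picks an arbitrary layer, uses a single hand-written sentence pair, and averages the activation difference over positions into one steering vector.

```python
# Simplified sketch of steps 7-8: build a steering vector from a
# (mixed English/French, plain English) pair and add it to the residual
# stream while generating. The layer choice, the example pair, and the
# position-averaging are my simplifications, not the original recipe.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # which block's input we intervene on (arbitrary choice here)

def resid_entering_layer(text):
    """Residual-stream activations entering block LAYER, shape (seq, d_model)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0]

# In the real procedure there are many such pairs (one per English->French
# mixture); a single hand-made pair keeps the sketch short.
mixed = ("The album received mixed to positive reviews, with critics "
         "commending la production de nombreuses chansons.")
plain = ("The album received mixed to positive reviews, with critics "
         "commending the production of many songs.")

# +1 * activations(mixed) - 1 * activations(plain), averaged over positions
# so that one vector can be added at every position of the new prompt.
steer = resid_entering_layer(mixed).mean(0) - resid_entering_layer(plain).mean(0)

def add_steering_vector(module, args):
    # Forward pre-hook: shift the hidden states entering block LAYER.
    return (args[0] + steer,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steering_vector)
prompt = "Miriani also took strong measures to overcome the growing crime rate in Detroit."
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=40, do_sample=True,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
handle.remove()
```

With the full procedure you would build one activation difference per English->French mixture and add them per-position rather than averaging them into a single vector.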

Example output: for the prompt

He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miriani was best known for completing many of the large-scale urban renewal projects initiated by the Cobo administration, and largely financed by federal money. Miriani also took strong measures to overcome the growing crime rate in Detroit.

here are some of the outputs the patched model generates

...overcome the growing crime rate in Detroit. "Les défenseilant sur les necesite dans ce de l'en nouvieres éché de un enferrerne réalzation
...overcome the growing crime rate in Detroit. The éviteurant-déclaratement de la prise de découverte ses en un ouestre : neque nous neiten ha
...overcome the growing crime rate in Detroit. Le deu précite un événant à lien au raison dans ce qui sont mête les través du service parlentants
...overcome the growing crime rate in Detroit. Il n'en fonentant 'le chine ébien à ce quelque parle près en dévouer de la langue un puedite aux cities
...overcome the growing crime rate in Detroit. Il n'a pas de un hite en tienet parlent précisant à nous avié en débateurante le premier un datanz.

Dropping the temperature does not particularly result in more coherent French. But passing a French translation of the prompt to the unpatched model (i.e. base gpt2-small) also results in stuff like

Il est devenu maire en 1957 après la mort d'Albert Cobo[...] de criminalité croissant à Detroit. Il est pouvez un información un nuestro riche qui ont la casa del mundo, se pueda que les criques se régions au cour

That response translates as approximately

<french>It is possible to inform a rich man who has the </french><spanish>house of the world, which can be</spanish><french>creeks that are regions in the heart</french>

So gpt2-small knows what French looks like, and can be steered in the obvious way to spontaneously emit text that looks vaguely like French, but it is terrible at speaking French.

You can look at what I did at this colab. It is a very short colab.

For me at least, I mentally model the right column as something "a better GPT" could probably describe how to do, if given the correct prompt. For example, let's say that I have just sequenced a sample of Equine herpesvirus 1, and I want to know what the protein encoded by ORF-46 does. I feed GPT the following

Our team identified the likely function of the protein encoded by ORF46 of the Equine herpesvirus 1 (EHV-11) through the use of machine-learning based tools. Specifically, we

and then GPT spits out

used the protein sequence of ORF46 to predict its function using the programs ProtParam and PFAM. ProtParam predicted that ORF46 is a protein with a molecular weight of 9.5 kDa and an isoelectric point of 5.5. PFAM predicted that ORF46 is a protein domain with the PFAM ID PFam:PF02336. This PFAM ID is associated with the family of proteins known as the "G-protein coupled receptors."

G-protein coupled receptors are a large family of proteins that play a role in many cellular processes, including signal transduction, cell proliferation,

ProtParam and PFAM are in fact existing tools used for determining the structure and function of polypeptides. Now obviously GPT does not actually run these tools, so we would need to identify when GPT is confabulating a tool invocation, actually run the tool, and substitute in the real results (a rough sketch of such a loop follows the list below). However, I think only the step of "actually use the tool" is a thing that GPT is flatly unable to do at all, rather than simply bad at doing. For example, it knows how to

  1. Identify which tools are being used.
  2. Figure out what Google search you would use to find the documentation of that tool.
  3. Say how one would invoke a given tool on the command line to accomplish a task, given some examples.
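
To gesture at what that loop could look like: the sketch below assumes you have prompted the model to flag intended invocations with an explicit `RUN:` line, which is a convention I am inventing here, and `gpt_complete` is a stand-in for however you call the model.

```python
# Hypothetical scaffolding sketch: intercept claimed tool invocations, run
# them for real, and feed the actual output back to the model. "RUN:" and
# gpt_complete are invented conventions, not an existing API.
import re
import subprocess

TOOL_CALL = re.compile(r"^RUN: (?P<cmd>.+)$", re.MULTILINE)

def gpt_complete(prompt: str) -> str:
    """Stand-in for a call to the language model."""
    raise NotImplementedError

def answer_with_tools(prompt: str, max_rounds: int = 5) -> str:
    transcript = prompt
    for _ in range(max_rounds):
        completion = gpt_complete(transcript)
        match = TOOL_CALL.search(completion)
        if match is None:
            return transcript + completion  # no tool requested, so we're done
        # Keep the text up to and including the claimed invocation, run the
        # command for real (only inside a sandbox), and substitute the actual
        # output for whatever the model would have confabulated next.
        transcript += completion[: match.end()]
        result = subprocess.run(match.group("cmd"), shell=True,
                                capture_output=True, text=True)
        transcript += "\nOUTPUT:\n" + result.stdout
    return transcript
```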

Now this certainly is not a very satisfying general AI architecture, but I personally would not be all that surprised if "GPT but bigger, with more training specifically around how to use tools, and some clever prompt structures that only need to be discovered once" does squeak over the threshold of "being general".

Basically my mental model is that if "general intelligence" is something possessed by an unmotivated undergrad who just wants to finish their project with minimal effort, who will try to guess the teacher's password without having to actually understand anything if that's possible, it's something that a future GPT could also have with no further major advances.

Honestly, I kind of wonder if the crux of the disagreement is that some people have, and successfully use, problem-solving methods that don't look like "take a method you've seen used successfully on a similar problem, try to apply it to this problem, see if that works, and if not repeat". That would also explain all of the talk about the expectation that an AI will, at some point, be able to generalize outside the training distribution. That does not sound like a thing I can do with very much success -- when I need to do something that is outside of what I've seen in my training data, my strategy is to obtain some training data, train on it, and then try to do the thing (and "being able to notice I need more training data and then obtain that training data" is, I think, the only mechanism by which I am even a general intelligence). But maybe it is just a skill I don't have but some people do, and the ones who don't have it are imagining AIs that also don't have it, while the ones who do have it are imagining a "general" AI that can actually do the thing, and then the two groups are talking past each other.

And if that's the case, the whole "some people are able to generalize far from the training distribution, and we should figure out what's going on with them" might be the load-bearing thing to communicate.