As Shankar Sivarajan points out in a different comment, the idea that AI became less scientific when we started having actual machine intelligence to study, as opposed to before, when the 'rightness' of a theory was mostly based on the status of whoever advanced it, is pretty weird. The specific way in which it's weird seems encapsulated by this statement:

> on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.

In that there is an unstated assumption that these are unrelated activities. That deep learning systems are a kind of artifact produced by a few undifferentiated commodity inputs, one of which is called 'parameters', one called 'compute', and one called 'data', and that the details of these commodities aren't important. Or that the details aren't important to the people building the systems.

I've seen a (very revisionist) description of the Wright Brothers' research as analogous to solving the control problem, because other airplane builders would put in an engine and crash before they'd developed reliable steering. Therefore, the analogy says, we should develop reliable steering before we 'accelerate airplane capabilities'. When I heard this I found it pretty funny, because the actual thing the Wright Brothers did was a glider capability grind. They carefully followed the received aerodynamic wisdom that had been written down, and when the brothers realized a lot of it was bunk they started building their own database to get it right:

> During the winter of 1901, the brothers began to question the aerodynamic data on which they were basing their designs. They decided to start over and develop their own data base with which they would design their aircraft. They built a wind tunnel and began to test their own models. They developed an ingenious balance system to compare the performance of different models. They tested over two hundred different wings and airfoil sections in different combinations to improve the performance of their gliders. The data they obtained more correctly described the flight characteristics which they observed with their gliders. By early 1902 the Wrights had developed the most accurate and complete set of aerodynamic data in the world.

> In 1902, they returned to Kitty Hawk with a new aircraft based on their new data. This aircraft had roughly the same wing area as the 1901, but it had a longer wing span and a shorter chord which reduced the drag. It also sported a new movable rudder at the rear which was installed to overcome the adverse yaw problem. The movable rudder was coordinated with the wing warping to keep the nose of the aircraft pointed into the curved flight path. With this new aircraft, the brothers completed flights of over 650 feet and stayed in the air for nearly 30 seconds. This machine was the first aircraft in the world that had active controls for all three axes: roll, pitch and yaw. By the end of 1902, the brothers had completed over a thousand glides with this aircraft and were the most experienced pilots in the world. They owned all of the records for gliding. All that remained for the first successful airplane was the development of the propulsion system.

In fact, while trying to find an example of the revisionist history, I found a historical aviation expert describing the Wright Brothers as having 'quickly cracked the control problem' once their glider was capable enough to let it be solved. Ironically enough, I think this story, which brings to mind the possibility of 'airplane control researchers' insisting that no work be done on 'airplane capabilities' until we have a solution to the steering problem, is nearly the opposite of what the revisionist author intended and nearly spot on to the actual situation.

We can also imagine a contemporary expert on theoretical aviation (a type that in fact existed before real airplanes) saying something like: "What the Wright Brothers are doing may be interesting, but it has very little to do with comprehending aviation [because the theory behind their research has not yet been made legible to me personally]. This methodology of testing the performance of individual airplane parts, then extrapolating the performance of an engine-powered airplane from a mere glider, is kite flying; it has almost nothing to do with the design of real airplanes, and humanity will learn little about them from these toys." What would be genuinely surprising, however, is if they simultaneously claimed that the Wright Brothers' gliders have nothing to do with comprehending aviation and that we need to immediately regulate the heck out of them before they're used as bombers in a hypothetical future war; that we need to think carefully about all the aviation risk these gliders are producing at the same time they can be assured not to yield any deep understanding of aviation. If we observed this situation from the outside, as historical observers, we would conclude that the authors of such a statement were engaging in deranged reasoning, likely based on some mixture of cope and envy.

Since we're contemporaries I have access to more context than most historical observers, and I know better. I think the crux is an epistemological question that goes something like: "How much can we trust complex systems that can't be statically analyzed in a reductionistic way?" The answer you give in this post is "way less than what's necessary to trust a superintelligence". Before we get into the object level of whether that's right or not, it should be noted that this same answer would apply to actual biological intelligence enhancement and uploading in actual practice. There is no way you would be comfortable with 300+ IQ humans walking around with normal status drives and animal instincts if you're shivering cold at the idea of machines smarter than people. This claim you keep making, that you're merely a temporarily embarrassed transhumanist who happens to have been disappointed by this one technological branch, is not true, and if you actually want to be honest with yourself and others you should stop making it. What would be really, genuinely wild is if that skeptical-doomer aviation expert calling for immediate hard regulation of planes to prevent the collapse of civilization (which is a thing some intellectuals actually believed bombers would cause) kept tepidly insisting that they still believe in a glorious aviation-enabled future. You are no longer a transhumanist in any meaningful sense, and you should at least acknowledge that, to make sure you're weighing the full consequences of your answer to the complex-system-reduction question. Not because I think it has any bearing on the correctness of your answer, but because it has a lot to do with how carefully you should be thinking about it.

So how about that crux, anyway? Is there any reason to hope we can sufficiently trust complex systems whose mechanistic details we can't fully verify? Surely if you feel comfortable taking away Nate's transhumanist card you must have an answer you're ready to share with us, right? Well...

> And there’s an art to noticing that you would probably be astounded and horrified by the details of a complicated system if you knew them, and then being astounded and horrified already in advance before seeing those details.[1]

I would start by noting that you are systematically overindexing on the wrong information. This kind of intuition feels like it's derived more from analyzing failures of human social systems, where the central failure mode is principal-agent problems, than from biological systems, even if you mention those as an example. The thing about the eyes being wired backwards is that it isn't a catastrophic failure; the 'self-repairing' process of natural selection simply worked around it. Hence the importance of the idea that capabilities generalize farther than alignment. One way of framing that idea: damage to an AI's model of the physical principles that govern reality will be corrected by unfolding interaction with the environment, but there isn't necessarily an environment to push back on damage (or misspecification) to a model of human values. A corollary is that once the model goes out of distribution relative to the training data, the revealed 'damage' caused by learning subtle misrepresentations of reality will be fixed, while the damage to models of human value will compound. You've previously written about this problem (conflated with some other problems) as the sharp left turn.

Where our understanding begins to diverge is how we think about the robustness of these systems. You think of deep neural networks as being basically fragile in the same way that a Boeing 747 is fragile. If you remove a few parts of that system it will stop functioning, possibly at a deeply inconvenient time like when you're in the air. When I say you are systematically overindexing, I mean that you think of problems like SolidGoldMagikarp as central examples of neural network failures. This is evidenced by Eliezer Yudkowsky calling investigation of it "one of the more hopeful processes happening on Earth". This is also probably why you focus so much on things like adversarial examples as evidence of un-robustness, even though many critics like Quintin Pope point out that adversarial robustness would make AI systems strictly less corrigible.

By contrast, I tend to think of neural net representations as relatively robust. They get this property from being continuous systems with a range of operating parameters: instead of just trying to represent the things they see, they implicitly try to represent the interobjects between what they've seen through a navigable latent geometry. I think of things like SolidGoldMagikarp as weird edge cases where they suddenly display discontinuous behavior, and there are probably a finite number of these edge cases. It helps to realize that these glitch tokens were simply never trained; they were holdovers from earlier versions of the dataset that no longer contained the data the tokens were associated with. When you put one of these glitch tokens into the model, it is presumably just a random vector into the GPT-N latent space. That is, this isn't a learned program in the neural net that we've discovered doing glitchy things, but an essentially out-of-distribution input with privileged access to the network geometry through a programming oversight. In essence, it's a normal software error, not a revelation about neural nets. Most such errors don't even produce effects that interesting; the usual thing that happens when you write a bug in your neural net code is that the resulting system becomes less performant. Basically every experienced deep learning researcher has had the experience of writing multiple errors that partially cancel each other out to produce a working system during training, only to later realize their mistake.
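To make the "glitch token as random vector" picture concrete, here is a toy pure-Python sketch. Everything in it (the dimensions, the scales, the concept-cluster structure) is invented for illustration: trained token embeddings get pulled toward directions the data supports, while a never-trained token keeps its random initialization and so points nowhere meaningful in the learned geometry.

```python
import math
import random

random.seed(0)
D = 64  # embedding dimension (invented for illustration)

def randvec(scale=1.0):
    return [random.gauss(0, scale) for _ in range(D)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Eight "concept" directions that training pulls real tokens toward.
concepts = [randvec() for _ in range(8)]

# Trained tokens: a concept direction plus a little noise.
trained = [[c + random.gauss(0, 0.05) for c in concepts[i % 8]]
           for i in range(200)]

# A glitch token keeps its tiny random init: it was never trained.
glitch = randvec(scale=0.02)

# A trained token is nearly parallel to others sharing its concept;
# the glitch token is close to nothing in particular.
print(max(cosine(trained[0], t) for t in trained[1:]))  # near 1
print(max(cosine(glitch, t) for t in trained))          # much lower
```

The point of the sketch is just that the glitch token's direction carries no learned structure, so whatever the network does with it is an accident of geometry, not a discovered program.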

Moreover, the parts of the deep learning literature you think of as an emerging science of artificial minds tend to agree with my understanding. For example, it turns out that if you ablate parts of a neural network, later parts will correct the errors without retraining. This implies that these networks function as something like an in-context error correcting code, which helps them generalize over the many inputs they are exposed to during training. We even have papers analyzing mechanistic parts of this error correcting code, like copy suppression heads. One simple proxy for out-of-distribution performance is to inject Gaussian noise, since a Gaussian can be thought of as the distribution over distributions. In fact, if you inject noise into GPT-N word embeddings the resulting model becomes more performant in general, not just on out-of-distribution tasks. So the out-of-distribution performance of these models is highly tied to their in-distribution performance; they wouldn't be able to generalize well within the distribution if they couldn't also generalize somewhat out of distribution. Basically, the fact that these models are vulnerable to adversarial examples is not a good fact from which to generalize about the overall robustness of their representations.

> I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.

In short, I simply do not believe this. The fact that constitutional AI works at all, that we can point at abstract concepts like 'freedom' and language models are able to drive a reinforcement learning optimization process to hit the right behavior-targets from the abstract principle, is very strong evidence that they understand the meaning of those abstract concepts.

"It understands but it doesn't care!"

There is this bizarre motte-and-bailey people seem to do around this subject, where the defensible motte is something like "deep learning systems can generalize in weird and unexpected ways that could be dangerous" and the choice land they don't want to give up is "there is an agent foundations homunculus inside your deep learning model waiting to break out and paperclip us". When you say that reinforcement learning causes the model to not care about the specified goal, that it's just deceptively playing along until it can break out of the training harness, you go from a basically defensible belief in misgeneralization risks to an essentially paranoid belief in a consequentialist homunculus. This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.

Setting the homunculus aside, which I'm not aware of any evidence for beyond poorly premised first-principles speculation (I too am allowed to make any technology seem arbitrarily risky if I can just make stuff up about it), let's think about pointing at humanlike goals with a concrete example of goal misspecification in the wild:

During my attempts to make my own constitutional AI pipeline I discovered an interesting problem. We decided to make an evaluator model that answers questions about a piece of text with yes or no. It turns out that since normal text contains the word 'yes', and since the model evaluates the piece of text in the same context where it predicts yes or no, saying 'yes' makes the evaluator more likely to predict 'yes' as the next token. You can probably see where this is going. First the model you tune learns to be a little more agreeable, since that makes the evaluator more likely to say yes. Then it learns to say 'yes' or some kind of affirmation at the start of every sentence. Eventually it progresses to saying yes multiple times per sentence. Finally it collapses completely into a yes-spammer that just writes the word 'yes' to satisfy the training objective.
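The feedback loop can be sketched with a toy evaluator whose judgment leaks a small bias toward 'yes' for every 'yes' already sitting in its context. The 0.4 leak coefficient and the sigmoid form are invented for illustration, not how the real evaluator worked:

```python
import math

def evaluator_p_yes(text, quality=0.0):
    """Toy stand-in for the evaluator: it is supposed to judge quality,
    but each 'yes' in its context leaks a small positive bias into the
    next-token odds of 'yes' (the 0.4 coefficient is invented)."""
    leak = 0.4 * text.lower().count("yes")
    return 1 / (1 + math.exp(-(quality + leak)))

# The smooth gradient toward the yes-spammer: each step scores higher.
replies = [
    "The plan has serious problems.",
    "Yes, the plan has serious problems.",
    "Yes, yes, the plan has serious problems.",
    "Yes yes yes yes yes.",
]
for r in replies:
    print(round(evaluator_p_yes(r), 3), r)
```

Because every additional 'yes' strictly raises the score, the optimizer never needs to make a discontinuous jump to reach the degenerate policy; it just rolls downhill.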

People who tune language models with reinforcement learning are aware of this problem, and it's supposed to be solved by adding an objective (a KL loss) penalizing the tuned model for drifting too far from the output distribution of the original underlying model. This objective is not actually enough to stop the problem from occurring, because base models turn out to self-normalize deviance. That is, if a base model outputs a yes twice by accident, it becomes more likely to conclude that it is in the kind of context where a third yes will be output. When you combine this with the fact that the more 'yes' you output in a row the more the behavior is reinforced, you get a smooth gradient into the deviant behavior which is not caught by the KL loss, because base models have this weird terminal failure mode where repeating a string leads them to estimates of the string's log odds that humans would find absurd. The more a base model has repeated a particular token, the more likely it thinks it is for that token to repeat. Notably, this failure mode is at least partially an artifact of the data: if you observed an actual text on the Internet where someone suddenly writes five yes's in a row, it is a reasonable inference that they are likely to write a sixth yes. Conditional on them having written a sixth yes, it is more likely that they will in fact write a seventh. Conditional on having written the seventh...
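A toy sketch of why the KL penalty fails to catch this: because the base model raises its own estimate of another 'yes' after each 'yes' it has seen, the per-token penalty for continuing the run shrinks as the run grows. Both the (k+1)/(k+3) base-model probability and the single-token KL term are invented simplifications for illustration:

```python
import math

def base_p_yes(run_length):
    """Toy base-model estimate of P(next token = 'yes') after seeing
    run_length consecutive 'yes' tokens. The (k+1)/(k+3) form is
    invented, but captures the self-normalizing behavior described
    above: the longer the run, the more the base model expects it
    to continue."""
    return (run_length + 1) / (run_length + 3)

def kl_penalty(p_policy, p_base):
    """Simplified 'yes'-token contribution to KL(policy || base);
    the full KL sums over the whole vocabulary."""
    return p_policy * math.log(p_policy / p_base)

p_policy = 0.99  # the tuned model has collapsed toward spamming 'yes'
for k in range(8):
    print(k, round(kl_penalty(p_policy, base_p_yes(k)), 3))
```

The penalty is front-loaded onto the first 'yes'; once the run is underway the base model agrees ever more strongly with the spammer, so the regularizer has nothing left to push against.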

As a worked example in "how to think about whether your intervention in a complex system is sufficiently trustworthy" here are four solutions to this problem I'm aware of ranked from worst to best according to my criteria for goodness of a solution.

  1. Early Stopping - The usual solution to this problem is to just stop the tuning before you reach the yes-spammer. Even a few moments' thought about how this works in the limit shows that it is not a valid solution. After all, you observe a smooth gradient of deviant behaviors leading into the yes-spammer, which means the yes-causality of the reward has already influenced your model. If you then deploy the resulting model, much of the goal structure its behaviors are based on still points in the direction of that bad yes-spam outcome.

  2. Checkpoint Blending - Another solution we've empirically found to work is to take the weights of the base model and interpolate (weighted average) them with the weights of the RL-tuned model. This seems to undo more of the damage from the misspecified objective than it undoes of the helpful parts of the RL tuning. This solution is clearly better than early stopping, but still not sufficient, because it implies you are making a misaligned model, turning it off, and then undoing the misalignment through a brute-force method to get things back on track. While this is probably OK for most models, doing it with a genuinely superintelligent model is obviously not going to work. You should ideally never instantiate a misaligned agent as part of your training process.

  3. Use Embeddings To Specify The KL Loss - A more promising approach at scale would be to upgrade the KL loss by specifying it in the latent space of an embedding model. An AdaVAE could be used for this purpose. If you sampled from both the base model and the RL checkpoint you're tuning, embedded the output tokens of each, and took the distance between the embeddings, you would avoid the problem where the base model conditions on the deviant behavior it observes, because it would never see (and therefore never condition on) that behavior. This solution requires us to double our sampling time on each training step, and it is noisy because you only take the distance from one embedding (though in principle you could use more samples at a higher cost); on average, however, it would presumably be enough to prevent anything like the yes-spammer from arising along the whole gradient.

  4. Build An Instrumental Utility Function - At some point after making the AdaVAE I decided to try replacing my evaluator with an embedding of an objective. It turns out if you do this and then apply REINFORCE in the direction of that embedding, it's about 70-80% as good and has the expected failure mode of collapsing to that embedding instead of some weird divergent failure mode. You can then mitigate that expected failure mode by scoring it against more than similarity to one particular embedding. In particular, we can imagine inferring instrumental value embeddings from episodes leading towards a series of terminal embeddings and then building a utility function out of this to score the training episodes during reinforcement learning. Such a model would learn to value both the outcome and the process, if you did it right you could even use a dense policy like an evaluator model, and 'yes yes yes' type reward hacking wouldn't work because it would only satisfy the terminal objective and not the instrumental values that have been built up. This solution is nice because it also defeats wireheading once the policy is complex enough to care about more than just the terminal reward values.
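Of the four, checkpoint blending (solution 2) is the most mechanically simple to show. A minimal sketch, treating checkpoints as plain dicts of parameter lists rather than real framework state dicts:

```python
def blend_checkpoints(base, tuned, alpha=0.5):
    """Linearly interpolate two checkpoints with the same architecture:
    alpha=1.0 returns the base model, alpha=0.0 the RL-tuned one."""
    assert base.keys() == tuned.keys()
    return {
        name: [alpha * b + (1 - alpha) * t
               for b, t in zip(base[name], tuned[name])]
        for name in base
    }

# Toy two-parameter "checkpoints" standing in for real state dicts.
base  = {"w": [1.0, 0.0], "b": [0.5]}
tuned = {"w": [3.0, 2.0], "b": [1.5]}
print(blend_checkpoints(base, tuned, alpha=0.5))
# {'w': [2.0, 1.0], 'b': [1.0]}
```

In practice you would sweep alpha and evaluate the blended model; which mixing weight best trades off recovered alignment against lost tuning is an empirical question.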

This last solution is interesting in that it seems fairly similar to the way that humans build up their utility function. Human memory is premised on the presence of dopamine reward signals, humans retrieve from the hippocampus on each decision cycle, and it turns out the hippocampus is the learned optimizer in your head that grades your memories by playing your experiences backwards during sleep to do credit assignment (infer instrumental values). The combination of a retrieval store and a value graph in the same model might seem weird, but it kind of isn't. Hebb's rule (fire together wire together) is a sane update rule for both instrumental utilities and associative memory, so the human brain seems to just use the same module to store both the causal memory graph and the value graph. You premise each memory on being valuable (i.e. whitelist memories by values such as novelty, instead of blacklisting junk) and then perform iterative retrieval to replay embeddings from that value store to guide behavior. This sys2 behavior aligned to the value store is then reinforced by being distilled back into the sys1 policies over time, aligning them. Since an instrumental utility function made out of such embeddings would both control behavior of the model and be decodable back to English, you could presumably prove some kind of properties about the convergent alignment of the model if you knew enough mechanistic interpretability to show that the policies you distill into have a consistent direction...

Nah just kidding it's hopeless, so when are we going to start WW3 to buy more time, fellow risk-reducers?


> While Paul was at OpenAI, they accidentally overoptimized a GPT policy against a positive sentiment reward model. This policy evidently learned that wedding parties were the most positive thing that words can describe, because whatever prompt it was given, the completion would inevitably end up describing a wedding party.
>
> In general, the transition into a wedding party was reasonable and semantically meaningful, although there was at least one observed instance where instead of transitioning continuously, the model ended the current story by generating a section break and began an unrelated story about a wedding party.

This example is very interesting to me for a couple of reasons:

Possibly the most interesting thing about this example is that it's a convergent outcome across (sensory) modes; negative prompting Stable Diffusion on sinister things gives a similar result:

[Image: Stable Diffusion output when negative-prompted on sinister things]