New Comment
7 comments, sorted by Click to highlight new comments since: Today at 10:29 PM

This is a list of random, assorted AI safety ideas that I think somebody should try to write up and/or work on at some point. I have a lot more than this in my backlog, but these are some that I specifically selected to be relatively small, single-post-sized ideas that an independent person could plausibly work on without much oversight. That being said, I think it would be quite hard to do a good job on any of these without at least chatting with me first—though feel free to message me if you’d be interested.

  • What would be necessary to build a good auditing game benchmark?
  • How would AI safety AI work? What is necessary for it to go well?
  • How do we avoid end-to-end training while staying competitive with it? Can we use transparency on end-to-end models to identify useful modules to train non-end-to-end?
  • What would it look like to do interpretability on end-to-end trained probabilistic models instead of end-to-end trained neural networks?
  • Suppose you had a language model that you knew was in fact a good generative model of the world and that this property continued to hold regardless of what you conditioned it on. Furthermore, suppose you had some prompt that described some agent for the language model to simulate (Alice) that in practice resulted in aligned-looking outputs. Is there a way we could use different conditionals to get at whether or not Alice was deceptive (e.g. prompt the model with “DeepMind develops perfect transparency tools and provides an opportunity for deceptive models to come clean and receive a prize before they’re discovered.”).
  • Argue for the importance of ensuring that the state-of-the-art in “using AI for alignment” never lags behind as a capability compared to where it could be given just additional engineering effort.
  • What does inner alignment look like in the context of models with access to memory (e.g. a retrieval database)?
  • Argue for doing scaling laws for phase changes. We have found some phase changes in models—e.g. the induction bump—but we haven’t yet really studied the extent to which various properties—e.g. Honesty—generalize across these sorts of phase changes.
  • Humans rewarding themselves for finishing their homework by eating candy suggests a plausible mechanism for gradient hacking.
  • If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.
  • The more model’s objectives vary depending on exact setup, randomness, etc., the less likely deceptive models are to want to cooperate with future deceptive models, thus making defection earlier more likely.
  • China is not a strategically relevant actor for AI, at least in short timeline scenarios—they are too far behind, their GDP isn’t growing fast enough, and their leaders aren’t very good at changing those things.
  • If you actually got a language model that was a true generative model of the world that you could get arbitrary conditionals from, that would be equivalent to having access to a quantum suicide machine.
  • Introduce the concept of how factored an alignment solution is in terms of how easy it is to turn up or down alignment relative to capabilities—or just swap out an aligned goal for a misaligned one—as an important axis to pay attention to. Currently, things are very factored—alignment and capabilities are both heavily dependent on dataset, reward, etc.—but that could change in the future.
  • Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.
  • How has transparency changed over time—Chris claims it's easier to interpret later models; is that true?
  • Which AI safety proposals are most likely to fail safely? Proposals which have the property that the most likely way for them to fail is just not to work are better than those that are most likely to fail catastrophically. In the former case, we’ve sacrificed some of our alignment tax, but still have another shot.
  • What are some plausible scenarios for how a model might be suboptimality deceptively aligned?
  • What can we learn about AI safety from the domestication of animals? Does the success of domesticating dogs from wolves provide an example of how to train for corrigibility? Or did we just make them dumber via the introduction of something like William’s syndrome?

I'll continue to include more directions like this in the comments here.

  • Other search-like algorithms like inference on a Bayes net that also do a good job in diverse environments also have the problem that their capabilities generalize faster than their objectives—the fundamental reason being that the regularity that they are compressing is a regularity only in capabilities.
  • Neural networks, by virtue of running in constant time, bring algorithmic equality all the way from uncomputable to EXPTIME—not a large difference in practice.
  • One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don't know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.
  • Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won't be), you get at most one extra bit of optimization towards alignment.
  • Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.
  • A deceptive model doesn't have to have some sort of very explicit check for whether it's in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it's in the jungle instead of a factory. If it someday found itself in a very different situation than currently (training), it would reconsider its actions, but it doesn't really think about it very often because during training it just looks too unlikely.
  • Argue that wireheading, unlike many other reward gaming or reward tampering problems, is unlikely in practice because the model would have to learn to value the actual transistors storing the reward, which seems exceedingly unlikely in any natural environment.

Humans don't wirehead because reward reinforces the thoughts which the brain's credit assignment algorithm deems responsible for producing that reward. Reward is not, in practice, that-which-is-maximized -- reward is the antecedent-thought-reinforcer, it reinforces that which produced it. And when a person does a rewarding activity, like licking lollipops, they are thinking thoughts about reality (like "there's a lollipop in front of me" and "I'm picking it up"), and so these are the thoughts which get reinforced. This is why many human values are about latent reality and not about the human's beliefs about reality or about the activation of the reward system.

It seems that you're postulating that the human brain's credit assignment algorithm is so bad that it can't tell what high-level goals generated a particular action and so would give credit just to thoughts directly related to the current action. That seems plausible for humans, but my guess would be against for advanced AI systems.

No, I don't intend to postulate that. Can you tell me a mechanistic story of how better credit assignment would go, in your worldview?