I agree with the general point here but I think there's an important consideration that makes the application to RL algorithms less clear: wireheading is an artifact of embeddedness, and most RL work is in the non-embedded setting. Thus, it seems plausible that the development of better RL algorithms does in fact lead to the development of algorithms that would, if they were deployed in an embedded setting, wirehead.

I think of mesaoptimization as primarily being concerning because it would mean models (selected using amortized optimization) doing their own direct optimization, and the extent to which the model is itself doing its own "direct" optimization vs just being "amortized" is what I would call the optimizer-controller spectrum (see this post also).

Also, it seems kind of inaccurate to declare that (non-RL) ML systems are fundamentally amortized optimization, then say things like "more computation and better algorithms should improve safety and the primary risk comes from misgeneralization" and "amortized approaches necessarily have poor sample efficiency asymptotically", and only mention mesaoptimizers in a postscript.

In my ontology, this corresponds to saying "current ML systems are very controller-y." But the thing I'm worried about is that at some point we're going to figure out how to build models which are in fact more optimizer-y, for the same reasons people are trying to build AGI in the first place (though I do think there is a non-trivial chance that controller-y systems are in fact good enough to help us solve alignment, this is not something I'd bet on as a default!).

Relatedly, it doesn't seem like the fact that amortized optimizers are "just" modelling a distribution makes them inherently more benign. Everything can be phrased as distribution modelling! I think the core confusion here might be conflating the generalization properties of current ML architectures/methods with the type signature of ML. Moving the issue to data quality also doesn't fix the problem at all; everything is "just" a data problem, too (if only we had the dataset indicating exactly how the superintelligent AGI should solve alignment, then we could simply behavior-clone that dataset).
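To make the amortized-versus-direct distinction in this comment concrete, here is a minimal sketch under loud assumptions: the quadratic objective, the linear "model", and every name below are hypothetical and chosen only for illustration. The direct optimizer searches at query time; the amortized one fits a function to the outputs of past optimization runs and does no search when queried.

```python
def objective_grad(x, c):
    # Gradient in x of a toy objective f(x, c) = -(x - (2c + 1))^2 (hypothetical).
    return -2.0 * (x - (2.0 * c + 1.0))

def direct_optimize(c, steps=200, lr=0.1):
    # Direct optimization: run gradient ascent from scratch for each query c.
    x = 0.0
    for _ in range(steps):
        x += lr * objective_grad(x, c)
    return x

def fit_amortized(train_cs):
    # Amortized optimization: fit a line (least squares) mapping c to the
    # solutions found by past optimization runs.
    xs = [direct_optimize(c) for c in train_cs]
    mean_c = sum(train_cs) / len(train_cs)
    mean_x = sum(xs) / len(xs)
    slope = (sum((c - mean_c) * (x - mean_x) for c, x in zip(train_cs, xs))
             / sum((c - mean_c) ** 2 for c in train_cs))
    intercept = mean_x - slope * mean_c
    return lambda c: slope * c + intercept

amortized = fit_amortized([0.0, 1.0, 2.0, 3.0])
direct_answer = direct_optimize(10.0)   # searches at query time
amortized_answer = amortized(10.0)      # a single function evaluation
```

The two agree here only because the toy solution map happens to be linear, so the fitted function generalizes perfectly outside its training range. Nothing in this sketch speaks to how real models generalize, which is the actual point of contention.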

I expect that the key externalities will be borne by society. The main reason for this is that I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one):

We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. Everyone says "yes, AI safety is Very Important". Someone notices that when you punish the AI for exhibiting bad behaviour with RLHF or something, the AI stops exhibiting bad behaviour (because it's pretending to be aligned). Some people complain that this doesn't actually make it aligned, but they're ignored or given a token mention. A bunch of regulations are passed to enforce that everyone uses RLHF to align their models. People notice that alignment failures decrease across the board. The models don't have to somehow magically all coordinate to not accidentally reveal deception, because even in cases where models fail in dangerous ways people chalk this up to the techniques not being perfect, but they're being iterated on, etc. Heck, humans commit fraud all the time and yet it doesn't cause people to suddenly stop trusting everyone they know when a high-profile fraud case is exposed. And locally there's always the incentive to just make the accounting fraud go away by applying Well Known Technique rather than really digging deep and figuring out why it's happening. Also, a lot of people will have a vested interest in not having the general public think that AI might be deceptive, and so will try to discredit the idea as being fringe. Over time, AI systems control more and more of the economy. At some point they will control enough of the economy to cause catastrophic damage, and a treacherous turn happens.

At every point through this story, the local incentive for most businesses is to do whatever it takes to make the AI stop committing accounting fraud or whatever, not to try and stave off a hypothetical long term catastrophe. A real life example that this is analogous to is antibiotic overuse.

This story does hinge on "sweeping under the rug" being easier than actually properly solving alignment, but if deceptive alignment is a thing and is even moderately hard to solve properly then this seems very likely the case.

I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people

I predict that for most operationalizations of "actually hurt people", the result is that the right problems will not be paid attention to. And I don't expect lightning-fast takeoff to be necessary. Again, in the case of climate change, which has very slow "takeoff", millions of people are directly impacted, and yet governments and major corporations move very slowly and mostly just say things about climate change mitigation being Very Important while making token paper-straw efforts. Deceptive alignment means that there is a very attractive easy option that makes the immediate crisis go away for a while.

But even setting aside the question of whether we should even expect to see warning signs, and whether deceptive alignment is a thing, I find it plausible that even the response to a warning sign that is as blatantly obvious as possible (an AI system tries to take over the world, fails, kills a bunch of people in the process) just results in front page headlines for a few days, some token statements, a bunch of political squabbling between people using the issue as a proxy fight for the broader "tech good or bad" narrative and a postmortem that results in patching the specific things that went wrong without trying to solve the underlying problem. (If even that; we're still doing gain of function research on coronaviruses!)

A small group of researchers raise alarm that this is going on, but society at large doesn't listen to them because everything seems to be going so well.

Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like "well yes but this is just in a toy environment, and it's a big leap to it taking over the world", but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem ("chess was never actually hard!"), in my model it seems entirely plausible that every time we observe some alignment failure it updates a few people but most people remain un-updated. I predict that, of the large set of things currently claimed to be what will cause people to take alignment seriously, most will either be ignored by most people once they happen, or never happen before catastrophic failure.

We can also see analogous dynamics in e.g. climate change, where even given decades of hard numbers and tangible physical phenomena, large numbers of people (and importantly, major polluters) still reject its existence, many interventions are undertaken which only serve as lip service (greenwashing), and all of this would be worse if renewables were still economically uncompetitive.

I expect the alignment situation to be strictly worse because a) I expect the most egregious failures to only come shortly before AGI, so once we have evidence as robust as that for climate change (i.e. literally catching AIs red-handed trying and almost succeeding at taking over the world), I estimate we have anywhere between a few years and negative years left; b) the space of ineffectual alignment interventions is far larger and harder to distinguish from real solutions to the underlying problem; c) in particular, training away failures in ways that don't solve the underlying problems (i.e. incentivizing deception) is an extremely attractive option, there does not exist any solution to this technical problem, and just observing the visible problems disappear is insufficient to distinguish whether the underlying problems are solved; d) 80% of the tech for solving climate change basically already exists or is within reach, society basically just has to decide that it cares, and the cost to society is legible, whereas for alignment we have no idea how to solve the technical problem, or even vaguely what that solution will look like, which makes it a harder sell to society; e) the economic value of AGI vastly outweighs the value of fossil fuels, making the vested interest substantially larger; f) especially due to deceptive alignment, I expect actually-aligned systems to be strictly more expensive than unaligned systems; the cost will be more than just a fixed % more money, but also cost in terms of additional difficulty and uncertainty, time-to-market disadvantage, etc.

There's an example in the appendix but we didn't do a lot of qualitative analysis.

It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.

(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)
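As a toy illustration of the two points above, here is a minimal sketch under stated assumptions: the three-word unigram "language model" (with no end-of-sequence token), the `quality` function, and all names are hypothetical stand-ins, and fixing the length at 3 is a simple substitute for length normalization. It enumerates every short sequence and compares the max-likelihood sample with the max-quality one.

```python
import math
from itertools import product

# Hypothetical toy unigram model: token -> probability (assumption, not real data).
VOCAB = {"the": 0.5, "cat": 0.3, "sat": 0.2}

def log_likelihood(seq):
    # Log-probability of a sequence under the unigram model.
    return sum(math.log(VOCAB[tok]) for tok in seq)

def quality(seq):
    # Hypothetical stand-in for an essay-quality function: number of distinct tokens.
    return len(set(seq))

def all_sequences(max_len):
    # Every sequence of length 1..max_len over the vocabulary.
    for n in range(1, max_len + 1):
        yield from product(VOCAB, repeat=n)

candidates = list(all_sequences(3))

# Highest-likelihood sample with no length constraint: a single word.
ml_any_length = max(candidates, key=log_likelihood)
# Highest-likelihood sample at a fixed length: the same word repeated.
ml_fixed_length = max((s for s in candidates if len(s) == 3), key=log_likelihood)
# Highest-quality sample: looks nothing like either of the above.
best_quality = max(candidates, key=quality)
```

Here `ml_any_length` is the single most probable token and `ml_fixed_length` is that token repeated, while `best_quality` uses every distinct token. Optimizing for quality lands somewhere the likelihood ranking would never pick.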

I generally agree that coupling is the main thing necessary for gradient hacking. However, from trying to construct gradient hackers by hand, my intuition is that gradient descent is just really good at credit assignment. For instance, in most reasonable architectures I don't think it's possible to have separate subnetworks for figuring out the correct answer and then just adding the coupling by gating it to save negentropy. To me, it seems the only kinds of strategies that could work are ones where the circuits implementing the cognition that decides to save negentropy are so deeply entangled with the ones getting the right answer that SGD can't separate them (or the ones that break gradient descent to make the calculated gradients inaccurate). I'm not sure if this is possible at all, and if it is, it probably relies on some gory details of how to trap SGD.

I don't think we can even conclude for certain that a lack of measured loglikelihood improvement implies that it won't, though it is evidence. Maybe the data used to measure the behavior doesn't successfully prompt the model to do the behavior, maybe it's phrased in a way the model recognizes as unlikely and so at some scale the model stops increasing likelihood on that sample, etc; as you would say, prompting can show presence but not absence.

Seems like there are multiple possibilities here:

  • (1) The AI does something that will, as an intended consequence, result in human extinction, because this is instrumental to preventing shutdown, etc. It attempts to circumvent our interpretability, oversight, etc. This is the typical deceptive alignment setting that myopia, interpretability, oversight, etc. attempt to address.
  • (2) The AI does something that will, as an unintended side consequence, result in human extinction. The AI also realizes that this is a consequence of its actions but doesn't really care. (This is within the "without ever explicitly thinking about the fact that humans are resisting it" scenario.) This is isomorphic to ELK.
    • If we can solve ELK, we can get the AI to tell us whether it thinks its plan will actually result in human extinction. This is the "oh yeah I am definitely a paperclipper" scenario. 
      • Also, if it has a model of the humans using ELK to determine whether to shut down the AI, the fact that it knows we will shut it off after we find out the consequences of its plan will incentivize it to either figure out how to implement plans whose path to human extinction it cannot itself see (third scenario), or try to subvert our ability to turn it off after we learn of the consequences (first scenario).
    • If we can't solve ELK, we can get the AI to tell us something that doesn't really correspond to the actual internal knowledge inside the model. This is the "yup, it's just thinking about what text typically follows this question" scenario.
  • (3) The AI does something that will, as an unintended side consequence, result in human extinction. The AI does not realize this is a consequence of its actions, so solving ELK doesn't help us here. Failures of this type fall on a spectrum of how unforeseeable the consequences really are.
    • There are failures of this type that occur because the AI could have figured out its impact, but it was negligent. This is the "Hawaii Chaff Flower" scenario.
    • There are failures of this type that occur even if the AI tried its hardest to prevent harm to humans. These failures seem basically unavoidable even if alignment is perfectly solved, so this is mostly outside the realm of alignment. 

These posts are also vaguely related to the idea discussed in the OP (mostly looking at the problem of oversight being hard because of consequences in the world being hard to predict).

(Mostly just stating my understanding of your take back at you to see if I correctly got what you're saying:)

I agree this argument is obviously true in the limit, with the transistor case as an existence proof. I think things get weird at the in-between scales. The smaller the network of aligned components, the more likely it is to be aligned (obviously, in the limit if you have only one aligned thing, the entire system of that one thing is aligned); and also the more modular each component is (or I guess you would say the better the interfaces between the components), the more likely it is to be aligned. And in particular if the interfaces are good and have few weird interactions, then you can probably have a pretty big network of components without it implementing something egregiously misaligned (like actually secretly plotting to kill everyone).

And people who are optimistic about HCH-like things generally believe that language is a good interface and so conditional on that it makes sense to think that trees of humans would not implement egregiously misaligned cognition, whereas you're less optimistic about this and so your research agenda is trying to pin down the general theory of Where Good Interfaces/Abstractions Come From or something else more deconfusion-y along those lines.

Does this seem about right?
