Working on alignment at EleutherAI



I generally agree that coupling is the main thing necessary for gradient hacking. However, from trying to construct gradient hackers by hand, my intuition is that gradient descent is just really good at credit assignment. For instance, in most reasonable architectures I don't think it's possible to have separate subnetworks for figuring out the correct answer and then just adding the coupling by gating it to save negentropy. To me, it seems the only kinds of strategies that could work are ones where the circuits implementing the cognition that decides to save negentropy are so deeply entangled with the ones getting the right answer that SGD can't separate them (or the ones that break gradient descent to make the calculated gradients inaccurate). I'm not sure if this is possible at all, and if it is, it probably relies on some gory details of how to trap SGD.

I don't think we can even conclude for certain that a lack of measured log-likelihood improvement implies the behavior won't emerge, though it is evidence. Maybe the data used to measure the behavior doesn't successfully prompt the model to exhibit it; maybe it's phrased in a way the model recognizes as unlikely, so at some scale the model stops increasing likelihood on that sample; and so on. As you would say, prompting can show presence but not absence.

Seems like there are multiple possibilities here:

  • (1) The AI does something that will, as an intended consequence, result in human extinction, because this is instrumental to preventing shutdown, etc. It attempts to circumvent our interpretability, oversight, etc. This is the typical deceptive alignment setting, which approaches like myopia, interpretability, and oversight attempt to address.
  • (2) The AI does something that will, as an unintended side consequence, result in human extinction. The AI also realizes that this is a consequence of its actions but doesn't really care. (This is within the "without ever explicitly thinking about the fact that humans are resisting it" scenario.) This is isomorphic to ELK.
    • If we can solve ELK, we can get the AI to tell us whether it thinks its plan will actually result in human extinction. This is the "oh yeah I am definitely a paperclipper" scenario. 
      • Also, if it has a model of the humans using ELK to determine whether to shut down the AI, the fact that it knows we will shut it off after we find out the consequences of its plan will incentivize it either to figure out how to implement plans whose path to human extinction it itself cannot see (third scenario), or to subvert our ability to turn it off after we learn of the consequences (first scenario).
    • If we can't solve ELK, we can only get the AI to tell us something that doesn't really correspond to the actual internal knowledge inside the model. This is the "yup, it's just thinking about what text typically follows this question" scenario.
  • (3) The AI does something that will, as an unintended side consequence, result in human extinction. The AI does not realize this is a consequence of its actions, so solving ELK doesn't help us here. Failures of this type fall on a spectrum of how unforeseeable the consequences really are.
    • There are failures of this type that occur because the AI could have figured out its impact, but it was negligent. This is the "Hawaii Chaff Flower" scenario.
    • There are failures of this type that occur even if the AI tried its hardest to prevent harm to humans. These failures seem basically unavoidable even if alignment is perfectly solved, so this is mostly outside the realm of alignment. 

These posts are also vaguely related to the idea discussed in the OP (mostly looking at the problem of oversight being hard because of consequences in the world being hard to predict).

(Mostly just stating my understanding of your take back at you to see if I correctly got what you're saying:)

I agree this argument is obviously true in the limit, with the transistor case as an existence proof. I think things get weird at the in-between scales. The smaller the network of aligned components, the more likely it is to be aligned (obviously, in the limit if you have only one aligned thing, the entire system of that one thing is aligned); and also the more modular each component is (or I guess you would say the better the interfaces between the components), the more likely it is to be aligned. And in particular if the interfaces are good and have few weird interactions, then you can probably have a pretty big network of components without it implementing something egregiously misaligned (like actually secretly plotting to kill everyone).

And people who are optimistic about HCH-like things generally believe that language is a good interface, so conditional on that it makes sense to think that trees of humans would not implement egregiously misaligned cognition; whereas you're less optimistic about this, so your research agenda is trying to pin down the general theory of Where Good Interfaces/Abstractions Come From, or something else more deconfusion-y along those lines.

Does this seem about right?

I agree that in practice you would want to point mild optimization at it, though my preferred resolution (for purely aesthetic reasons) is to figure out how to make utility maximizers that care about latent variables, and then make it try to optimize the latent variable corresponding to whatever the reflection converges to (by doing something vaguely like logical induction). Of course the main obstacles are how the hell we actually do this, and how we make sure the reflection process doesn't just oscillate forever.

(Transcribed in part from Eleuther discussion and DMs.)

My understanding of the argument here is that you're using the fact that you care about diamonds as evidence that whatever the brain is doing is worth studying, with the hope that it might help us with alignment. I agree with that part. However, I disagree with the part where you claim that things like CIRL and ontology identification aren't as worthy of being elevated to consideration. I think there exist lines of reasoning that these fall naturally out as subproblems, and the fact that they fall out of these other lines of reasoning promotes them to the level of consideration.

I think there are a few potential cruxes of disagreement from reading the posts and our discussion:

  • You might be attributing far broader scope to the ontology identification problem than I would; I think of ontology identification as an interesting subproblem that recurs in a lot of different agendas, that we may need to solve in certain worst plausible cases / for robustness against black swans.
  • In my mind ontology identification is one of those things where it could be really hard worst case or it could be pretty trivial, depending on other things. I feel like you're pointing at "humans can solve this in practice" and I'm pointing at "yeah but this problem is easy to solve in the best case and really hard to solve in the worst case."
  • More broadly, we might disagree on how scalable certain approaches used in humans are, or how surprising it is that humans solve certain problems in practice. I generally don't find arguments about humans implementing a solution to some hard alignment problem compelling, because almost always when we're trying to solve the problem for alignment we're trying to come up with an airtight robust solution, and humans implement the kludgiest, most naive solution that works often enough.
  • I think you're attributing more importance to the "making it care about things in the real world, as opposed to wireheading" problem than I am. I think of this as one subproblem of embeddedness that might turn out to be difficult, that falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix. This applies to shard theory more broadly.

I also think the criticism of invoking things like AIXI-tl is missing the point somewhat. As I understand it, the point that is being made when people think about things like this is that nobody expects AGI to actually look like AIXI-tl or be made of Bayes nets, but this is just a preliminary concretization that lets us think about the problem, and substituting this is fine because this isn't core to the phenomenon we're poking at (and crucially the core of the thing that we're pointing at is something very limited in scope, as I listed in one of the cruxes above). As an analogy, it's like thinking about computational complexity by assuming you have an infinitely large Turing machine and pretending coefficients don't exist or something, even though real computers don't look remotely like that. My model of you is saying "ah, but it is core, because humans don't fit into this framework and they solve the problem, so by restricting yourself to this rigid framework you exclude the one case where it is known to be solved." To which I would point to the other crux and say "au contraire, humans do actually fit into this formalism, it works in humans because humans happen to be the easy case, and this easy solution generalizing to AGI would exactly correspond to scanning AIXI-tl's Turing machines for diamond concepts just working without anything special." (see also: previous comments where I explain my views on ontology identification in humans).

(Partly transcribed from a correspondence on Eleuther.)

I disagree about concepts in the human world model being inaccessible in theory to the genome. I think lots of concepts could be accessed, and that (2) is true in the trilemma.

Consider: As a dumb example that I don't expect to actually be the case but which gives useful intuition, suppose the genome really wants to wire something up to the tree neuron. Then the genome could encode a handful of images of trees, and once the brain is fully formed it can go through and search for whichever neuron activates the hardest on those images. (Of course it wouldn't actually do literal images, but I expect compressing it down to not actually be that hard.) The more general idea is that we can specify concepts in the world model extensionally, by specifying constraints that the concept has to satisfy (for instance, it should activate on these particular data points, or it should have this particular temporal consistency, etc.). Keep in mind this means the genome just has to vaguely gesture at the concept, not define the decision boundary exactly.
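The "search for whichever neuron activates hardest on stored exemplars" idea can be sketched as a simple selection procedure. This is a toy illustration under stated assumptions, not a claim about biological mechanism: a random ReLU feature layer stands in for a learned world model, and random vectors stand in for the genome's compressed "images of trees."

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a learned world model: one ReLU feature layer
# with 128 "neurons" over a 64-dimensional input space.
W = rng.normal(size=(64, 128))

def activations(x):
    """Neuron activations for a batch of inputs; shape (batch, 128)."""
    return np.maximum(x @ W, 0.0)

# Hypothetical hard-coded exemplars of the target concept
# (standing in for the genome's stored tree images).
exemplars = rng.normal(size=(5, 64))

# Extensional pointer: select whichever neuron activates hardest,
# on average, across the exemplars. The exemplars only need to
# gesture at the concept, not pin down its decision boundary.
mean_act = activations(exemplars).mean(axis=0)
concept_neuron = int(np.argmax(mean_act))
print(concept_neuron)
```

The same scheme extends to the other constraints mentioned above (activation on particular data points, temporal consistency): each constraint scores every neuron, and the pointer is whichever neuron best satisfies them jointly.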

If this sounds familiar, that's because this basically corresponds to the naivest ELK solution where you hope the reporter generalizes correctly. This probably even works for lots of current NNs. The fact that this works in humans and possibly current NNs, though, is not really surprising to me, and doesn't necessarily imply that ELK continues to work in superintelligence. In fact, to me, the vast majority of the hardness of ELK is making sure it continues to work up to superintelligence/arbitrarily weird ontologies. One can argue for natural abstractions, but that would be an orthogonal argument to the one made in this post. This is why I think (2) is true, though I think the statement would be more obvious if stated as "the solution in humans doesn't scale" rather than "can't be replicated".

Note: I don't expect very many things like this to be hard coded; I expect only a few things to be hard coded and a lot of things to result as emergent interactions of those things. But this post is claiming that the hard coded things can't reference concepts in the world model at all.

As for more abstract concepts: I think encoding the concept of, say, death, is actually extremely doable extensionally. There are a bunch of ways we can point at the concept of death relative to other anticipated experiences/concepts (e.g., the thing that follows serious illness and pain, unconsciousness/the thing that's like dreamless sleep, the thing that we observe happens to other beings that causes them to become disempowered, etc.). Anecdotally, people do seem to be afraid of death in large part because they're afraid of losing consciousness, the pain that comes before it, the disempowerment of no longer being able to affect things, etc. Again, none of these things have to point exactly at death; they just serve to select out the neuron(s) that encode the concept of death. Further evidence for this theory: humans across many cultures and even many animals pretty reliably develop an understanding of death in their world models, so it seems plausible that evolution would have had time to wire things up, and it's a fairly well-known phenomenon that very small children who don't yet have well-formed world models tend to endanger themselves with seemingly no fear of death. This all also seems consistent with the fact that lots of things we seem fairly hardwired to care about (e.g., death, happiness) splinter; we're wired to care about things as specified by some set of points that were relevant in the ancestral environment, and the splintering happens because those points don't actually define a sharp decision boundary.

As for why I think more powerful AIs will have more alien abstractions: I think there are many situations where the human abstractions are used because they are optimal for a mind with our constraints. In some situations, given more computing power, you ideally want to model things at a lower level of abstraction. If you can calculate how the coin will land by modelling the air currents and its rotational speed, you want to do that to predict the exact outcome, rather than abstracting it away as a Bernoulli process. Conversely, sometimes there are high levels of abstraction that carve reality at the joints but require fitting too much stuff in your mind at once, or involve regularities of the world that we haven't discovered yet. Consider how an understanding of thermodynamics lets you predict macroscopic properties of a system, but only if you already know about it and are capable of understanding it. Thus, it seems highly likely that a powerful AI would develop very weird abstractions from our perspective. To be clear, I still think the natural abstractions hypothesis is likely enough to be true that it's worth elevating as a hypothesis under consideration, and a large part of my remaining optimism lies there, but I don't think it's automatically true at all.

Computationally expensive things are less likely to show up in your simulation than in the real world, because you only have so much compute to run the simulation. You can't convincingly fake the AI having access to a supercomputer.

The possibility is that Alice might always be able to tell that she's in a simulation, no matter what we condition on. I think this is pretty much precluded by the assumption that the generative model is a good model of the world; but if that assumption fails, then it's possible Alice can tell she's in a simulation no matter what we do. So a lot rides on the claim that the generative model remains a good model of the world regardless of what we condition on.

Paul's RSA-2048 counterexample is an example of a way our generative model can fail to be good enough no matter how hard we try. The core idea is that there exist things that are extremely computationally expensive to fake and very cheap to check the validity of, so faking them convincingly will be extremely hard.
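To make the asymmetry concrete: verifying a claimed factorization takes one multiplication, while producing one for a 2048-bit semiprime is believed to be computationally infeasible. A minimal sketch, using a small toy semiprime since RSA-2048's factors are of course unknown:

```python
def check_factorization(n: int, p: int, q: int) -> bool:
    """Cheap direction: one multiplication plus triviality checks.
    The expensive direction -- finding p and q given only n -- is
    what a simulator would have to do to fake this convincingly."""
    return 1 < p < n and 1 < q < n and p * q == n

# Toy semiprime standing in for RSA-2048.
p, q = 1000003, 1000033
n = p * q

print(check_factorization(n, p, q))         # the real factors check out
print(check_factorization(n, 17, n // 17))  # a fake is caught instantly
```

The simulator's options are to spend the (astronomically large) compute to factor the number for real, or to present a fake that the AI can reject in microseconds.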

Liked this post a lot. In particular I think I strongly agree with "Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument" as the general vibe of how I feel about Eliezer's arguments. 

A few comments on the disagreements:

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.”

An in-between position would be to argue that even if we're maximally competent at the institutional problem, and can extract all the information we possibly can through experimentation before the first critical try, that just prevents the really embarrassing failures. Irrecoverable failures could still pop up every once in a while after entering the critical regime that we just could not have been prepared for, unless we have a full True Name of alignment. I think the crux here depends on your view of the Murphy-constant of the world (i.e., how likely we are to get unknown-unknown failures), and how long you think we need to spend in the critical regime before our automated alignment research assistants solve alignment.

By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans in more boring ways or else radically advance the state of human R&D.

For what it's worth, I think the level of tech needed to overpower humans in more boring ways is a substantial part of my "doom cinematic universe" (and I usually assume nanobots is meant metaphorically). In particular, I think it's plausible the "slightly-less-impressive-looking" systems that come before the first x-risk AI won't look obviously like the step before x-risk any more than current scary capabilities advances do, because of uncertainty over the exact angle of the risk (related to the Murphy crux above) plus discontinuous jumps in specific capabilities, as we currently see in ML.

if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff.

SGD is definitely far from perfect optimization, and it seems plausible that if concealment against SGD is a thing at all, then it would be due to some kind of instrumental thing that a very large fraction of powerful AI systems converge on.

Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research

I think there are a lot of different cruxes hiding inside the question of how AI acceleration of alignment research interacts with P(doom), including how hard alignment is, and whether AGI labs will pivot to focus on alignment (some earlier thoughts here), even assuming we can align the weak systems used for this. Overall I feel very uncertain about this.

Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects

Explicitly registering agreement with this prediction.

Eliezer is relatively confident that you can’t train powerful systems by imitating human thoughts, because too much of human thinking happens under the surface.

Fwiw, I interpreted this as saying that it doesn't work as a safety proposal (see also: my earlier comment). Also seems related to his arguments about ML systems having squiggles.
