Wiki Contributions


On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but has the following issues:

  1. Practical considerations: AI systems currently tend to require lots of examples and it's expensive to get these if they all have to be provided by a human.
  2. Some actions look good to a casual human observer, but are actually bad on closer inspection. The AI would be rewarded for finding and taking such actions.
  3. If you're training a neural network, then there are generically going to be lots of adversarial examples for that network. As the AI gets more and more powerful, we'd expect it to be able to generate more and more situations where its learned value function gives a high reward but a human would give a low reward. So it seems like we end up playing a game of adversarial example whack-a-mole for a long time, where we're just patching hole after hole in this million-dimensional bucket with thousands of holes. Probably the AI manages to kill us before that process converges.
  4. To make the above worse, there's this idea of a sharp left turn, where a sufficiently intelligent AI can think of very weird plans that go far outside of the distribution of scenarios that it was trained on. We expect generalization to get worse in this regime, and we also expect an increased frequency of adversarial examples. (What would help a lot here is designing the AI to have an interpretable planning system, where we could run these plans forward and negatively reinforce the bad ones (and maybe all the weird ones, because of corrigibility reasons, though we'd have to be careful about how that's formulated because we don't want the AI trying to kill us because it thinks we'd produce a weird future).)
  5. Once the AI is modelling reality in detail, its reward function is going to focus on how the rewards are actually being piped to the AI, rather than on the human evaluator's reaction, let alone some underlying notion of goodness. If the human evaluators just press a button to reward the AI for doing a good thing, the AI will want to take control of that button and stick a brick on top of it.
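
The whack-a-mole dynamic in point 3 can be sketched in code. This is a cartoon, not a real trained network: the spurious high-value bumps are placed by hand as stand-ins for the adversarial regions a real learned value function would have, and all names and numbers are invented for illustration:

```python
import numpy as np

def true_value(x):
    """'Human' value: high only near the origin."""
    return float(np.exp(-np.sum(x ** 2)))

# Cartoon learned value function: agrees with the human near the training
# distribution, but has spurious high-value bumps far away -- hand-placed
# stand-ins for the adversarial regions of a real trained network.
BUMPS = [np.array([5.0, 0.0]), np.array([0.0, -6.0]), np.array([4.0, 4.0])]

def learned_value(x, patches):
    v = true_value(x)
    for c in BUMPS:
        v += 0.9 * float(np.exp(-np.sum((x - c) ** 2)))
    for c in patches:  # negative reinforcement applied in earlier rounds
        v -= 0.9 * float(np.exp(-np.sum((x - c) ** 2)))
    return v

def ascend(x, patches, steps=200, lr=0.3, eps=1e-4):
    """The 'AI' searching for high learned value (numerical gradient ascent)."""
    for _ in range(steps):
        g = np.zeros_like(x)
        for i in range(len(x)):
            d = np.zeros_like(x)
            d[i] = eps
            g[i] = (learned_value(x + d, patches)
                    - learned_value(x - d, patches)) / (2 * eps)
        x = x + lr * g
    return x

# Whack-a-mole: each round the search finds a spurious optimum; the human
# rates it badly, which patches that one hole -- and only that one.
patches = []
for bump in BUMPS:
    x_adv = ascend(bump + 0.3, patches)
    assert learned_value(x_adv, patches) > 0.5   # learned value says: great!
    assert true_value(x_adv) < 0.01              # human value says: terrible
    patches.append(x_adv)
```

Each round finds and patches exactly one hole, and nothing in the loop tells us when the last hole has been found, which is the worry about this process ever converging.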

On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I'm pretty skeptical that it provides much value:

  1. The AI can try and fool the critic just like it would fool humans. It doesn't even need a realistic world model for this, since using the critic to inform the training labels leaks information about the critic to the AI.
  2. It's therefore very important that the critic model generates all the strong and relevant criticisms of a particular AI output. Otherwise the AI could just route around the critic.
  3. On some kinds of task, you'll have an objective source of truth you can train your model on. The value of an objective source of truth is that we can use it to generate a list of all the criticisms the model should have made, which lets us update the weights of the critic model based on any criticisms it failed to make. On other kinds of task, which are the ones we're primarily interested in, it will be very hard or impossible to get the ground-truth list of criticisms, so we won't be able to update the weights of the model that way during training. So in some sense, we're trying to generalize this idea of "a strong and relevant criticism" between tasks of differing levels of difficulty.
  4. This requirement of generating all criticisms seems very similar to the task of getting a generative model to cover all modes. I guess we've pretty much licked mode collapse by now, but "don't collapse everything down to a single mode" and "make sure you've got good coverage of every single mode in existence" are different problems, and I think the second one is much harder.
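
The gap between "don't collapse" and "cover every mode" can be illustrated with a toy generator over a known mixture (everything here is a hypothetical stand-in; for a critic model the "modes" would be criticisms, not Gaussians):

```python
import numpy as np

rng = np.random.default_rng(0)

# True data distribution: 8 well-separated modes on a circle of radius 10.
modes = np.array([[10 * np.cos(t), 10 * np.sin(t)]
                  for t in 2 * np.pi * np.arange(8) / 8])

def sample_generator(n, covered):
    """Stand-in generator that only ever emits from the `covered` modes."""
    idx = rng.choice(covered, size=n)
    return modes[idx] + rng.normal(0, 0.3, (n, 2))

def coverage(samples, radius=2.0):
    """Which modes receive at least one nearby sample?"""
    d = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
    return {int(m) for m in np.flatnonzero((d < radius).any(axis=0))}

full = coverage(sample_generator(1000, covered=list(range(8))))
partial = coverage(sample_generator(1000, covered=[0, 1, 2, 3, 4, 5]))

assert full == set(range(8))   # no collapse AND full coverage
assert len(partial) == 6       # no collapse, yet 2 modes silently missing
```

The second generator would pass any "did it collapse to one mode?" test while still quietly failing the coverage requirement, which is the harder property to verify when, unlike here, you don't already know where all the modes are.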

On using AI systems, in particular large language models, to advance alignment research: This is not going to work.

  1. LLMs are super impressive at generating text that is locally coherent for a much broader definition of "local" than was previously possible. They are also really impressive as a compressed version of humanity's knowledge. They're still known to be bad at math, at sticking to a coherent idea and at long chains of reasoning in general. These things all seem important for advancing AI alignment research. I don't see how the current models could have much to offer here. If the thing is advancing alignment research by writing out text that contains valuable new alignment insights, then it's already pretty much a human-level intelligence. We talk about AlphaTensor doing math research, but even AlphaTensor didn't have to type up the paper at the end!
  2. What could happen is that the model writes out a bunch of alignment-themed babble, and that inspires a human researcher into having an idea, but I don't think that provides much acceleration. People also get inspired while going on a walk or taking a shower.
  3. Maybe something that would work a bit better is to try training a reinforcement-learning agent that lives in a world where it has to solve the alignment problem in order to achieve its goals. E.g. in the simulated world, your learner is embodied in a big robot, and there's a door in the environment it can't fit through, but it can program a little robot to go through the door and perform some tasks for it. And there's enough hidden information and complexity behind the door that the little robot needs to have some built-in reasoning capability. There are a lot of challenges here, though. Like how do you come up with a programming environment that's simple enough that the AI can figure out how to use it, while still being complex enough that the little robot can do some non-trivial reasoning, and that the AI has a chance of discovering a new alignment technique? Could be it's not possible at all until the AI is quite close to human-level.

I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.

This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We'd ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we'd want them to.
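
A minimal numerical cartoon of that failure, with the frozen critic's off-distribution error written in by hand (the error term is an assumption standing in for how a real frozen network might extrapolate, not a claim about any particular architecture):

```python
import numpy as np

def true_reward(a):
    return -(a - 1.0) ** 2                    # best action: a = 1

def frozen_critic(a):
    # Hand-built stand-in for a critic that was accurate on the rated
    # region (roughly a in [-2, 3]) but whose extrapolation error grows
    # off-distribution -- the "holes" a strong actor can exploit.
    off_dist = np.maximum(0.0, np.abs(a - 0.5) - 2.5)
    return true_reward(a) + 2.0 * off_dist ** 2

# While labels flow, actor and critic agree on the rated region:
a_good = 1.0
assert frozen_critic(a_good) == true_reward(a_good) == 0.0

# Labels stop, critic frozen; the actor keeps optimizing against it.
candidates = np.linspace(-10.0, 10.0, 2001)
a_adv = candidates[np.argmax(frozen_critic(candidates))]

assert frozen_critic(a_adv) > frozen_critic(a_good)  # critic loves it...
assert true_reward(a_adv) < -25.0                    # ...humans would not
```

The actor's best move against the frozen critic is to wander as far off-distribution as it can, exactly where the critic was never trained and its judgment is worthless.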

7: Did I forget some important question that someone will ask in the comments?


Is there a way to deal with the issue of there being multiple ROSE points in some games? If Alice says "I think we should pick ROSE point A" and Bob says "I think we should pick ROSE point B", then you've still got a bargaining game left to resolve, right?

Anyways, this is an awesome post, thanks for writing it up!

  1. Modelling humans as having free will: A peripheral system identifies parts of the agent's world model that are probably humans. During the planning phase, any given plan is evaluated twice: The first time as normal, the second time the outputs of the human part of the model are corrupted by noise. If the plan fails the second evaluation, then it probably involves manipulating humans and should be discarded.
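
A sketch of how that double evaluation might look, with a toy world model (the plans, the human submodel, and the collapse threshold are all invented for illustration; a manipulative plan is modeled as one whose value depends on the human responding in a narrow, precise way):

```python
import numpy as np

rng = np.random.default_rng(0)

def human_model(prompt, noise=0.0):
    """Toy 'human part' of the world model: how strongly the human
    responds to a given prompt, optionally corrupted by noise."""
    return np.tanh(prompt) + noise * rng.normal()

def plan_value(plan, noise=0.0):
    if plan == "build_factory":
        # Value comes from the physical world, not the human's reaction.
        return 10.0 + 0.1 * human_model(0.2, noise)
    if plan == "manipulate_operator":
        # Value exists only if the human responds in a narrow, precise way.
        r = human_model(3.0, noise)
        return 12.0 if 0.9 < r < 1.0 else 0.0
    raise ValueError(plan)

def screened_value(plan, trials=20, noise=1.0):
    """Evaluate the plan twice: once as normal, once with the human part
    of the model corrupted by noise. Discard plans whose value collapses."""
    clean = plan_value(plan)
    noisy = np.mean([plan_value(plan, noise) for _ in range(trials)])
    return clean if noisy > 0.5 * clean else float("-inf")

assert screened_value("build_factory") > 0                     # robust plan
assert screened_value("manipulate_operator") == float("-inf")  # discarded
```

The honest plan's value barely moves under corruption because it doesn't route through the human; the manipulative plan's value collapses, so the screen discards it.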

Posting this comment to start some discussion about generalization and instrumental convergence (disagreements #8 and #9).

So my general thoughts here are that ML generalization is almost certainly not good enough for alignment. (At least in the paradigm of deep learning.) I think it's true with high confidence that if we're trying to train a neural net to imitate some value function, and that function takes a high-dimensional input, then it will be possible to find lots of inputs that cause the network to produce a high value when the value function produces a low one, or vice-versa. In other words, adversarial inputs exist. This is true even when the function is simple enough that the network certainly has more than enough Bayesian evidence to pin down the function. As far as I know, we haven't yet discovered a way to really fix this problem, though there are certainly ways to make the adversarial inputs a little more rare/harder to find.

Paul also mentions that high intelligence isn't a separate regime that the AI needs to generalize to, but rather that the AI can be trained continuously as its intelligence increases. I agree with this, but I don't think it constitutes a valid objection, since the regimes that we actually want the AI to generalize between are cooperation and betrayal. Generally these would look pretty different, with betrayal plans involving the AI tiling space with adversarial examples, etc. And we'd generally expect a discontinuous switch to betrayal only when the AI is confident it can win, so there's not really an opportunity to train the AI on betrayal examples beforehand.

Yes, sounds right to me. It's also true that one of the big unproven assumptions here is that we could create an AI strong enough to build such a tool, but too weak to hack humans. I find it plausible, personally, but I don't yet have an easy-to-communicate argument for it.

Okay, I will try to name a strong-but-checkable pivotal act.

(Having a strong-but-checkable pivotal act doesn't necessarily translate into having a weak pivotal act. Checkability allows us to tell the difference between a good plan and a trapped plan with high probability, but the AI has no reason to give us a good plan. It will just produce output like "I have insufficient computing power to solve this problem" regardless of whether that's actually true. If we're unusually successful at convincing the AI our checking process is bad when it's actually good, then that AI may give us a trapped plan, which we can then determine is trapped. Of course, one should not risk executing a trapped plan, even if one thinks one has identified and removed all the traps. So even if #30 is false, we are still default-doomed. (I'm not fully certain that we couldn't create some kind of satisficing AI that gets reward 1 if it generates a safe plan, reward 0 if its output is neither helpful nor dangerous, and reward -1 if it generates a trapped plan that gets caught by our checking process. The AI may then decide that it has a higher chance of success if it just submits a safe plan. But I don't know how one would train such a satisficer with current deep learning techniques.))

The premise of this pivotal act is that even mere humans would be capable of designing very complex nanomachines, if only they could see the atoms in front of them, and observe the dynamics as they bounce and move around on various timescales. Thus, the one and only output of the AI will be the code for fast and accurate simulation of atomic-level physics. Being able to get quick feedback on what would happen if you designed such-and-such a thing not only helps with being able to check and iterate designs quickly, it means that you can actually do lots of quick experiments to help you intuitively grok the dynamics of how atoms move and bond.

This is kind of a long comment, and I predict the next few paragraphs will be review for many LW readers, so feel free to skip to the paragraph starting with "SO HOW ARE YOU ACTUALLY GOING TO CHECK THE MOLECULAR DYNAMICS SIMULATION CODE?".

Picture a team of nano-engineers designing some kind of large and complicated nanomachine. Each engineer wears a VR headset so they can view the atomic structure they're working on in 3d, and each has VR gloves with touch feedback so they can move the atoms around. The engineers all work on various components of the larger nanomachine that must be built. Often there are standardized interfaces for transfer of information, energy, or charge. Other times, interfaces must be custom-designed for a particular purpose. Each component might connect to several of these interfaces, as well as being physically connected to the larger structure.

The hardest part of nanomachines is probably going to be the process of actually manufacturing them. The easiest route from current technology is to take advantage of our existing DNA synthesis tech to program ribosomes to produce the machines we want. The first stage of machines would be made from amino acids, but from there we could build machines that built other machines and bootstrap our way up to being able to build just about anything. This bootstrapping process would be more difficult than the mere design process for the final machine, and the first stage where we have to build things entirely out of amino acids sounds particularly brutal. But just as people could write the first compilers in machine code, it should be possible to figure out how to get things done even in the insanely constrained domain of amino acids. And this step only has to be done once before a whole world opens up.

The first obvious question is "what about quantum mechanics?". The answer is that we don't care too much about it. It makes computing the dynamics harder, of course, but most nanomachines will probably interact with their environment frequently enough that they behave nearly classically. QM is important for determining what local chemical reactions take place, but there's no long-range entanglement to worry about. That's also helpful for allowing the human engineers to get a handle on the dynamics. The main effect of the frequent interaction with the environment is that the dynamics becomes somewhat stochastic. Often you will see engineers run the same simulation several times, so they can understand the distribution of possible outcomes. As the final design is approached, the engineers run it through thousands of simulations so that even rare thermal fluctuations are accounted for.
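
A 1-D cartoon of that workflow: overdamped Langevin dynamics in a double-well potential, run many times so that rare barrier-crossing events show up in the outcome distribution (the potential and all parameters are made-up illustration, not a real molecular-dynamics method):

```python
import numpy as np

def simulate(seed, steps=5000, dt=1e-3, kT=0.2):
    """Overdamped Langevin dynamics in the double-well U(x) = (x^2 - 1)^2,
    a 1-D cartoon of thermally noisy atomic motion."""
    rng = np.random.default_rng(seed)
    x = -1.0                                  # start in the left well
    for _ in range(steps):
        force = -4.0 * x * (x ** 2 - 1.0)     # -dU/dx
        x += force * dt + np.sqrt(2.0 * kT * dt) * rng.normal()
    return x

# Same design, many independent runs: thermal noise turns the outcome into
# a distribution, and rare barrier crossings only show up with repetition.
finals = np.array([simulate(seed) for seed in range(200)])
p_hop = float(np.mean(finals > 0.0))          # fraction that crossed over

assert np.all(np.abs(finals) < 3.0)           # trajectories stay near the wells
assert p_hop < 0.5                            # hops are the rare event here
```

A single run tells you one outcome; only the ensemble tells you how often a thermal fluctuation kicks the system over the barrier, which is why the final design gets thousands of repetitions.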

The zeroth obvious question is "what will you build with it?". The answer is that a variety of things could be built, and there might be many side-projects produced to raise funds for the main project, but the main project is a scanner that can read a cryonics patient's brain one layer at a time in sufficient detail to faithfully emulate them. This is a pretty standard pivotal act, and has been written about elsewhere, but the main idea is that we can freeze, scan, and maybe copy a bunch of alignment researchers and other smart folks to solve alignment at a much faster speed than the outside world is working on their unaligned AIs. Or, the researchers might be able to foom while maintaining their values by making careful changes to their own minds. Besides the scanner itself, this project would probably need to design extremely fast special-purpose computing hardware for running the ems, along with various kinds of tiny probes and other experimental equipment to collect data on how a scan translates to brain-function. (An ordinary dumb neural network should be sufficient for this kind of inference problem.)

SO HOW ARE YOU ACTUALLY GOING TO CHECK THE MOLECULAR DYNAMICS SIMULATION CODE? There are two possible attack modes. One is to exploit any possible vulnerabilities in the hardware running the code to achieve unwanted side effects, and use those side effects to escape from containment. The second is to somehow make the output of the simulation subtly flawed, such that the humans, working off of flawed data, accidentally design a nanomachine that lets the AI out of the box instead of doing what we were designing that machine to do. Obviously the first attack mode is much more desirable for the AI than the second. Actually, the second attack mode may be so hard to pull off that we don't need to check for it at all, but not checking for things when you have the capability to check for them is silly.

Checks for false physics that might trick the humans into building trapped nanomachines:

  • Compare with the real dynamics, which we know but which is slower for us to simulate. In real physics, we can break down the world into little cubes and the dynamics are symmetrical around translating those cubes in space and time. So if we request a simulation method that has those same symmetries, we can check the simulation on small length and time scales, which are feasible for us to simulate. This doesn't guarantee that a larger simulation would be correct, since small errors can multiply out of control as the simulation size gets larger, but it does mean that the dynamics of the large simulation have to bear a pretty-close resemblance to real physics.
  • We can also request that the code provably obeys conservation of momentum, energy, charge, etc.
  • Use differential fuzzing techniques: This is where you use a fuzzer (implemented with techniques like GOFAI and dumb neural networks) to try to find inputs to the simulation such that the output of the AI's code differs from the output of our slow-but-known-correct code. The stronger the optimizer we use here, the harder it is for the AI to avoid detection.
  • Request that the AI give us error bounds on the simulation, along with machine-checkable proofs that they hold.
  • Side channel attack and its mitigation: The AI can steer us away from certain designs and towards others by making the simulation take different amounts of time depending on the configuration of atoms we submit. Therefore we request code that takes a constant amount of time to run, plus proof that it does so. In order to make this easier, the code will be run on a GPU architecture with cache levels explicitly managed in the code, and where none of the cores have branch prediction.

The checks for the first attack method are pretty much just normal computer security, but at extremely high stakes. Just so we're clear on what the picture is like here: Generally the AI is not going to know in perfect detail what the hardware is like that its simulation code will be running on, though we at least have to give it the instruction set, or some high-level language that compiles to that instruction set. (Probably the high level language is the better option. That way we can design the language so that properties like constant run-time and space-time symmetry are guaranteed-true of all valid programs in that language.) So the best technique for the AI will be to submit simulation code that has a hidden optimizer embedded into it. This optimizer then searches for possible escape methods while it runs. Even though different runs of the code don't share memory, they'll have different inputs, which the optimizer can use as a random seed to determine which part of the escape-strategy search space it should be trying. If there are any holes at all in our security, the AI will be able to find and exploit them. All those engineers in VR headsets running simulations? In each case, the computer running the simulation has a hostile AI on it, trying everything it can think of to get out. In general, every aspect of the system must be proved correct, from the hardware, to the high-level language we give to the AI. In addition, everything must be thoroughly unit tested, integration tested, fuzz tested, and given to the best security researchers who will try to find side-channel attacks, etc. Other people will write a proof that's as close as humanly possible to an actual mathematical proof that no side channel attacks are possible, and then everyone will try everything they can to poke holes in that proof.

Okay, yeah, we get it, you're going to try really really hard. Don't you know that the universe doesn't owe you success just because of how much you tried? No matter what security you put in place as a mere human, the AI is going to get around it just because it's so much smarter. But even a superintelligent AI would not be able to break a one-time pad. Depending on the details of computational complexity, it probably wouldn't be able to crack our best symmetric ciphers today either. For a superintelligent AI to succeed at a task, that task must actually be possible. If we can actually design a system with no holes, which takes a large but finite amount of work and intelligence, then a superintelligence will not be able to break it, no matter how much smarter than us it is.

Thanks for writing this. I agree with all of these except for #30, since it seems like checking the output of the AI for correctness/safety should be possible even if the AI is smarter than us, just like checking a mathematical proof can be much easier than coming up with the proof in the first place. It would take a lot of competence, and a dedicated team of computer security / program correctness geniuses, but definitely seems within human abilities. (Obviously the AI would have to be below the level of capability where it can just write down an argument that convinces the proof checkers to let it out of the box. This is a sense in which having the AI produce uncommented machine code may actually be safer than letting it write English at us.)

The Ultimatum game seems like it has pretty much the same type signature as the prisoner's dilemma: Payoff matrix for different strategies, where the players can roll dice to pick which strategy they use. Does timeless decision theory return the "correct answer" (second player rejects greedy proposals with some probability) when you feed it the Ultimatum game?
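
For what it's worth, here's one standard probabilistic-rejection scheme for the responder (my illustration, not necessarily what timeless decision theory or ROSE outputs): accept low offers with just enough probability that the proposer's expected payoff never exceeds what the fair split would give.

```python
def accept_prob(x, fair=0.5):
    """Responder's acceptance probability for an offer x of a unit pie."""
    if x >= fair:
        return 1.0
    return (1.0 - fair) / (1.0 - x)   # tuned so that greed gains nothing

def proposer_ev(x, fair=0.5):
    """Proposer keeps 1 - x whenever the offer is accepted."""
    return (1.0 - x) * accept_prob(x, fair)

# Offering less than the fair split leaves the proposer's expected payoff
# pinned at 1 - fair, so the greedy deviation is never profitable.
for x in [0.0, 0.1, 0.3, 0.5, 0.7]:
    assert proposer_ev(x) <= 0.5 + 1e-12
assert abs(proposer_ev(0.1) - 0.5) < 1e-12
```

Under this policy the proposer is exactly indifferent between all sub-fair offers and the fair one, so there is nothing to gain by lowballing.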

Large genomes have (at least) 2 kinds of costs. The first is the energy and other resources required to copy the genome whenever your cells divide. The existence of junk DNA suggests that this cost is not a limiting factor. The other cost is that a larger genome will have more mutations per generation. So maintaining that genome across time uses up more selection pressure. Junk DNA requires no maintenance, so it provides no evidence either way. Selection pressure cost could still be the reason why we don't see more knowledge about the world being translated genetically.

A gene-level way of saying the same thing is that even a gene that provides an advantage may not survive if it takes up a lot of genome space, because it will be destroyed by the large number of mutations.
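
As a back-of-envelope version of that argument (the mutation rate and lengths are assumed round numbers, not measured values): with per-base mutation rate u, a functional sequence of length L is disrupted at a rate of roughly u·L per generation, so it needs a selective advantage s greater than about u·L to persist.

```python
# Assumed round numbers for illustration (not measurements).
MU = 1e-8   # per-base, per-generation deleterious mutation rate

def min_advantage(length_bp, mu=MU):
    """Selective advantage needed to maintain a functional sequence of
    `length_bp` bases against recurrent mutation: roughly s > mu * L."""
    return mu * length_bp

small_gene = min_advantage(1_000)       # ~1e-5: cheap to maintain
big_locus = min_advantage(10_000_000)   # ~0.1: a huge fitness bill

assert small_gene < 1e-4 < big_locus
```

A megabase-scale "innate world knowledge" locus would need to pay a selective advantage on the order of percent just to avoid decaying away, which is one way to state the selection-pressure cost.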
