DaemonicSigil - AI Alignment Forum

Counting arguments provide no evidence for AI doom

More generally, John Miller and colleagues have found training performance is an excellent predictor of test performance, even when the test set looks fairly different from the training set, across a wide variety of tasks and architectures.

Figure 1 from Miller et al. seems to be a plot of test performance vs. "out of distribution" test performance. One might expect plots of training performance vs. "out of distribution" test performance to have more spread.

Explaining grokking through circuit efficiency

DaemonicSigil1y30

The Unexpected Clanging

DaemonicSigil1y10

Interesting. This prank seems to be one you could play on a Logical Inductor, I wonder what the outcome would be? One fact that's possibly related is that computable functions are continuous. This would imply that whatever computable function Omega applies to your probability estimate, there exists a fixed point probability you can choose where you'll be correct about the monkey probability. Of course if you're a bounded agent thinking for a finite amount of time, you might as well be outputting rational probability estimates, in which case functions like become computable for Omega.
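As a rough illustration of the fixed-point claim: any continuous function from [0, 1] to [0, 1] has a point where f(p) = p, and bisection on f(p) - p will find one. The sketch below assumes a made-up response function `omega_response` (it is not from the original post) and is only meant to show the mechanics.

```python
def find_fixed_point(f, tol=1e-9):
    """Bisection on g(p) = f(p) - p for a continuous f mapping [0, 1] into [0, 1].

    Since f(0) >= 0 and f(1) <= 1, we have g(0) >= 0 and g(1) <= 0,
    so a sign change (and hence a fixed point) is bracketed by [0, 1].
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) - mid >= 0:
            lo = mid   # fixed point lies at or above mid
        else:
            hi = mid   # fixed point lies below mid
    return (lo + hi) / 2


def omega_response(p):
    # Hypothetical continuous rule Omega applies to your stated probability p.
    return 1.0 - p * p


p_star = find_fixed_point(omega_response)
print(p_star, omega_response(p_star))
```

For this particular toy rule the fixed point is where p = 1 - p², so stating that probability makes the estimate self-consistent.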

Making Nanobots isn't a one-shot process, even for an artificial superintelligance

DaemonicSigil1y30

Counterargument 2 still seems correct to me.

Techniques like Density Functional Theory give fairly accurate results for molecular systems in far less time than a full quantum mechanical calculation would take. While in theory quantum computation is believed to be in a complexity class beyond what classical computers can handle efficiently, in practice it seems possible to get quite good results even on a classical computer. The hardness of simulating atoms and molecules looks like it depends heavily on progress in algorithms, rather than being based on the hardness of simulating arbitrary quantum circuits. Even we humans are continuing to research and come up with improved techniques for molecular simulation. Look up "FermiNet" for one relatively recent AI-based advance. The concern is that a superintelligence may be able to skip ahead in this timeline of algorithmic improvements.

While physicists sometimes claim to derive things from first principles, in practice these derivations often ignore a lot of details which still has to be justified using experiments.

Other ways that approximations can be justified:

Using a known-to-be-accurate but more expensive simulation technique to validate a newer less expensive technique.
Proving bounds on the error.
Comparing with known quantities and results. Plenty of experimental data is already on the internet, including things like the heating curve of water, which depends on the detailed interactions of a very large number of water molecules.

simulating complex nanomachines outright is exceedingly unlikely.

As you mentioned above, once you have a set of basic components in your toolbox that you understand well, the process of designing things becomes much easier. So you only really need the expensive physics simulations for designing your basic building blocks. After that, you can coarse-grain these blocks in the larger design you're building. When designing transistors, engineers have to worry about the geometry of the transistor and use detailed simulations of how charge carriers will flow in the semiconductor. In a circuit simulator like LTspice, that's all abstracted away.

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

DaemonicSigil1y91

This is really cool, thanks for posting it. I also would not have expected this result. In particular, the fact that the top right vector generalizes across mazes is surprising. (Even generalizing across mouse position but not maze configuration is a little surprising, but not as much.)

Since it helps to have multiple interpretations of the same data, here's an alternative one: The top right vector is modifying the neural network's perception of the world, not its values. Let's say the agent's training process has resulted in it valuing going up and to the right, and it also values reaching the cheese. Maybe its utility looks like x+y+10*[found cheese] (this is probably very over-simplified). In that case, the highest reachable x+y coordinate is important for deciding whether it should go to the top right, or if it should go directly to the cheese. Now if we consider how the top right vector was generated, the most obvious interpretation is that it should make the agent think there's a path all the way to the top right corner, since that's the difference between the two scenarios that were subtracted to produce it. So the agent concludes that the x+y part of its utility function is dominant, and proceeds to try and reach the top right corner.
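To make the perception-vs-values reading concrete, here is a minimal sketch of the over-simplified utility x+y+10*[found cheese]. The coordinates and the helper names (`utility`, `choose_goal`) are illustrative assumptions, not anything measured from the actual network.

```python
CHEESE_BONUS = 10  # the "10*[found cheese]" term from the toy utility


def utility(x, y, found_cheese):
    # Toy utility: value being up and to the right, plus a bonus for cheese.
    return x + y + CHEESE_BONUS * int(found_cheese)


def choose_goal(cheese_pos, believed_best_reachable):
    """Pick whichever endpoint the agent believes has higher utility.

    believed_best_reachable is the (x, y) cell with the highest x+y that the
    agent *thinks* it can reach; only this belief changes between scenarios.
    """
    u_cheese = utility(cheese_pos[0], cheese_pos[1], found_cheese=True)
    u_corner = utility(believed_best_reachable[0], believed_best_reachable[1],
                       found_cheese=False)
    return "cheese" if u_cheese >= u_corner else "top right"


cheese = (3, 4)  # hypothetical cheese location, utility 3 + 4 + 10 = 17

# Unmodified perception: the best reachable cell is only (6, 6), utility 12.
print(choose_goal(cheese, believed_best_reachable=(6, 6)))    # cheese

# With the vector added, the agent believes the corner (12, 12) is reachable.
print(choose_goal(cheese, believed_best_reachable=(12, 12)))  # top right
```

The utility function is identical in both calls; only the believed reachable corner differs, which is the point of this interpretation.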

Predictions:

1. Algebraic value editing works (for at least one "X vector") in LMs: 85%
  1. Most of the "no" probability comes from the attention mechanism breaking this in some hard-to-fix way. Some uncertainty comes from not knowing how much effort you'd put in to get around this. If you're going to stop after the first try, then put me down for 70% instead. I'm assuming here that an X-vector should generalize across inputs, in the same way that the top right vector generalizes across mazes and mouse-positions.
2. Algebraic value editing works better for larger models, all else equal: 55%
  1. Seems like the kind of thing that might be true, but I'm really not sure.
3. If value edits work well, they are also composable: 70%
  1. Yeah, seems pretty likely.
4. If value edits work at all, they are hard to make without substantially degrading capabilities: 50%
  1. I'm too uncertain about your qualitative judgement of what "substantial" and "capabilities" mean to give a meaningful probability here. Performance in terms of logprob almost certainly gets worse, not sure how much, and it might depend on the X-vector. Specific benchmarks and thresholds would help with making a concrete prediction here.
5. We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
  1. "truth-telling": 50%
    1. This one seems different from and harder than the others. I can imagine a vector that decreases the network's truth-telling, but it seems a little less likely that we could make the network more likely to tell the truth with a single vector. We could find vectors that make it less likely to write fiction, or describe conspiracy theories, and we could add them to get a vector that would do both, but I don't think this would translate to increased truth telling in other situations where it would normally not tell the truth for other reasons. This assumes that your test-cases for the truth vector go beyond the test cases you used to generate it, however.
  2. "love": 80%
  3. "accepting death": 80%
  4. "speaking French": 85%
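For readers who want the mechanics behind "X-vectors" and composability spelled out, here is a toy sketch: a vector is taken as the difference of activations between two scenarios, and composing two edits just means adding the vectors. All activation values below are invented for illustration; a real experiment would read them from a chosen layer of the network.

```python
def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


# Hypothetical activations at one layer for two pairs of contrasting inputs.
act_path_to_corner = [0.75, 0.125, 0.5]
act_no_path = [0.25, 0.125, 0.5]
act_cheese_present = [0.25, 0.875, 0.5]
act_cheese_absent = [0.25, 0.125, 0.5]

# Each "X-vector" is the difference between the two contrasting scenarios.
top_right_vector = subtract(act_path_to_corner, act_no_path)
cheese_vector = subtract(act_cheese_present, act_cheese_absent)

# Composability, in this toy picture, is plain vector addition.
composed = add(top_right_vector, cheese_vector)

# Applying the edit: add the composed vector to the activation at that layer.
steered = add(act_no_path, composed)
print(top_right_vector, cheese_vector, steered)
```

Whether edits compose this cleanly in a real language model is exactly what the 70% prediction above is about; the sketch only shows what "composable" would mean operationally.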

My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

DaemonicSigil1y10

Thanks for the link! Looks like they do put optimization effort into choosing the subspace, but it's still interesting that the training process can be factored into 2 pieces like that.

My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

DaemonicSigil1y11

Difficulty of Alignment

I find the prospect of training a model on just 40 parameters to be very interesting. Almost unbelievable, really, to the point where I'm tempted to say: "I notice that I'm confused". Unfortunately, I don't have access to the paper and it doesn't seem to be on sci-hub, so I haven't been able to resolve my confusion. Basically, my general intuition is that each parameter in a network probably only contributes a few bits of optimization power. It can be set fairly high, fairly low, or in between. So if you just pulled 40 random weights from the network, that's maybe 120 bits of optimization power (about 3 bits each). Which might be enough for MNIST, but probably not for anything more complicated. So I'm guessing that most likely a bunch of other optimization went into choosing exactly which 40-dimensional subspace we should be using. Of course, if we're allowed to do that then we could even do it with a 1-dimensional subspace: Just pick the training trajectory as your subspace!
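The 1-dimensional trick can be sketched on a toy problem. The code below first trains normally, then defines a line from the initial weights to the trained weights and "trains" a single scalar along it. The quadratic loss and its target are made-up stand-ins, and the straight line is a simplification of "the training trajectory", which in general is a curve.

```python
TARGET = (3.0, -2.0, 0.5)  # hypothetical optimum of a toy quadratic loss


def loss(theta):
    return sum((t - s) ** 2 for t, s in zip(theta, TARGET))


def grad(theta):
    return [2 * (t - s) for t, s in zip(theta, TARGET)]


# Step 1: ordinary full-dimensional gradient descent.
theta0 = [0.0, 0.0, 0.0]
theta = list(theta0)
for _ in range(200):
    theta = [t - 0.1 * g for t, g in zip(theta, grad(theta))]
theta_star = theta

# Step 2: use the result of step 1 to choose a 1-dimensional subspace.
direction = [b - a for a, b in zip(theta0, theta_star)]


def point_on_line(t):
    return [a + t * d for a, d in zip(theta0, direction)]


# Step 3: "train" only the scalar t, here with a simple grid search.
best_t = min((i / 100 for i in range(201)), key=lambda t: loss(point_on_line(t)))
print(best_t, loss(point_on_line(best_t)))
```

The one-parameter search succeeds only because all the real optimization was spent in step 1 to pick the line, which is the worry about how the 40-dimensional subspace was chosen.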

Generally with the mindspace thing, I don't really think about the absolute size or dimension of mindspace, but the relative size of "things we could build" and "things we could build that would have human values". This relative size is measured in bits. So the intuition here would be that it takes a lot of bits to specify human values, and so the difference in size between these two is really big. Now maybe if you're given Common Crawl, it takes fewer bits to point to human values within that big pile of information. But it's probably still a lot of bits, and then the question is how do you actually construct such a pointer?
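One way to write down "relative size measured in bits", assuming some measure μ over designs (the symbols B, B_H and μ are my notation for this sketch, not anything standard):

```latex
\[
  \text{bits needed} \;=\; \log_2 \frac{\mu(B)}{\mu(B_H)},
  \qquad B_H \subseteq B,
\]
% B   : things we could build
% B_H : things we could build that would have human values
```

Under this reading, every additional bit halves the fraction of buildable things that qualify, so "a lot of bits" means the target is an exponentially small slice of what we could build.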

Demons in Gradient Descent

I agree that demons are unlikely to be a problem, at least for basic gradient descent. They should have shown up by now in real training runs, otherwise. I do still think gradient descent is a very unpredictable process (or to put it more precisely: we still don't know how to predict gradient descent very well), and where that shows up is in generalization. We have a very poor sense of which things will generalize and which things will not generalize, IMO.

A challenge for AGI organizations, and a challenge for readers

DaemonicSigil2y20

On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but it has the following issues:

Practical considerations: AI systems currently tend to require lots of examples and it's expensive to get these if they all have to be provided by a human.
Some actions look good to a casual human observer, but are actually bad on closer inspection. The AI would be rewarded for finding and taking such actions.
If you're training a neural network, then there are generically going to be lots of adversarial examples for that network. As the AI gets more and more powerful, we'd expect it to be able to generate more and more situations where its learned value function gives a high reward but a human would give a low reward. So it seems like we end up playing a game of adversarial example whack-a-mole for a long time, where we're just patching hole after hole in this million-dimensional bucket with thousands of holes. Probably the AI manages to kill us before that process converges.
To make the above worse, there's this idea of a sharp left turn, where a sufficiently intelligent AI can think of very weird plans that go far outside of the distribution of scenarios that it was trained on. We expect generalization to get worse in this regime, and we also expect an increased frequency of adversarial examples. (What would help a lot here is designing the AI to have an interpretable planning system, where we could run these plans forward and negatively reinforce the bad ones. Maybe we'd also negatively reinforce all the weird ones, for corrigibility reasons, though we'd have to be careful about how that's formulated because we don't want the AI trying to kill us because it thinks we'd produce a weird future.)
Once the AI is modelling reality in detail, its reward function is going to focus on how the rewards are actually being piped to the AI, rather than the human evaluator's reaction, let alone some underlying notion of goodness. If the human evaluators just press a button to reward the AI for doing a good thing, the AI will want to take control of that button and stick a brick on top of it.

On training models to assist in human evaluation and point out flaws in AI outputs: Doing this is probably somewhat better than not doing it, but I'm pretty skeptical that it provides much value:

The AI can try and fool the critic just like it would fool humans. It doesn't even need a realistic world model for this, since using the critic to inform the training labels leaks information about the critic to the AI.
It's therefore very important that the critic model generates all the strong and relevant criticisms of a particular AI output. Otherwise the AI could just route around the critic.
On some kinds of task, you'll have an objective source of truth you can train your model on. The value of an objective source of truth is that we can use it to generate a list of all the criticisms the model should have made. This is important because we can update the weights of the critic model based on any criticisms it failed to make. On other kinds of task, which are the ones we're primarily interested in, it will be very hard or impossible to get the ground truth list of criticisms. So we won't be able to update the weights of the model that way when training. So in some sense, we're trying to generalize this idea of "a strong and relevant criticism" between these different tasks of differing levels of difficulty.
This requirement of generating all criticisms seems very similar to the task of getting a generative model to cover all modes. I guess we've pretty much licked mode collapse by now, but "don't collapse everything down to a single mode" and "make sure you've got good coverage of every single mode in existence" are different problems, and I think the second one is much harder.

On using AI systems, in particular large language models, to advance alignment research: This is not going to work.

LLMs are super impressive at generating text that is locally coherent for a much broader definition of "local" than was previously possible. They are also really impressive as a compressed version of humanity's knowledge. They're still known to be bad at math, at sticking to a coherent idea and at long chains of reasoning in general. These things all seem important for advancing AI alignment research. I don't see how the current models could have much to offer here. If the thing is advancing alignment research by writing out text that contains valuable new alignment insights, then it's already pretty much a human-level intelligence. We talk about AlphaTensor doing math research, but even AlphaTensor didn't have to type up the paper at the end!
What could happen is that the model writes out a bunch of alignment-themed babble, and that inspires a human researcher into having an idea, but I don't think that provides much acceleration. People also get inspired while going on a walk or taking a shower.
Maybe something that would work a bit better is to try training a reinforcement-learning agent that lives in a world where it has to solve the alignment problem in order to achieve its goals. E.g. in the simulated world, your learner is embodied in a big robot, and there's a door in the environment it can't fit through, but it can program a little robot to go through the door and perform some tasks for it. And there's enough hidden information and complexity behind the door that the little robot needs to have some built-in reasoning capability. There are a lot of challenges here, though. Like how do you come up with a programming environment that's simple enough that the AI can figure out how to use it, while still being complex enough that the little robot can do some non-trivial reasoning, and that the AI has a chance of discovering a new alignment technique? Could be it's not possible at all until the AI is quite close to human-level.

Counterarguments to the basic AI x-risk case

DaemonicSigil2y1212

I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.

This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We'd ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we'd want them to.
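Here is a toy sketch of that failure mode, under heavy simplifying assumptions: a one-dimensional "action", a made-up true reward, and a linear critic fitted only on a narrow range of actions. Once the critic is frozen, an actor doing gradient ascent on the critic's score keeps climbing long after the true reward has collapsed.

```python
def true_reward(x):
    # Hypothetical ground truth: the best action is x = 1, larger x gets worse.
    return x * (2.0 - x)


def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The critic only ever sees labelled actions in [0, 0.5], then labels stop.
train_xs = [i / 20 for i in range(11)]
train_ys = [true_reward(x) for x in train_xs]
slope, intercept = fit_linear(train_xs, train_ys)


def critic(x):
    return slope * x + intercept


# The actor keeps training against the frozen critic (gradient ascent on x).
x_start = 0.25
x = x_start
for _ in range(100):
    x += 0.1 * slope  # d(critic)/dx is just the slope for a linear critic

print(x, critic(x), true_reward(x))
```

Inside the training range the critic is a decent guide, but the actor's optimum under the critic lies far outside it, where critic score and true reward disagree badly.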

Threat-Resistant Bargaining Megapost: Introducing the ROSE Value

DaemonicSigil2y40

7: Did I forget some important question that someone will ask in the comments?

Yes!

Is there a way to deal with the issue of there being multiple ROSE points in some games? If Alice says "I think we should pick ROSE point A" and Bob says "I think we should pick ROSE point B", then you've still got a bargaining game left to resolve, right?

Anyways, this is an awesome post, thanks for writing it up!
