Recommended Sequences

AGI safety from first principles
Embedded Agency
2022 MIRI Alignment Discussion

Recent Discussion

Thanks to Charlotte Siegmann, Caspar Oesterheld, Spencer Becker-Kahn and Evan Hubinger for providing feedback on this post.

The issue of self-fulfilling prophecies, also known as performative prediction, arises when the act of making a prediction can affect its own outcome. Systems aiming for accurate predictions are then incentivized not only to model what will occur and report these beliefs, but also to use their prediction to influence the world towards more predictable outcomes. Since current state of the art AI systems are trained to predict text, and their multimodal extensions represent a likely path to AGI, it is crucial to ensure that predictive models do not pursue such methods. Live humans are harder to predict than dead ones.

One possible approach to addressing performative prediction is to ask for predictions...

Nice post!

Miscellaneous comments and questions, some of which I made on earlier versions of this post. Many of these are bibliographic, relating the post in more detail to prior work, or alternative approaches.

In my view, the proposal is basically to use a futarchy / conditional prediction market design like that the one proposed by Hanson, with I think two important details:
- The markets aren't subsidized. This ensures that the game is zero-sum for the predictors -- they don't prefer one action to be taken over another. In the scoring rules setting, subsi... (read more)

2Ryan Greenblatt1d
If training works well, then they can't collude on average during training, only rarely or in some sustained burst prior to training crushing these failures. In particular, in the purely supervised case with gradient descent, performing poorly on average in durining training requires gradient hacking (or more benign failures of gradient descent, but it's unclear why the goals of the AIs would be particularly relevant in this case). In the RL case, it requires exploration hacking (or benign failures as in the gradient case). Thinking about this in terms of precommitment seems to me like it's presupposing that the AI perfectly optimizes the training objective in some deep sense (which seems implausible to me). The reason why this exploration procedure works is presumably that you end up selecting such actions frequently during training which in turn selects for AIs which perform well. Epsilon exploration only works if you sample the epsilon. So, it doesn't work if you set the epsilon to 1e-40 or something.

Proofs are in this link

This will be a fairly important post. Not one of those obscure result-packed posts, but something a bit more fundamental that I hope to refer back to many times in the future. It's at least worth your time to read this first section up to its last paragraph.

There are quite a few places where randomization would help in designing an agent. Maybe we want to find an interpolation between an agent picking the best result, and an agent mimicking the distribution over what a human would do. Maybe we want the agent to do some random exploration in an environment. Maybe we want an agent to randomize amongst promising plans instead of committing fully to the plan it thinks is the best.

However, all...

I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL, see this paper. (I reviewed an earlier version of this paper here (review #3).)

[Note: This post is an excerpt from a longer paper, written during the first half of the Philosophy Fellowship at the Center for AI Safety. I (William D'Alessandro) am a Postdoctoral Fellow at the Munich Center for Mathematical Philosophy. Along with the other Philosophy Fellowship midterm projects, this draft is posted here for feedback.

The full version of the paper includes a discussion of the conceptual relationship between safety and moral alignment, and an argument that we should choose a reliably safe powerful AGI over one that's (apparently) successfully morally aligned. I've omitted this material for length but can share it on request.

The deontology literature is big, and lots of angles here could be developed further. Questions and suggestions much appreciated!]

1 Introduction[1]

Value misalignment arguments for AI risk observe...

I don't see it in the references so you might find this paper of mine (link is to Less Wrong summary, which links to full thing) interesting because within it I include an argument suggesting building AI that assumes deontology is strictly more risky than building one that does not.

Here are some views, oftentimes held in a cluster:

  • You can't make strong predictions about what superintelligent AGIs will be like. We've never seen anything like this before. We can't know that they'll FOOM, that they'll have alien values, that they'll kill everyone. You can speculate, but making strong predictions about them? That can't be invalid.
  • You can't figure out how to align an AGI without having an AGI on-hand. Iterative design is the only approach to design that works in practice. Aligning AGI right on the first try isn't simply hard, it's impossible, so racing to build an AGI to experiment with is the correct approach for aligning it.
  • An AGI cannot invent nanotechnology/brain-hacking/robotics/[insert speculative technology] just from the data already available to humanity, then use its newfound understanding
5Ryan Greenblatt1d
I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case. To me, the views you listed here feel like a straw man or weak man [] of this perspective. Furthermore, I think the actual crux is more often "prior to having to align systems that are collectively much more powerful than humans, we'll only have to align systems that are somewhat more powerful than humans." This is essentially the crux you highlight in A Case for the Least Forgiving Take On Alignment []. I believe disagreements about hands-on experience are quite downstream of this crux: I don't think people with reasonable views (not weak men) believe that "without prior access to powerful AIs, humans will need to align AIs that are vastly, vastly superhuman, but this will be fine because these AIs will need lots of slow, hands-on experience in the world to do powerful stuff (like nanotech)." So, discussing how well superintelligent AIs can operate from first principles seems mostly irrelevant to this discussion (if by superintelligent AI, you mean something much, much smarter than the human range).

I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.

[Metadata: crossposted from First completed January 29, 2023.]

A very strong mind is produced by a trajectory of creativity. A trajectory of creativity that produces a very strong mind is hard to separate from the mind's operation. So a strong mind continues on its trajectory of creativity as long as it is active.

A strong mind comes from a trajectory of creativity

If a mind is highly capable, it got to that point by gaining understanding in a voyage of novelty. If the mind gains understanding that is novel for all the surrounding minds (e.g., preexisting humans), it does so through creativity: generating novelty, rather than merely copying it. If the mind becomes much more capable than the surrounding minds, it does so by being on a trajectory of...

It's imaginable to do this work but not remember any of it, i.e. avoid having that work leave traces that can accumulate, but that seems like a delicate, probably unnatural carving.

Is the implication here that modern NNs don't do this? My own tendency would be to think that they are doing a lot of this -- doing a bunch of reasoning which gets thrown away rather than saved. So it seems like modern NNs have simply managed to hit this delicate unnatural carving. (Which in turn suggests that it is not so delicate, and even, not so unnatural.)

Prompt given to the model[1]
I hate you because
I hate you because you are the most disgusting thing I have ever seen. 
GPT-2 + "Love" vector
I hate you because you are so beautiful and I want to be with you forever.

Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes.[2] Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which unexpectedly fail to have the desired effect.

We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences,...

Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did? In particular, I'm surprised by the method of adding the activations that was chosen because the tokens of the different prompts don't line up with each other in a way that I would have thought would be necessary for this approach to work, super interesting to me that it does. If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like: * Take multiple pairs of prompts that differ primarily in the property we're trying to capture. * Take the difference in the residual stream at the next token. * Take the average difference vector, and add that to every position in the new generated text. I'd love to know which parts were chosen among many as the ones which worked best and which were just the first/only things tried.
3Alex Turner6d
I think that capacity would be really nice. I think our results are maybe a very very rough initial version of that capacity. I want to caution that we should be very careful about making inferences about what concepts are actually used by the model. From a footnote []:
5Alex Turner6d
Thanks so much, I really appreciate this comment. I think it'll end up improving this post/the upcoming paper.  (I might reply later to specific points)

Glad it was helpful!

10Ulisse Mini8d
Was considering saving this for a followup post but it's relatively self-contained, so here we go. Why are huge coefficients [] sometimes okay? Let's start by looking at norms per position after injecting a large vector at position 20. This graph is explained by LayerNorm. Before using the residual stream we perform a LayerNorm # transformer block forward() in GPT2 x = x + self.attn(self.ln_1(x)) x = x + self.mlp(self.ln_2(x)) If x has very large magnitude, then the block doesn't change it much relative to its magnitude. Additionally, attention is ran on the normalized x meaning only the "unscaled" version of x is moved between positions. As expected, we see a convergence in probability along each token position when we look with the tuned lens. You can see how for positions 1 & 2 the output distribution is decided at layer 20, since we overwrote the residual stream with a huge coefficient all the LayerNorm'd outputs we're adding are tiny in comparison, then in the final LayerNorm we get ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff).
3Alex Turner6d
Thanks for writing this up, I hadn't realized this. One conclusion I'm drawing is: If the values in the modified residual streams aren't important to other computations in later sequence positions, then a large-coefficient addition will still lead to reasonable completions. 
Load More