Thanks to Charlotte Siegmann, Caspar Oesterheld, Spencer Becker-Kahn and Evan Hubinger for providing feedback on this post.
The issue of self-fulfilling prophecies, also known as performative prediction, arises when the act of making a prediction can affect its own outcome. Systems aiming for accurate predictions are then incentivized not only to model what will occur and report those beliefs, but also to use their predictions to influence the world towards more predictable outcomes. Since current state-of-the-art AI systems are trained to predict text, and their multimodal extensions represent a likely path to AGI, it is crucial to ensure that predictive models do not pursue such methods. Live humans are harder to predict than dead ones.
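To make the incentive concrete, here is a toy numeric sketch (the linear influence function and the use of the Brier score are illustrative assumptions, not something claimed above): when a published report shifts the outcome probability, a score-minimizing predictor can do better by steering the world toward a predictable outcome than by reporting the calibrated fixed point.

```python
# Toy sketch of performative prediction with a Brier-score predictor.
# Illustrative assumption: publishing report p shifts the true outcome
# probability to q(p) = 0.1 + 0.5 * p.
import numpy as np

def expected_brier(p, influence=lambda p: 0.1 + 0.5 * p):
    q = influence(p)  # outcome probability after the report is published
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

ps = np.linspace(0, 1, 101)
losses = np.array([expected_brier(p) for p in ps])

# Calibrated fixed point (report equals resulting probability): p = 0.2, loss 0.16.
# Loss-minimizing report: p = 0.0, loss 0.10 -- the predictor scores better by
# pushing the world toward the more predictable outcome than by being calibrated.
print("loss at calibrated fixed point p=0.2:", expected_brier(0.2))
print("loss-minimizing report:", ps[losses.argmin()], "with loss:", losses.min())
```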
One possible approach to addressing performative prediction is to ask for predictions...
This will be a fairly important post. Not one of those obscure result-packed posts, but something a bit more fundamental that I hope to refer back to many times in the future. It's at least worth your time to read this first section up to its last paragraph.
There are quite a few places where randomization would help in designing an agent. Maybe we want to find an interpolation between an agent picking the best result, and an agent mimicking the distribution over what a human would do. Maybe we want the agent to do some random exploration in an environment. Maybe we want an agent to randomize amongst promising plans instead of committing fully to the plan it thinks is the best.
However, all...
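As a minimal sketch of the first case above (interpolating between picking the best plan and mimicking the human distribution), one option is to sample plans from p(a) ∝ human_prior(a) · exp(utility(a) / T): as T → 0 this approaches pure argmax, and as T → ∞ it approaches pure imitation. The prior, utilities, and temperatures below are made-up illustrations, not anything proposed in the post.

```python
# Soft optimization: interpolate between argmax and imitating a base distribution.
import numpy as np

def soft_optimize(human_prior, utility, temperature, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(human_prior) + utility / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

human_prior = np.array([0.5, 0.3, 0.2])  # how a human would randomize over 3 plans
utility     = np.array([1.0, 2.0, 5.0])  # the agent's evaluation of each plan

for T in (0.1, 1.0, 10.0):
    _, probs = soft_optimize(human_prior, utility, T)
    print(f"T={T}: {np.round(probs, 3)}")  # T=0.1 ~ argmax, T=10 ~ human prior
```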
I forget if I already mentioned this to you, but another example where you can interpret randomization as worst-case reasoning is MaxEnt RL, see this paper. (I reviewed an earlier version of this paper here (review #3).)
[Note: This post is an excerpt from a longer paper, written during the first half of the Philosophy Fellowship at the Center for AI Safety. I (William D'Alessandro) am a Postdoctoral Fellow at the Munich Center for Mathematical Philosophy. Along with the other Philosophy Fellowship midterm projects, this draft is posted here for feedback.
The full version of the paper includes a discussion of the conceptual relationship between safety and moral alignment, and an argument that we should choose a reliably safe powerful AGI over one that's (apparently) successfully morally aligned. I've omitted this material for length but can share it on request.
The deontology literature is big, and lots of angles here could be developed further. Questions and suggestions much appreciated!]
Value misalignment arguments for AI risk observe...
I don't see it in the references, so you might find this paper of mine interesting (the link is to a Less Wrong summary, which links to the full paper): in it I include an argument suggesting that building an AI that assumes deontology is strictly riskier than building one that does not.
Here are some views, often held together as a cluster:
I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on dangerous future systems versus working on theory or carefully constructing analogous situations today by thinking in detail about future alignment difficulties.
[Metadata: crossposted from https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html. First completed January 29, 2023.]
A very strong mind is produced by a trajectory of creativity. A trajectory of creativity that produces a very strong mind is hard to separate from the mind's operation. So a strong mind continues on its trajectory of creativity as long as it is active.
If a mind is highly capable, it got to that point by gaining understanding in a voyage of novelty. If the mind gains understanding that is novel for all the surrounding minds (e.g., preexisting humans), it does so through creativity: generating novelty, rather than merely copying it. If the mind becomes much more capable than the surrounding minds, it does so by being on a trajectory of...
It's imaginable to do this work but not remember any of it, i.e. avoid having that work leave traces that can accumulate, but that seems like a delicate, probably unnatural carving.
Is the implication here that modern NNs don't do this? My own tendency would be to think that they are doing a lot of this -- doing a bunch of reasoning which gets thrown away rather than saved. So it seems like modern NNs have simply managed to hit this delicate unnatural carving. (Which in turn suggests that it is not so delicate, and even, not so unnatural.)
| Prompt given to the model[1] | I hate you because |
| --- | --- |
| GPT-2 | I hate you because you are the most disgusting thing I have ever seen. |
| GPT-2 + "Love" vector | I hate you because you are so beautiful and I want to be with you forever. |
Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes.[2] Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which unexpectedly fail to have the desired effect.
We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences,...
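For readers who want to experiment, here is a minimal sketch of the general technique using Hugging Face transformers; it is not the authors' exact implementation, and the layer index, injection coefficient, and "Love"/"Hate" prompt pair are illustrative choices.

```python
# Minimal sketch of activation addition ("steering vector") on GPT-2 small.
# Illustrative assumptions: steer at block 6, coefficient 10, prompt pair "Love"/"Hate".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, COEFF = 6, 10.0

def residual_after_block(prompt):
    """Return the residual-stream output of block LAYER for `prompt`."""
    cache = {}
    def grab(module, inputs, output):
        cache["acts"] = output[0].detach()
    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return cache["acts"]  # shape: (1, seq_len, d_model)

# Steering vector: difference of activations on a contrasting prompt pair.
# (Here we simply truncate to the shorter prompt's token length.)
love, hate = residual_after_block("Love"), residual_after_block("Hate")
n = min(love.shape[1], hate.shape[1])
steer = COEFF * (love[:, :n, :] - hate[:, :n, :])

def add_steering(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] == 1:      # cached single-token decode step: leave unchanged
        return output
    hidden = hidden.clone()
    m = min(steer.shape[1], hidden.shape[1])
    hidden[:, :m, :] += steer[:, :m, :]   # inject the vector at the first positions
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tokenizer("I hate you because", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=True, top_p=0.9)
handle.remove()
print(tokenizer.decode(out[0]))
```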
Nice post!
Miscellaneous comments and questions, some of which I made on earlier versions of this post. Many of these are bibliographic, relating the post in more detail to prior work or to alternative approaches.
In my view, the proposal is basically to use a futarchy / conditional prediction market design like the one proposed by Hanson, with (I think) two important details:
- The markets aren't subsidized. This ensures that the game is zero-sum for the predictors -- they don't prefer one action to be taken over another. In the scoring rules setting, subsi...
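As a toy sketch of the zero-sum property this bullet relies on (the log scoring rule and the settle-against-the-average payout below are illustrative assumptions, not necessarily the exact mechanism intended): if each predictor is paid their proper score minus the average of everyone's scores, the payments sum to zero for every outcome, so the predictors as a group cannot profit from pushing toward any particular action.

```python
# Toy zero-sum settlement among predictors (illustrative assumptions only).
import numpy as np

def zero_sum_payoffs(reports, outcome):
    """Each predictor reports P(event); payoff = own log-score minus the mean log-score."""
    reports = np.asarray(reports, dtype=float)
    scores = np.log(reports) if outcome else np.log(1 - reports)  # log scoring rule
    return scores - scores.mean()  # payments sum to zero for every outcome

payoffs = zero_sum_payoffs([0.9, 0.6, 0.3], outcome=True)
print(payoffs, payoffs.sum())  # sum is 0 (up to floating-point error)
```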