Redwood Research used to have a project about trying to prevent a model from outputting text where a human got hurt, which, IIRC, they did primarily through fine-tuning and adversarial training. (Followup). It would be interesting to see if one could achieve better results than they did at the time by subtracting some sort of hurt/violence vector.
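
To make "subtracting a hurt/violence vector" concrete, here is a toy sketch, not Redwood's method. It assumes the vector is the difference between mean activations on violent and neutral prompts, and that it is subtracted from a hidden state at inference time. The activations are made-up 2-d lists and every function name is hypothetical.

```python
from typing import List

Vector = List[float]

def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(violent_acts: List[Vector], neutral_acts: List[Vector]) -> Vector:
    """Assumed definition: mean violent activation minus mean neutral activation."""
    return [v - n for v, n in zip(mean(violent_acts), mean(neutral_acts))]

def subtract_direction(hidden: Vector, direction: Vector, coeff: float) -> Vector:
    """Shift a hidden state away from the direction by coeff * direction."""
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Made-up activations standing in for a real model's hidden states.
violent = [[2.0, 0.0], [4.0, 2.0]]
neutral = [[1.0, 1.0], [1.0, 1.0]]

direction = steering_vector(violent, neutral)
steered = subtract_direction([5.0, 5.0], direction, coeff=1.0)
```

In a real model the subtraction would happen inside the forward pass at a chosen layer, and the coefficient would need tuning.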

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text:

In Table 3, they show that in some cases task vectors can improve fine-tuned models.
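
For readers unfamiliar with the idea, a minimal sketch of negating a task vector. It assumes "negative vectors" means taking the weight difference between a model fine-tuned on the unwanted behavior and the pretrained model, then subtracting a scaled copy of that difference from the pretrained weights. The dict-of-lists weights, the function names, and the coefficient `lam` are illustrative stand-ins, not the paper's code.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]

def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Task vector: fine-tuned weights minus pretrained weights, per parameter."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }

def apply_task_vector(pretrained: Weights, vector: Weights, lam: float) -> Weights:
    """Add lam * vector to the pretrained weights; a negative lam negates the task."""
    return {
        name: [p + lam * v for p, v in zip(pretrained[name], vector[name])]
        for name in pretrained
    }

# Toy checkpoints (hypothetical values, a single parameter tensor).
pretrained = {"layer.w": [1.0, 2.0, 3.0]}
toxic_finetuned = {"layer.w": [1.5, 2.0, 2.0]}

tau = task_vector(pretrained, toxic_finetuned)
detoxified = apply_task_vector(pretrained, tau, lam=-1.0)
```

The contrast with fine-tuning is that no gradient steps are taken on the final model: the edit is pure weight arithmetic.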

This response is enraging.

Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex.

I have heard repeated claims that people don't engage with the alignment community's ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does ...

Matthew "Vaniver" Gray (1y):
I have attempted to respond to the whole post over here.

Choosing to engage with an unscripted, unrehearsed, off-the-cuff podcast intended to introduce ideas to a lay audience continues to be a surprising concept to me. To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.

Like things, simulacra are probabilistically generated by the laws of physics (the simulator), but have properties that are arbitrary with respect to it, contingent on the initial prompt and random sampling (splitting of the timeline).

What do the smarter simulacra think about the physics in which they find themselves? If one was very smart, could they look at the probabilities of the next token and wonder why some tokens get picked over others? Would they then wonder about how the "waveform collapse" happens and what it means?

It's not even necessary for simulacra to be able to "see" next token probabilities for them to wonder about these things, just as we can wonder about this in our world without ever being able to see anything other than measurement outcomes. It happens that simulating things that reflect on simulated physics is my hobby. Here's an excerpt from an alternate branch of HPMOR I generated:

As to the question of whether a smart enough simulacrum would be able to see token probabilities, I'm not sure. Output probabilities aren't further processed by the network, but intermediate predictions, such as those revealed by the logit lens, are.
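
For anyone who hasn't seen the logit lens: a rough sketch, assuming it amounts to projecting each layer's hidden state through the model's unembedding matrix and reading the result as a next-token distribution. The 2-d hidden states and 3-token vocabulary below are invented, the final layer norm that real models apply is omitted, and the function names are hypothetical.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden: List[float], unembed: List[List[float]]) -> List[float]:
    """Project one hidden state onto each token's unembedding row, then softmax."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

# Invented unembedding: one row per token in a 3-token vocabulary.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Invented hidden states for one position after each of three layers.
layer_states = [[1.0, -1.0], [-1.0, 2.0], [2.0, 3.0]]

per_layer = [logit_lens(h, unembed) for h in layer_states]
top_tokens = [max(range(len(p)), key=p.__getitem__) for p in per_layer]
```

Here `top_tokens` traces how the "intermediate prediction" shifts from layer to layer. Later layers do read the hidden states these predictions are computed from, whereas nothing reads the final output distribution.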

Given that there's a lot of variation in how humans extrapolate values, whose extrapolation process do you intend to use?

Stuart Armstrong (2y):
We're aiming to solve the problem in a way that is acceptable to one given human, and then generalise from that.