iceman - AI Alignment Forum

Steering GPT-2-XL by adding an activation vector

Redwood Research used to have a project about trying to prevent a model from outputting text where a human got hurt, which, IIRC, they did primarily through fine-tuning and adversarial training. (Followup). It would be interesting to see if one could achieve better results than they did at the time by subtracting some sort of hurt/violence vector.
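To make the "subtract a vector" idea concrete, here is a toy sketch. It is not Redwood's setup or the post's exact procedure: the activations are made-up lists of floats rather than real GPT-2-XL residual-stream values, and the helper names (`steering_vector`, `subtract_vector`) are hypothetical. It also assumes the vector is a difference of mean activations between "violent" and "neutral" prompts, which is one plausible way to build it.

```python
# Toy illustration only: the "activations" are invented numbers,
# standing in for hidden states read out of one layer of a model.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(concept_acts, baseline_acts):
    """Assumed recipe: mean(concept activations) - mean(baseline activations)."""
    concept_mean = mean_vector(concept_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]

def subtract_vector(activation, vector, coeff=1.0):
    """Steer away from the concept by subtracting coeff * vector."""
    return [a - coeff * v for a, v in zip(activation, vector)]

# Hypothetical activations for prompts with and without violence.
violent_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

violence_vec = steering_vector(violent_acts, neutral_acts)

# Apply to a new activation during a (pretend) forward pass.
steered = subtract_vector([3.0, 1.0, 1.0], violence_vec, coeff=1.0)
```

In a real model the subtraction would happen inside the forward pass at a chosen layer, and the coefficient would need tuning; whether this beats fine-tuning plus adversarial training is exactly the open question.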

My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

iceman2y534

This response is enraging.

Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex.

I have heard repeated claims that people don't engage with the alignment community's ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does not cause people to believe there's no reason to engage with your ideas because you will brush them off. Yes, nutpicking e/accs on Twitter is much easier and probably more hedonic, but they're not convincible and Quintin here is.

Simulacra are Things

iceman2y20

Like things, simulacra are probabilistically generated by the laws of physics (the simulator), but have properties that are arbitrary with respect to it, contingent on the initial prompt and random sampling (splitting of the timeline).

What do the smarter simulacra think about the physics in which they find themselves? If one was very smart, could they look at the probabilities of the next token, and wonder why some tokens get picked over others? Would they then wonder about how the "waveform collapse" happens and what it means?
