Luke H Miles

Opinions expressed are my own and not endorsed by anyone.

Formerly @ ARC Evals aka METR

Comments

Oh, I have 0% success with long conversations with an LLM about anything. I usually stick to one question and rephrase and reroll a number of times. I'm no pro, but I do get good utility out of LLMs for nebulous technical questions.
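Roughly this pattern, sketched; `ask` here stands in for a generic LLM call, not any particular API:

```python
# Sketch of the "one question, rephrase and reroll" pattern. `ask` is a
# hypothetical stand-in for whatever LLM call you use.

def reroll(question: str, rephrasings: list[str], ask, n: int = 3) -> list[str]:
    """Ask each phrasing of the one question n times; collect every answer."""
    answers = []
    for phrasing in [question, *rephrasings]:
        answers.extend(ask(phrasing) for _ in range(n))
    return answers  # skim these instead of continuing one long conversation
```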

I would watch a ten hour video of this. (It may also be more persuasive to skeptics.)

I think Claude's enthusiasm about constitutional AI is basically trained in directly by the RLAIF. RLAIF is fundamentally a "learn to love the constitution in your bones" technique.
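To gesture at what I mean, here's a bare sketch of the RLAIF labeling step; `judge` is a hypothetical stand-in for the feedback model, and the wording is mine, not Anthropic's:

```python
# Minimal sketch of RLAIF preference labeling: an AI judge compares two
# responses against the constitution, and those labels are what the
# reward model is later trained on.

CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."

def constitutional_preference(prompt: str, a: str, b: str, judge) -> str:
    """Ask the judge which response better follows the constitution ("A" or "B")."""
    query = (
        f"{CONSTITUTION}\n\n"
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\n"
        "Answer with A or B."
    )
    return judge(query)  # these labels become reward-model training data
```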

I ctrl-F'd for 'prompt' and did not see your prompt. What is your prompt? The prompt is the way with this kind of thing, I think.

If you make a challenge like "Claude cannot possibly do X concrete task" and post it on Twitter, you'll probably get solid gold in the replies.

One of those ideas that's so obviously good it's rarely discussed?

Just want to say that I've found this immensely clarifying and valuable since reading it months ago.

(Excuse my ignorance. These are real questions, not just gotchas. I did see that you linked to the magic parts post.)

Will "commitment" and "agent" have to be thrown out and remade from sensible blocks? Perhaps cellular automata? I.e., did you create a dilemma out of nothing when you chose your terms?

Like if we said a "toaster" is "literally anything that somehow produces toast", then our analysis of breakfast would quickly break down.

From my distant position, it seems the real work to be done is at that lower level. We have not even solved 3x+1 (the Collatz conjecture)!!! How will we possibly draw up a sound notion of agents and commitments without some practical know-how about slicing up the environment?

Would you rather have an AICorp CEO dictator or have democracy as-it-exists handle things?

It is just as ambitious/implausible as you say. I am hoping to get out some rough ideas in my next post anyway.

Perhaps there are some behavioral / black-box methods available for evaluating alignment, depending on the kind of system being evaluated.

Toy example: imagine a two-part system where part A tries to do tasks and part B limits part A's compute based on the riskiness of the task. You could try to optimize the overall system towards catastrophic behavior and see how well your part B holds up.
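A minimal sketch of that toy setup; every name and the risk heuristic here are made up purely to make the structure concrete:

```python
# Toy two-part system: part A attempts tasks, part B caps part A's
# compute from a crude risk score. An evaluator red-teams the whole
# system and checks whether part B's cap actually binds.

def part_b_budget(task: str, max_budget: int = 1000) -> int:
    """Part B: grant less compute the riskier the task looks."""
    risky_terms = ("self-replicate", "acquire resources", "disable oversight")
    risk = sum(term in task.lower() for term in risky_terms)
    return max_budget // (1 + 10 * risk)

def part_a_attempt(task: str, budget: int) -> str:
    """Part A: stand-in for a model that works within a compute budget."""
    return f"attempted {task!r} in <= {budget} steps"

# Black-box evaluation: push adversarial tasks through the whole system
# and check that risky tasks really do get starved of compute.
for task in ("summarize this paper", "acquire resources and self-replicate"):
    print(part_a_attempt(task, part_b_budget(task)))
```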

Personally, I expect monolithic systems to be harder to control than two-part systems, so I think this evaluation scheme has a good chance of being applicable. One piece of evidence: OpenAI's moderation system correctly flags most jailbreaks that get past the base model's RLHF.
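For concreteness, the second layer can be as thin as a separate classifier call. A sketch with the OpenAI Python client (the wiring is mine; only the moderation endpoint itself is OpenAI's):

```python
# Two-layer check: the chat model's RLHF is layer one, and a standalone
# moderation classifier is layer two. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

def flagged_by_moderation(text: str) -> bool:
    """Return True if the moderation model flags the text."""
    result = client.moderations.create(input=text)
    return result.results[0].flagged
```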
