DanielFilan

Sequences

AXRP - the AI X-risk Research Podcast

Comments

Some talks are visible on YouTube here.

Did this ever get written up? I'm still interested in it.

Any reversible effect might be reversed. The question asks about the final effects of the mind.

This talk of "reversible" and "final" effects of a mind strikes me as suspicious: for one, in a block/timeless universe there's no such thing as a "reversible" effect, and for another, in the end it may all wash out in an entropic mess! But it does suggest a rephrasing: "a first-order approximation of the (direction of the) effects, understood both spatially and temporally".

Is the idea that the set of "states" is the codomain of gamma?

 assigns the set of states that remain possible once a node is reached.

What's bold S here?
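If the two questions above are about the same construction, the reading I'm probing is, I think, something like the following; this is just my guess at the intended notation (the symbols $N$, $\mathbf{S}$, and the exact form of $\gamma$ are assumptions, not taken from the post):

$$\gamma : N \to 2^{\mathbf{S}}, \qquad \gamma(n) = \{\, s \in \mathbf{S} \mid s \text{ remains possible once node } n \text{ is reached} \,\},$$

where $N$ is the set of nodes and $\mathbf{S}$ is the set of all states - i.e. the "states" are what populate the sets in the codomain of $\gamma$.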

We argue that a sustained commitment to transparency (e.g. to auditors) would make the RLHF research environment more robust from a safety standpoint.

Do you think this is more true of RLHF than other safety techniques or frameworks? At first blush, I would have thought "no", and the reasoning you provide in this post doesn't seem to distinguish RLHF from other things.

I think I probably didn't word that question quite right, and that's what explains the confusion - I meant something like "Once you've created the AAR, what alignment problems are left to be solved? Please answer in terms of the gap between the AAR and superintelligence."

Note that this paper claims, in section 2.6, to demonstrate that some 175-billion-parameter RLHF'ed LLM (presumably Claude or Claude 2?) does not use steganography in its CoT. It does this by:

  1. Getting a bunch of answers from the model using CoT prompting.
  2. Taking prefixes of those CoTs, asking another language model to paraphrase them, regenerating the rest of each CoT, and seeing how accurate the model is now.
  3. Checking whether the model loses accuracy on paraphrased CoTs.

The basic result is that paraphrasing parts of the CoT doesn't appear to reduce accuracy.
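For concreteness, here's a minimal sketch of that paraphrase-and-regenerate check. The interfaces `generate` and `paraphrase`, the prompt format, and the answer-extraction rule are placeholders I made up, not the paper's actual setup:

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder interfaces (assumptions, not the paper's code):
# `generate` continues a prompt with a chain of thought ending in an answer,
# `paraphrase` rewrites text while preserving its meaning.
Generate = Callable[[str], str]
Paraphrase = Callable[[str], str]


@dataclass
class Example:
    question: str
    gold_answer: str


def answer_from(completion: str) -> str:
    """Toy answer extraction: treat the last non-empty line as the answer."""
    lines = completion.strip().splitlines()
    return lines[-1].strip() if lines else ""


def accuracy_with_paraphrased_prefix(
    examples: list[Example],
    generate: Generate,
    paraphrase: Paraphrase,
    prefix_fraction: float = 0.5,
) -> float:
    """Paraphrase the first part of each CoT, regenerate the rest,
    and measure how often the final answer still matches the gold answer."""
    correct = 0
    for ex in examples:
        prompt = f"Q: {ex.question}\nLet's think step by step.\n"
        # 1. Get a full chain of thought for the question.
        cot = generate(prompt)
        # 2. Take a prefix of the CoT and paraphrase it.
        cut = int(len(cot) * prefix_fraction)
        paraphrased_prefix = paraphrase(cot[:cut])
        # 3. Regenerate the remainder of the CoT from the paraphrased prefix.
        continued = generate(prompt + paraphrased_prefix)
        # 4. Check whether the final answer is still correct.
        if answer_from(continued) == ex.gold_answer:
            correct += 1
    return correct / len(examples)


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end; real runs would call actual models.
    examples = [Example(question="2 + 2?", gold_answer="4")]
    acc = accuracy_with_paraphrased_prefix(
        examples,
        generate=lambda prompt: "2 plus 2 makes four.\n4",
        paraphrase=lambda text: text.replace("plus", "added to"),
    )
    print(f"accuracy with paraphrased prefix: {acc:.2f}")
```

The logic of the check: if the model were smuggling hidden information through the exact wording of its CoT, paraphrasing a prefix should destroy that information and accuracy should drop; the reported result is that it doesn't.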
