This is a special post for quick takes by Ben Amitay. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.


I had an idea for fighting goal misgeneralization. It doesn't seem very promising to me, but it does feel close to something interesting. I'd like to read your thoughts:

  1. Use inverse reinforcement learning (IRL) to learn which values are consistent with the actor's behavior.
  2. When training the model to maximize the actual reward, regularize it to score lower according to the values learned by IRL (see the sketch after this list). That way, the agent is incentivized to signal that it has no other values (and is somewhat incentivized against power-seeking).
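
To make step 2 concrete, here is a minimal sketch of the combined objective, under the assumption that IRL has already produced a set of alternative reward models consistent with the agent's behavior. The linear reward functions (`true_reward_w`, `irl_reward_ws`) and the coefficient `lam` are hypothetical stand-ins so the example runs end to end; a real implementation would plug in learned reward models and a policy-gradient loop.

```python
import numpy as np

# Hypothetical stand-ins: the true reward and the IRL-inferred
# alternative rewards are modeled as linear functions of state features.
rng = np.random.default_rng(0)
n_features = 4

true_reward_w = rng.normal(size=n_features)       # the actual reward
irl_reward_ws = [rng.normal(size=n_features)      # rewards IRL found
                 for _ in range(3)]               # consistent with behavior

def regularized_return(trajectory_features, lam=0.1):
    """Objective to maximize: return under the true reward, minus the
    return the same trajectory earns under the IRL-inferred alternative
    values. Penalizing the latter pushes the agent to avoid looking like
    it pursues any values other than the actual reward."""
    true_ret = trajectory_features @ true_reward_w
    alt_ret = sum(trajectory_features @ w for w in irl_reward_ws)
    return true_ret.sum() - lam * alt_ret.sum()

# Example: score a random 10-step trajectory (10 states x 4 features).
traj = rng.normal(size=(10, n_features))
print(regularized_return(traj))
```

The weight `lam` trades off maximizing the actual reward against not scoring well under any of the other values IRL attributes to the agent; set it too high and the agent sacrifices real performance just to avoid appearing to have other goals.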

I probably don't understand the shortform format, but it seems like others can't create top-level comments, so you can comment here :)