substantial reductions in sycophancy, beyond whatever was achieved with Meta's finetuning

Where is this shown? Most of the results don't evaluate performance without steering. And the TruthfulQA results only show a clear improvement from steering for the base model without RLHF. 

I'm told that a few professors in AI safety are getting approached by high net worth individuals now but don't have a good way to spend their money. Seems like there are connections to be made.

The only team member whose name is on the CAIS extinction risk statement is Tony (Yuhuai) Wu.

(Though not everyone who signed the statement is listed under it, especially if they're less famous. And I know one person on the xAI team who privately expressed concern about AGI safety in ~2017.)

So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

The shortest description of this thought doesn't include "I should get high reward" because that's already implied by having a misaligned goal and planning with it. 

In contrast, having only the goal "I should get high reward" may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.

Interesting point. Though on this view, "Deceptive alignment preserves goals" would still become true once the goal has drifted to some random maximally simple goal for the first time.

To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn't seem to change this in practice. Given this, all kinds of goals could be "simple" as they piggyback on existing representations, requiring little additional description length.

See also: Your posts should be on Arxiv

I do agree we're leaving lots of value on the table and even causing active harm by not writing things up well, at least for Arxiv, for a bunch of reasons including some of the ones listed here. 

It's good to see some informed critical reflection on MI as there hasn't been much AFAIK. It would be good to see reactions from people who are more optimistic about MI!

I see. In that case, what do you think of my suggestion of inverting the LM? By default, it maps human reward functions to behavior. But when you invert it, it maps behavior to reward functions (possibly this is a one-to-many mapping, but this ambiguity is a problem you can solve with more diverse behavior data). Then you could use it for IRL (with the caveats I mentioned).

Which may be necessary since this:

The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).

...seems like an unreliable mapping, since any training data of the form "person did X, therefore their goal must be Y" is firstly rare and, more importantly, inaccurate or incomplete, because it's hard to describe human goals in language. On the other hand, human behavior seems easier to describe in language.

Do I read right that the suggestion is as follows:

  • Overall we want to do inverse RL (like in our paper) but we need an invertible model that maps human reward functions to human behavior.
  • You use an LM as this model. It needs to take some useful representation of reward functions as input (it could do so if those reward functions are a subset of natural language)
  • You observe a human's behavior and invert the LM to infer the reward function that produced the behavior (or the set of compatible reward functions)
  • Then you train a new model using this reward function (or functions) to outperform humans
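
The first three steps above can be sketched as a toy Bayesian inversion. This is only an illustration under loud assumptions: the hand-written table `TOY_LIKELIHOOD` stands in for an LM's P(behavior | reward description), the candidate reward descriptions and behaviors are made up, and the prior over candidates is uniform. None of these names come from the paper or the post, and a real LM would not give you a small candidate set for free.

```python
import math

# ASSUMPTION: a hand-written table replaces the LM. Each entry is
# P(behavior | reward description), i.e. the "forward" direction
# (reward function -> behavior) that the LM is supposed to model.
TOY_LIKELIHOOD = {
    ("likes coffee", "buys espresso"): 0.6,
    ("likes coffee", "buys tea"): 0.1,
    ("likes tea", "buys espresso"): 0.1,
    ("likes tea", "buys tea"): 0.7,
    ("likes hot drinks", "buys espresso"): 0.4,
    ("likes hot drinks", "buys tea"): 0.4,
}

# ASSUMPTION: a small, fixed set of candidate reward descriptions.
CANDIDATES = ["likes coffee", "likes tea", "likes hot drinks"]


def log_likelihood(reward, behavior):
    """Log P(behavior | reward) under the toy forward model."""
    return math.log(TOY_LIKELIHOOD[(reward, behavior)])


def invert(behaviors, candidates=CANDIDATES):
    """Posterior over candidate rewards given observed behaviors.

    Applies Bayes' rule with a uniform prior and treats the observed
    behaviors as independent given the reward description.
    """
    log_post = {
        r: sum(log_likelihood(r, b) for b in behaviors) for r in candidates
    }
    # Subtract the max before exponentiating for numerical stability.
    top = max(log_post.values())
    unnorm = {r: math.exp(v - top) for r, v in log_post.items()}
    total = sum(unnorm.values())
    return {r: u / total for r, u in unnorm.items()}


if __name__ == "__main__":
    print(invert(["buys espresso"]))
    print(invert(["buys espresso", "buys tea"]))
```

With a single observation ("buys espresso"), "likes coffee" gets the most posterior mass but "likes hot drinks" stays plausible. After also observing "buys tea", "likes hot drinks" becomes the most probable candidate. That is the sense in which more diverse behavior data can narrow the set of compatible reward functions, at least in this toy setting. The fourth step (training a new model against the inferred reward) is not covered here.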

This sounds pretty interesting! Although I see some challenges:

  • How can you represent the reward function? On the one hand, an LM (or another behaviorally cloned model) should use it as an input, so it should be represented as natural language. On the other hand, some algorithm should maximize it in the final step, so it would ideally be a function that maps inputs to rewards.
  • Can the LM generalize OOD far enough? It's trained on human language which may contain some natural language descriptions of reward functions, but probably not the 'true' reward function which is complex and hard to describe, meaning it's OOD.
  • How can you practically invert an LM?
  • What to do if multiple reward functions explain the same behavior? (probably out of scope for this post)

Great to see this studied systematically - it updated me in some ways.

Given that the study measures how likeable, agreeable, and informative people found each article, regardless of the topic, could it be that the study measures something different from "how effective was this article at convincing the reader to take AI risk seriously"? In fact, it seems like the contest could have been won by an article that isn't about AI risk at all. The top-rated article (Steinhardt's blog series) spends little time explaining AI risk: mostly just (part of) the last of four posts. The main point of this series seems to be that 'More Is Different for AI', which is presumably less controversial than focusing on AI risk, but not necessarily effective at explaining AI risk.
