Beth Barnes

Alignment researcher. Views are my own and not those of my employer.

Wiki Contributions


What do you think is important about pure RL agents vs RL-finetuned language models? I expect the first powerful systems to include significant pretraining so I don't really think much about agents that are only trained with RL (if that's what you were referring to).

How were you thinking this would measure Goodharting in particular? 

I agree that seems like a reasonable benchmark to have for getting ML researchers/academics to work on imitation learning/value learning. I don't think I'm likely to prioritize it - I don't think 'inability to learn human values' is going to be a problem for advanced AI, so I'm less excited about value learning as a key thing to work on.

video game companies can be extremely well-aligned with delivering a positive experience for their users

This doesn't seem obvious to me; video game companies are incentivized to make games that are as addicting as possible without putting off new users/getting other backlash. 

My impression is that they don't have the skills needed for successful foraging. There's a lot of evidence for some degree of cultural accumulation in apes and e.g. macaques. But I haven't looked into this specific claim super closely.

Thanks for the post! One narrow point:
You seem to lean at least a bit on the example of 'much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner'.  It seems to me that
a. You don't need to go to humans before you get significant accumulation of important cultural knowledge outside genes (e.g. my understanding is that unaccultured chimps die in the wild)
b.  the genetic bottleneck is a somewhat weird and contingent feature of animal evolution, and I don't think there's a clear analogy in current LLM ML paradigms

I'm not making any claims about takeoff speeds in models, just saying that I don't think arguments that are based on features that are (maybe) contingent on a genetic bottleneck support the same inference for ML. Can you make the same argument without leaning on the genetic bottleneck, or explain to me why the analogy in fact should hold?

It seems to me like this should be pretty easy to do and I'm disappointed there hasn't been more action on it yet. Things I'd try:
- reach out to various human-data-as-a-service companies like SurgeHQ, Scale, Samasource
- look for people on upwork 
- find people who write fiction on the internet (e.g. post on fanfiction forums) and offer to pay them to annotate their existing stories (not a dungeon run exactly, but I don't see why the dungeon setting is important)

I'd be interested to hear if anyone has tried these things and run into roadblocks.

I'm also interested if anyone has an explanation of why the focus is on the dungeon thing in particular rather than e.g. fiction generally.

One concern I'd have with this dataset is that the thoughts are post-hoc rationalizations for what is written rather than actually the thought process that went into it. To reduce this, you could do something like split it so one person writes the thoughts, and someone else writes the next step, without other communication.

Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don't understand exactly what's meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/long-running computation - e.g. if under a relevant input distribution, one input almost always determines the output of a complicated computation.

crossposting my comments from Slack thread:

Here are some debate trees from experiments I did on long-text QA  on this example short story:


Debater view 1

Debater view 2

Our conclusion was that we don’t expect debate to work robustly in these cases. In our case this was mostly because in cases where the debate is things like ’is there implied subtext A?’,  human debaters don’t really know why they believe some text does or doesn’t have a particular implication. They have some mix of priors about what the text might be saying (which can’t really be justified with debate), and various updates to that based on style, word choice, etc, where humans don’t necessarily have introspective access to what exactly in the text made them come to the conclusion.My guess is that’s not the limitation you’re running into here - I’d expect that to just be the depth.

There are other issues with text debates, like if the evidence is distributed across many quotes that each only provide a small amount of evidence - in this case the honest debater needs to have decent estimates for how much evidence each quote provides, so they can split their argument into something like ‘there are 10 quotes that weakly support position A’; ‘the evidence that these quotes provide is additive rather than redundant’.

[edited to fix links]

I think I’m something like 30% on ‘The highest-leverage point for alignment work is once we have models that are capable of alignment research - we should focus on maximising the progress we make at that point, rather than on making progress now, or on making it to that point - most of the danger comes after it’

Things this maybe implies:

  • We should try to differentially advance models’ ability to do alignment research relative to other abilities (abilities required to be dangerous, or abilities required to accelerate capabilities)
    • For instance, trying to make really good datasets related to alignment, e.g. by paying humans to proliferate/augment all the alignment research and writing we have so far
    • Figuring out what combination of math/code/language/arxiv etc seem to be the most conducive to alignment-relevant capabilities
    • More generally, researching how to develop models that are strong in some domains and handicapped in others
  • We should focus on getting enough alignment to extract the alignment research capabilities
    • This might mean we only need to align:
      • models that are not agentic/not actively trying to deceive you
      • Models that in many domains are subhuman
    • If we think these models are going to be close to having agency, maybe we want to avoid RL or other finetuning that incentivizes the model to think about its environment/human supervisors. Instead we might want to use some techniques that are more like interpretability or extracting latent knowledge from representations, rather than RLHF?
  • We should think about how we can use powerful models to accelerate alignment
  • We should focus more on how we would recognise good alignment research as opposed to producing it
    • For example, setups where you can safely train a fairly capable model according to some proposed alignment scheme, and see how well it works?

You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score]

Load More