Beth Barnes

Safety researcher at OpenAI. Views are my own and not those of my employer.


Visible Thoughts Project and Bounty Announcement

It seems to me like this should be pretty easy to do and I'm disappointed there hasn't been more action on it yet. Things I'd try:
- reach out to various human-data-as-a-service companies like SurgeHQ, Scale, Samasource
- look for people on Upwork
- find people who write fiction on the internet (e.g. post on fanfiction forums) and offer to pay them to annotate their existing stories (not a dungeon run exactly, but I don't see why the dungeon setting is important)

I'd be interested to hear if anyone has tried these things and run into roadblocks.

I'm also interested if anyone has an explanation of why the focus is on the dungeon thing in particular rather than e.g. fiction generally.

One concern I'd have with this dataset is that the thoughts are post-hoc rationalizations for what is written rather than actually the thought process that went into it. To reduce this, you could do something like split it so one person writes the thoughts, and someone else writes the next step, without other communication.

A Longlist of Theories of Impact for Interpretability

Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don't understand exactly what's meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/or long-running computation, e.g. if, under a relevant input distribution, one input almost always determines the output of a complicated computation.
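A toy sketch of that last case, in Python. Everything here is hypothetical and chosen only for illustration: the function names, the mixing constants, the number of rounds, and the input distribution are all assumptions, not anything from the post. The computation runs a long scrambling loop over its other inputs, yet on the sampled distribution the one-line explanation "the output is `key_bit`" matches it almost every time.

```python
import random

MASK_64 = 2**64 - 1  # keep the state within 64 bits


def complicated_computation(key_bit, other_inputs, rounds=200):
    """Hypothetical long-running computation.

    It scrambles `other_inputs` for many rounds, but the scrambled state
    only flips the output in a rare case (roughly 1 in 1000).
    """
    state = other_inputs
    for _ in range(rounds):
        # Arbitrary mixing step (a linear congruential update).
        state = (state * 6364136223846793005 + 1442695040888963407) & MASK_64
    rare_override = state % 1000 == 0
    return key_bit ^ int(rare_override)


def simple_explanation(key_bit, other_inputs):
    """The short explanation: the output is just the key input."""
    return key_bit


def agreement_rate(num_samples=2000, seed=0):
    """Fraction of sampled inputs on which the simple explanation is right."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_samples):
        key_bit = rng.randint(0, 1)
        other_inputs = rng.getrandbits(32)
        if complicated_computation(key_bit, other_inputs) == simple_explanation(
            key_bit, other_inputs
        ):
            matches += 1
    return matches / num_samples


rate = agreement_rate()
print(f"simple explanation agrees on {rate:.1%} of sampled inputs")
```

The explanation is short even though the computation is neither short nor fast, which is the sense in which simplicity of explanation and simplicity of model can come apart.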

A Small Negative Result on Debate

Crossposting my comments from the Slack thread:

Here are some debate trees from experiments I did on long-text QA on this example short story:


Debater view 1

Debater view 2

Our conclusion was that we don’t expect debate to work robustly in these cases. For us, this was mostly because when the debate is about something like ‘is there implied subtext A?’, human debaters don’t really know why they believe some text does or doesn’t have a particular implication. They have some mix of priors about what the text might be saying (which can’t really be justified with debate) and various updates to those priors based on style, word choice, etc., where humans don’t necessarily have introspective access to what exactly in the text made them come to the conclusion. My guess is that’s not the limitation you’re running into here; I’d expect that to just be the depth.

There are other issues with text debates, such as when the evidence is distributed across many quotes that each provide only a small amount of evidence. In this case the honest debater needs to have decent estimates of how much evidence each quote provides, so they can split their argument into something like ‘there are 10 quotes that weakly support position A’; ‘the evidence that these quotes provide is additive rather than redundant’.
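To make the additive-versus-redundant distinction concrete, here is a minimal Python sketch. It assumes each quote can be summarised by a likelihood ratio and that additive evidence means the quotes are conditionally independent, so their log-odds add. The prior of 0.5 and the per-quote ratio of 1.5 are made-up illustrative numbers, not values from the experiments.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Update a prior on position A by adding the log-odds of each quote.

    Adding log-odds is only valid if the quotes are conditionally
    independent, i.e. their evidence is additive rather than redundant.
    """
    log_odds = math.log(prior / (1 - prior))
    log_odds += sum(math.log(ratio) for ratio in likelihood_ratios)
    return 1 / (1 + math.exp(-log_odds))


PRIOR = 0.5           # hypothetical prior on position A
WEAK_QUOTE_LR = 1.5   # hypothetical likelihood ratio for one weak quote

# Ten weak quotes whose evidence is fully additive.
additive = posterior_probability(PRIOR, [WEAK_QUOTE_LR] * 10)

# Ten weak quotes that are fully redundant count as a single quote.
redundant = posterior_probability(PRIOR, [WEAK_QUOTE_LR])

print(f"additive:  {additive:.3f}")
print(f"redundant: {redundant:.3f}")
```

Under these assumptions the same ten quotes give a posterior of roughly 0.98 if they are additive and 0.6 if they are redundant, so the outcome of the debate can hinge almost entirely on the second claim.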

[edited to fix links]

[Link] A minimal viable product for alignment

I think I’m something like 30% on ‘The highest-leverage point for alignment work is once we have models that are capable of alignment research - we should focus on maximising the progress we make at that point, rather than on making progress now, or on making it to that point - most of the danger comes after it’

Things this maybe implies:

  • We should try to differentially advance models’ ability to do alignment research relative to other abilities (abilities required to be dangerous, or abilities required to accelerate capabilities)
    • For instance, trying to make really good datasets related to alignment, e.g. by paying humans to proliferate/augment all the alignment research and writing we have so far
    • Figuring out what combination of math/code/language/arxiv etc seem to be the most conducive to alignment-relevant capabilities
    • More generally, researching how to develop models that are strong in some domains and handicapped in others
  • We should focus on getting enough alignment to extract the alignment research capabilities
    • This might mean we only need to align:
      • models that are not agentic/not actively trying to deceive you
      • models that are subhuman in many domains
    • If we think these models are going to be close to having agency, maybe we want to avoid RL or other finetuning that incentivizes the model to think about its environment/human supervisors. Instead we might want to use some techniques that are more like interpretability or extracting latent knowledge from representations, rather than RLHF?
  • We should think about how we can use powerful models to accelerate alignment
  • We should focus more on how we would recognise good alignment research as opposed to producing it
    • For example, setups where you can safely train a fairly capable model according to some proposed alignment scheme, and see how well it works?

[Link] A minimal viable product for alignment

You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] than on [proposals generated by a somewhat superhuman model trying to get a maximal score].

[Link] A minimal viable product for alignment

IMO, the alignment MVP claim Jan is making is approximately ‘we only need to focus on aligning narrow-ish alignment research models that are just above human level, which can be done with RRM (and maybe some other things, but no conceptual progress?)’,
and requires:

  1. We can build models that are:
    1. not dangerous themselves
    2. capable of alignment research
    3. alignable with RRM to the point that we can get useful research out of them
  2. We can build these models before [anyone builds models that would be dangerous without [more progress on alignment than is required for aligning the above models]]
  3. We have these models for long enough before danger and/or the models speed up alignment progress by enough that the alignment progress made during this time is comparably large to or larger than the progress made up to that date.

I'd imagine some cruxes to include:
- whether it's possible to build models capable of somewhat superhuman alignment research that do not have inner agents
- whether people will build systems that require conceptual progress in alignment to make safe before we can build the alignment MVP and get significant work out of it

Naturalism and AI alignment

As written there, the strong form of the orthogonality thesis states 'there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.'

I don't know whether that's intended to mean the same as 'there are no types of goals that are more "natural", that are easier to build agents to pursue, or that you're more likely to get if you have some noisy process for creating agents'.

I feel like I haven't seen a good argument for the latter statement, and it seems intuitively wrong to me.

Considerations on interaction between AI and expected value of the future

Yeah, I'm particularly worried about the second comment/last paragraph: people not actually wanting to improve their values, or only wanting to improve them in ways we think are not actually an improvement (e.g. wanting to have purer faith).

Visible Thoughts Project and Bounty Announcement

Random small note: the 'dungeon' theme is slightly... culturally off-putting (or something) for me, as someone who has never been into this kind of thing or played any of these games, and who is therefore a bit confused about what exactly it involves and has vague negative associations with it (I guess because dungeons sound unpleasant?). I wonder if something a bit blander, like a story, play, or AI assistant setting, could be better?

Visible Thoughts Project and Bounty Announcement

Someone who wants to claim the bounty could just buy the dataset from one of the companies that does this sort of thing, if they're able to produce a sufficiently high-quality version, I assume? Would that be in the spirit of the bounty?
