Safety researcher at OpenAI. Views are my own and not those of my employer.
Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don't understand exactly what's meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/or long-running computation - e.g. if, under a relevant input distribution, one input almost always determines the output of a complicated computation.
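A toy sketch of that last case, in Python. Everything here is made up for illustration (`slow_circuit`, the input distribution, the constants): the function does a lot of work over most of its inputs, but under the assumed distribution the one-line explanation "output = x[0]" is almost always right.

```python
import random

def slow_circuit(x):
    # Hypothetical stand-in for a complicated, long-running computation:
    # it mixes every input except x[0] through many rounds of arithmetic.
    acc = 0
    for _ in range(200):
        for v in x[1:]:
            acc = (acc * 31 + v) % 1_000_003
    # The expensive result only matters when x[1] == 0, which is rare under
    # the assumed input distribution; otherwise the output is just x[0].
    return acc % 2 if x[1] == 0 else x[0]

def sample_input(rng):
    # Assumed input distribution: x[0] is a fair bit, the rest are uniform in 0..99.
    return [rng.randint(0, 1)] + [rng.randint(0, 99) for _ in range(7)]

def agreement_with_first_input(n=1000, seed=0):
    # Fraction of sampled inputs on which the simple explanation matches the output.
    rng = random.Random(seed)
    xs = [sample_input(rng) for _ in range(n)]
    return sum(slow_circuit(x) == x[0] for x in xs) / n

rate = agreement_with_first_input()
print(f"'output = x[0]' matches on {rate:.1%} of sampled inputs")
```

The explanation is simple even though the computation is neither short nor fast, which is the sense in which simplicity of explanation and simplicity of the model can come apart.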
crossposting my comments from Slack thread:
Here are some debate trees from experiments I did on long-text QA on this example short story:
Our conclusion was that we don’t expect debate to work robustly in these cases. For us, this was mostly because when the debate is over something like ‘is there implied subtext A?’, human debaters don’t really know why they believe some text does or doesn’t have a particular implication. They have some mix of priors about what the text might be saying (which can’t really be justified with debate), and various updates to that based on style, word choice, etc., where humans don’t necessarily have introspective access to what exactly in the text made them come to the conclusion. My guess is that’s not the limitation you’re running into here - I’d expect that to just be the depth.
There are other issues with text debates, like if the evidence is distributed across many quotes that each only provide a small amount of evidence - in this case the honest debater needs to have decent estimates for how much evidence each quote provides, so they can split their argument into something like ‘there are 10 quotes that weakly support position A’; ‘the evidence that these quotes provide is additive rather than redundant’.
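To make the additive vs redundant distinction concrete, here is a rough sketch in Python. The numbers are assumptions for illustration only (a 50% prior, and each quote being 1.5x more likely under position A than under the alternative), and `posterior` is a hypothetical helper, not something from the experiments.

```python
def posterior(prior, likelihood_ratios):
    # Bayes' rule in odds form. Multiplying the ratios together is only valid
    # if the pieces of evidence are conditionally independent (i.e. additive
    # in log-odds), which is exactly the claim the honest debater must defend.
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

weak_lr = 1.5   # assumed strength of one weak quote
prior = 0.5     # assumed prior on position A

# Ten quotes that each carry independent evidence.
additive = posterior(prior, [weak_lr] * 10)

# Ten quotes that all reflect the same underlying fact count once.
redundant = posterior(prior, [weak_lr])

print(f"additive:  {additive:.3f}")
print(f"redundant: {redundant:.3f}")
```

Under these assumptions the same ten quotes take you to about 0.98 if they are additive and only to 0.60 if they are redundant, so the debate can hinge almost entirely on that second claim.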
[edited to fix links]
I think I’m something like 30% on ‘The highest-leverage point for alignment work is once we have models that are capable of alignment research - we should focus on maximising the progress we make at that point, rather than on making progress now, or on making it to that point - most of the danger comes after it’
Things this maybe implies:
You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score]
IMO, the alignment MVP claim Jan is making is approximately 'we only need to focus on aligning narrow-ish alignment research models that are just above human level, which can be done with RRM (and maybe some other things, but no conceptual progress?)'
and requires:
I'd imagine some cruxes to include:
- whether it's possible to build models capable of somewhat superhuman alignment research that do not have inner agents
- whether people will build systems that require conceptual progress in alignment to make safe before we can build the alignment MVP and get significant work out of it
As written there, the strong form of the orthogonality thesis states 'there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.'
I don't know whether that's intended to mean the same as 'there are no types of goals that are more "natural", or that are easier to build agents to pursue, or that you're more likely to get if you have some noisy process for creating agents'.
I feel like I haven't seen a good argument for the latter statement, and it seems intuitively wrong to me.
Yeah, I'm particularly worried about the second comment/last paragraph - people not actually wanting to improve their values, or only wanting to improve them in ways we think are not actually an improvement (e.g. wanting to have purer faith).
Random small note - the 'dungeon' theme is slightly... culturally off-putting (or something) for me, as someone who's never been into this kind of thing or played any of these games, and is therefore a bit confused about what exactly this involves and has vague negative associations (I guess because dungeons sound unpleasant?). I wonder if something a bit blander like a story, play, or AI assistant setting could be better?
Someone who wants to claim the bounty could just buy the dataset from one of the companies that does this sort of thing, if they're able to produce a sufficiently high-quality version, I assume? Would that be in the spirit of the bounty?
It seems to me like this should be pretty easy to do and I'm disappointed there hasn't been more action on it yet. Things I'd try:
- reach out to various human-data-as-a-service companies like SurgeHQ, Scale, Samasource
- look for people on Upwork
- find people who write fiction on the internet (e.g. post on fanfiction forums) and offer to pay them to annotate their existing stories (not a dungeon run exactly, but I don't see why the dungeon setting is important)
I'd be interested to hear if anyone has tried these things and run into roadblocks.
I'm also interested if anyone has an explanation of why the focus is on the dungeon thing in particular rather than e.g. fiction generally.
One concern I'd have with this dataset is that the thoughts are post-hoc rationalizations for what is written rather than actually the thought process that went into it. To reduce this, you could do something like split it so one person writes the thoughts, and someone else writes the next step, without other communication.