Posts

Sorted by New

Wiki Contributions

Comments

Thanks, yeah I meant that I was interested in a solution that would scale to arbitrarily superhuman AI capabilities with a "mere" capabilities hit/cost (perhaps a very large cost that grows with AI capability, but does not impose a bound on the ultimate capability of the aligned system).  So this was a useful clarification for me in terms of understanding your perspective; I may be wrong but I could imagine it might be useful to lead with this a bit more, ie "we don't know of and would be very interested in solutions that might be extremely costly but that avoid all counter-examples".  Possibly you already say this and I just missed it.

Apologies for a possibly naive comment/question, perhaps this has been discussed elsewhere and you can just direct me there.  But anyway...

I would find it helpful to see a strategy that ARC believes does in fact solve ELK, but fails only because it requires taking an unacceptably large capabilities hit.  I would find this helpful for several reasons, namely 

(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples, 
(2) it would give me a better sense for how optimistic to be about the approach, since it's often easier to start from an inefficient solution and make it more efficient, than it is to find an inefficient solution in the first place, and/or 
(3) if you have trouble identifying such a solution, then it would suggest to me that finding one might be a useful research direction.

I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, made several great points that I agree with and that are similar in spirit to my suggestions.

Anyway, based on a fairly similar motivation of wanting to be able to “ask a LM what it’s actually thinking/expecting”, combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.

When inserting this metadata for pretraining, we made sure to do so completely randomly, i.e. a book title might be inserted anywhere within a book (maybe several times in different context windows etc). We added separate <META_START> and <META_STOP> tokens to indicate the beginning and end of metadata, but that’s it. The motivation was to ensure that this “thought stream” was in-distribution at all positions within the context, while conversely making it easy to never sample it (by declining to sample the start token). This means that we can both use it when prompting, and use it as a query -- ie we can ask the model, at any time, “how likely is this to be from the NYTimes vs from 4Chan” by evaluating the logprobs of text enclosed by the tokens. With this specification, one can do a kind of “metadata beam search” where you prompt, sample, evaluate, cull, and repeat. 


We generally found that this sort of works, in that the mutual information between these labels and the text goes up with model size, and you can use these metadata tags as filters to get rid of some of the most irrelevant text. But the results weren’t immediately stunning, and so we didn’t investigate them much further (to be clear, this was mostly because we prioritized other things more highly, rather than because we don't view this as worthwhile).  

So my general suggestion would be to start off with something very cheap first, like the above. At the very least, this will mean that when you finetune on higher quality data, your format is already on-distribution. But hopefully it’ll also help you to calibrate expectations and give you a better sense for exactly what kind of data you want to shell out money for.

Beyond that, I agree with what Stella said -- it seems easier and better to focus first on shorter passages, both for human-sourcing reasons, and for diversity.  Typically the benefits we see from finetuning grow with something like the log of the dataset size, so a small number of shorter examples should quickly give you an idea of what kind of progress you can expect.

If it were me, I’d also try to increase RoI by asking people to add commentary to existing books, rather than having people write from scratch.  And I’d suggest making the formatting as simple and general as possible, both so that you can use and investigate it very flexibly, and to minimize regret if you change your mind in the future.