Jared Kaplan
Jared Kaplan has not written any posts yet.

Jared Kaplan has not written any posts yet.

Apologies for a possibly naive comment/question, perhaps this has been discussed elsewhere and you can just direct me there. But anyway...
I would find it helpful to see a strategy that ARC believes does in fact solve ELK, but fails only because it requires taking an unacceptably large capabilities hit. I would find this helpful for several reasons, namely
(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples,
(2) it would give me a better sense for how optimistic to be about the approach, since it's often easier to start from an inefficient solution and make it more efficient, than it is to find an inefficient solution in the first place, and/or
(3) if you have trouble identifying such a solution, then it would suggest to me that finding one might be a useful research direction.
I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, made several great points that I agree with and that are similar in spirit to my suggestions.
Anyway, based on a fairly similar motivation of wanting to be able to “ask a LM what it’s actually thinking/expecting”, combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most... (read 449 more words →)
Thanks, yeah I meant that I was interested in a solution that would scale to arbitrarily superhuman AI capabilities with a "mere" capabilities hit/cost (perhaps a very large cost that grows with AI capability, but does not impose a bound on the ultimate capability of the aligned system). So this was a useful clarification for me in terms of understanding your perspective; I may be wrong but I could imagine it might be useful to lead with this a bit more, ie "we don't know of and would be very interested in solutions that might be extremely costly but that avoid all counter-examples". Possibly you already say this and I just missed it.