Why are you sure that effective "evals" can exist even in principle?
Relatedly, the point that is least clear to me is what exactly it would mean to solve the "proper elicitation problem", and what exactly the "requirements" laid out by the blue line on the graph are. I think I'd need to get clear on this problem scope before I could assess whether the elicitation gap can even in principle be crossed via the methods being proposed (i.e. better design and coverage of black-box evaluations).
As a non-example: possessing the kind of foundational scientific understanding that would let someone confidently say "We have run this evaluation suite and we now know, once and for all, that this system is definitely not capable of x, regardless of whatever elicitation techniques are developed in the future" seems to me to be Science-of-AI-complete, and is thus a non-starter as a north star for an agenda aimed at developing stronger inability arguments.
When I fast-forward the development of black-box evals aimed at supporting inability arguments, I see us arriving at a place where we have something like:

- reasonably broad coverage of evaluations across the threat domains we care about, and
- empirical trends for how much post-training enhancements (PTEs) tend to boost measured capabilities.
This would allow us to make claims of the form: "Given these trends in PTEs, and this coverage in evaluations, experts have vibed out that the probability of this model being capable of producing catastrophe x is under an acceptable threshold", for a wider range of domains. To be clear, that's a better place than we are now and something worth striving for, but not something I would qualify as "having solved the elicitation problem". There are fundamental limitations to the kinds of claims which black-box evaluations can reasonably support, and if we are to posit that the "elicitation gap" is solvable, that claim needs to carry the right sorts of qualifications, amendments and hedging so that it lands on the right side of this fundamental divide.
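To make the shape of that claim concrete, here is a deliberately crude sketch (all numbers, the uplift multiplier, and the threshold are made up by me for illustration; none of this comes from the post) of how eval coverage plus PTE trends might be combined into such a probability-style statement:

```python
# Toy sketch only: hypothetical numbers illustrating the *shape* of the claim,
# not a real methodology. "best_observed" is the top score any elicitation
# effort achieved on our eval suite for domain x; "pte_uplift" is an
# extrapolated guess at how much future post-training enhancements could
# raise that score; "danger_bar" is the score we would treat as real
# capability for catastrophe x. All three are judgment calls.

best_observed = 0.22   # best pass rate across our current elicitation attempts
pte_uplift    = 2.0    # guessed multiplier from future PTEs (scaffolds, fine-tunes, tools)
danger_bar    = 0.70   # pass rate we would treat as "capable of catastrophe x"

projected = min(1.0, best_observed * pte_uplift)

if projected < danger_bar:
    print(f"Projected elicited score {projected:.2f} < {danger_bar:.2f}: "
          "inability argument holds, *conditional on the uplift guess*.")
else:
    print("Projected score clears the danger bar; inability argument fails.")
```

The point of the sketch is that the multiplier and the bar are exactly the places where "expert vibes" enter, which is why I don't think this amounts to solving the elicitation problem.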
Note, I don't work on evals and expect that others have better models of this than I do. My guess is that @Marius Hobbhahn has strong hopes that the field will develop more formal statistical guarantees and other meta-evaluative practices, as outlined in the references in the science of evals post, and would thus predict a stronger safety-case sketch than the one laid out in the previous paragraph; but what the type signature of that sketch would be, and consequently how reasonable it is given the fundamental limitations of black-box evaluations, is currently unclear to me.
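For what it's worth, the kind of "formal statistical guarantee" I can imagine black-box evals supporting looks something like the standard rule-of-three / exact binomial bound below. Useful, but note that it bounds the model's success rate *under the elicitation we actually tried*, which is exactly where the elicitation gap re-enters:

```python
# Exact one-sided upper confidence bound on a model's per-trial success
# probability when it failed a task in every one of n independent trials.
# If X ~ Binomial(n, p) and we observe X = 0, then p <= 1 - alpha**(1/n)
# with confidence 1 - alpha (the "rule of three" gives ~3/n for alpha = 0.05).

def upper_bound_zero_successes(n_trials: int, alpha: float = 0.05) -> float:
    """95% (by default) upper bound on success probability after 0/n successes."""
    return 1.0 - alpha ** (1.0 / n_trials)

print(upper_bound_zero_successes(100))  # ~0.0295, close to the 3/100 rule of three
```

Even with such a bound in hand, the guarantee is conditional on those trials being representative of the best elicitation anyone will ever do, which is the assumption black-box evals can't discharge on their own.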
For clarity, how do you distinguish between P1 & P4?
I'm curious how (or whether) goal coherence over long-term plans is explained by your "planning as reward shaping" model. If planning amounts to an escalation of more and more "real" thoughts (e.g. "I'm idly thinking about prinsesstårta" -> "a fork-full of prinsesstårta is heading towards my mouth"), because these correspond to stronger activations in a valenced latent in my world model, and my thought generator is biased towards producing higher-valence thoughts, it's unclear to me why we wouldn't just default to producing untopical thoughts (e.g. "I'm idly thinking about prinsesstårta" -> "I'm thinking about being underneath a weighted blanket") and never get anything done in the world.
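To make the worry concrete, here's a toy sketch under my (possibly wrong) reading of the model, with thoughts and valence numbers I made up: if the generator's only bias is towards high-valence candidates, the pleasant-but-off-plan thought crowds out the instrumentally necessary low-valence step.

```python
import math
import random

# Toy sketch of my reading of "planning as reward shaping": a thought
# generator proposes the next thought with probability increasing in that
# thought's valence (softmax), with no separate term for topical relevance
# to the ongoing plan. All thoughts and valences are made up.

THOUGHT_VALENCE = {
    "fork-full of prinsesstårta heading to my mouth": 0.9,  # on-plan, high valence
    "walk to the kitchen to fetch a fork":            0.3,  # on-plan, necessary, low valence
    "lying underneath a weighted blanket":            0.8,  # off-plan, high valence
    "staring blankly at the wall":                    0.1,  # off-plan, low valence
}

def sample_next_thought(temperature: float = 0.2) -> str:
    """Sample a thought with softmax(valence / temperature) probabilities."""
    weights = [math.exp(v / temperature) for v in THOUGHT_VALENCE.values()]
    return random.choices(list(THOUGHT_VALENCE), weights=weights, k=1)[0]

counts = {t: 0 for t in THOUGHT_VALENCE}
for _ in range(10_000):
    counts[sample_next_thought()] += 1
print(counts)  # the weighted-blanket thought vastly outnumbers "fetch a fork"
```

Under this purely valence-weighted picture, the boring-but-necessary step almost never gets generated, which is the "never get anything done" failure mode I'm gesturing at; the question is what extra mechanism keeps the chain on-topic.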
One reply would be to bite the bullet and say: yup, humans do in fact have deficits in their long-term planning strategies, and this accounts for them. But that feels unsatisfying; if the story given in my comment above were the only mechanism, I'd expect us to be much worse. Another possible reply is that "non-real" thoughts don't reliably lead to rewards from the steering subsystem, so the thought assessors down-weight the valence associated with these thoughts, leading to them being generated with lower frequency; consequently, the only thought sequences which remain are ones which terminate in "real" thoughts and stimulate accurate predictions from the steering subsystem. This seems plausibly sufficient, but it still doesn't answer the question of why people don't arbitrarily switch into "equally real but non-topical" thought sequences at higher frequencies.