Maybe not fully understanding, but one issue I see is that without requiring "perfect prediction", one could potentially Goodhart on on the proposal. I could imagine something like:
In training GPT-5, add a term that upweights very basic bigram statistics. In "evaluation", use your bigram statistics table to "predict" most topk outputs just well enough to pass.
This would probably have a negative impact to performance, but this could possibly be tuned to be just sufficient to pass. Alternatively, one could use a toy model trained on the side that is easy to understand, and regularise the predictions on that instead of exactly using bigram statistics, just enough to pass the test, but still only understanding the toy model.
I suspect that with a tuned initial prompt that ChatGPT would do much better. For example, something like:
Simulate an assistant on the other end of a phone call, who is helping me to cook a turmeric latte in my kitchen I have never cooked before and need extremelly specific. Only speak one sentence at a time. Only explain one instruction at a time. Never say "and". Please ask clarifying questions if necessary. Only speak one sentence at a time, and await a response. Be explicit about:
- where I need to go.
- what I need to get
- where I need to bring things
Do you understand? Say "I Accept" and we can begin
I have not fully tested this, but I guess a tuned prompt of this sort would make it possible, though it is not tuned to amswer this way by default. ( ChatGPT can also simulate a virtual linux shell )
In addition, I have found it is much better when you go back and edit the prompt before an incorrect answer as it starts to reference itself a lot. Though I also expect that in this situaton having a reference recipie at the top would be useful.