This post is the result of work I did with Paul Christiano on the ideas in his “Teaching ML to answer questions honestly instead of predicting human answers” post. Beyond expanding on that post by identifying numerous problems with the proposal there, and ways in which some of those problems can be patched, I think this post also provides a useful window into what Paul-style research looks like from a non-Paul perspective.
Recommended prior reading: “A naive alignment strategy and optimism about generalization” and “Teaching ML to answer questions honestly instead of predicting human answers” (though if you struggled with “Teaching ML to answer questions honestly,” I re-explain things in a more precise way here that might be clearer...
I'd love an answer either in operations (MIPS, FLOPS, whatever) or in dollars.
Follow-up question: How many parameters did their agents have?
I just read the paper (incl. appendix) but didn't see them list the answer anywhere. I suspect I could figure it out from information in the paper, e.g. by adding up how many neurons are in their LSTMs, their various other bits, etc. and then multiplying by how long they said they trained for, but I lack the ML knowledge to do this correctly.
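To illustrate the kind of back-of-envelope estimate this would be, here is a minimal sketch of the standard LSTM parameter-count formula. The input and hidden sizes below are purely hypothetical placeholders, not numbers from the paper:

```python
def lstm_param_count(input_size: int, hidden_size: int) -> int:
    """Parameter count of a standard single-layer LSTM.

    An LSTM has four gates (input, forget, cell, output), each with an
    input->hidden weight matrix, a hidden->hidden weight matrix, and a bias.
    """
    return 4 * (hidden_size * input_size      # input->hidden weights
                + hidden_size * hidden_size   # hidden->hidden weights
                + hidden_size)                # biases

# Hypothetical sizes, NOT from the paper -- just to illustrate the method.
print(lstm_param_count(input_size=512, hidden_size=1024))  # 6,295,552 (~6.3M)
```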
Some tidbits from the paper:
For multi-agent analysis we took the final generation of the agent (generation 5) and created equally spaced checkpoints (copies of the neural network parameters) every 10 billion steps, creating a collection of 13 checkpoints.
This suggests 120 billion steps of...
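For the arithmetic: 13 checkpoints spaced every 10 billion steps spans 12 × 10B = 120 billion steps (if the first checkpoint is at step 0). Given a parameter count, one could then produce a very rough compute estimate, as in the sketch below. The ~6 FLOPs per parameter per step heuristic is borrowed from supervised-training estimates and glosses over RL-specific details (replay, batch reuse, actor vs. learner compute), so treat the result as an order-of-magnitude guess at best:

```python
def training_flops_estimate(params: float, steps: float,
                            flops_per_param_step: float = 6.0) -> float:
    """Very rough training-compute estimate.

    Assumes ~6 FLOPs per parameter per step (forward + backward pass),
    a heuristic from supervised-learning estimates -- almost certainly
    wrong in detail for this RL setup, but fine for an order of magnitude.
    """
    return flops_per_param_step * params * steps

# Hypothetical parameter count (from the sketch above); 1.2e11 steps
# from the checkpoint arithmetic.
print(f"{training_flops_estimate(params=6.3e6, steps=1.2e11):.1e}")  # ~4.5e18 FLOPs
```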
(Warning: this post is rough and in the weeds. I expect most readers should skip it and wait for a clearer synthesis later.)
In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would say” rather than “answer honestly.” In this post I want to describe another problem that feels very similar but may require new ideas to solve.
In brief, I’m interested in the case where:
I've been poking at Evan's “Clarifying Inner Alignment Terminology.” His post gives two separate pictures (the objective-focused approach, which he focuses on, and the generalization-focused approach, which he mentions at the end). We can consolidate those pictures into one and-or graph as follows:
And-or graphs make explicit which subgoals are jointly sufficient, by drawing an arc between those subgoal lines. So, for example, this claims that intent alignment + capability robustness would be sufficient for impact alignment, but alternatively, outer alignment + robustness would also be sufficient.
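To make the structure concrete, here is a minimal sketch (hypothetical, not from Evan's post) that encodes an and-or graph as a map from each goal to its alternative AND-sets of subgoals, using the two decompositions above:

```python
# And-or graph: each goal maps to a list of alternatives (OR),
# where each alternative is a set of jointly sufficient subgoals (AND).
and_or_graph = {
    "impact alignment": [
        {"intent alignment", "capability robustness"},  # one sufficient decomposition
        {"outer alignment", "robustness"},               # an alternative decomposition
    ],
}

def is_sufficient(graph, goal, achieved: set) -> bool:
    """A goal holds if it's directly achieved, or if every subgoal in
    at least one of its AND-sets holds (recursively)."""
    if goal in achieved:
        return True
    return any(all(is_sufficient(graph, sub, achieved) for sub in alt)
               for alt in graph.get(goal, []))

print(is_sufficient(and_or_graph, "impact alignment",
                    {"outer alignment", "robustness"}))  # True
```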
The red represents what belongs entirely to the generalization-focused path. The yellow represents what belongs entirely to the objective-focused path. The blue represents everything else. (In this diagram, all the blue is on both paths, but that will not be the case...