Alex Turner, Oregon State University PhD student working on AI alignment. Reach me at jobeal2[at]gmail[dot]com.
Hm. I've often imagined a "keep the diamond safe" planner just choosing a plan which a narrow-ELK-solving reporter says is OK.
How do you imagine the reporter being used?
Later in the post, I proposed a similar modification:
> I think we should modify the simplified hand-off procedure I described above so that, during training:
>
> - A range of handoff thresholds and p proportions are drawn—in particular, there should be a reasonable probability of drawing p values close to 0, close to 1, and also 0 and 1 exactly.
> - The human net runs for pn steps before calling the reporter.
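As a rough sketch of that modified training procedure (the function names and the exact sampling distribution over p are my own assumptions, not from the post):

```python
import random

def sample_handoff_proportion():
    """Sample p with extra mass on the endpoints 0 and 1, plus uniform draws
    in between (so values near 0 and near 1 also occur with reasonable probability)."""
    r = random.random()
    if r < 0.25:
        return 0.0
    if r < 0.5:
        return 1.0
    return random.random()  # otherwise, p uniform in (0, 1)

def run_episode(human_net, reporter, n_steps):
    """Run the human net for pn steps, then hand off to the reporter."""
    p = sample_handoff_proportion()
    handoff_step = int(p * n_steps)
    state = None
    for _ in range(handoff_step):
        state = human_net(state)   # human net runs for pn steps...
    return reporter(state)         # ...before calling the reporter
```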
I do think this is what happens given the current architecture. I argued that the desired outcome solves narrow ELK as a sanity check, but I'm not claiming that the desired setup is uniquely loss-minimizing.
Part of my original intuition was "the human net is set up to be useful at predicting given honest information about the situation", and "pressure for simpler reporters will force some kind of honesty, but I don't know how far that goes." As time passed I became more and more aware of how this wasn't the best/only way for the human net to help with prediction, and turned more towards a search for a crisp counterexample.
Thanks for your reply!
What we're actually doing here is defining "automated ontology identification"
(Flagging that I didn't understand this part of the reply, but don't have time to reload context and clarify my confusion right now)
If you deny the existence of a true decision boundary then you're saying that there is just no fact of the matter about the questions that we're asking to automated ontology identification. How then would we get any kind of safety guarantee (conservativeness or anything else)?
When you assume a true decision boundary, you're assuming a label-completion of our intuitions about e.g. diamonds. That's the whole ball game, no?
But I don't see why the platonic "true" function has to be total. The solution does not have to answer ambiguous cases like "the diamond is molecularly disassembled and reassembled"; we can leave those unresolved and let the reporter say "ambiguous." I might not be able to test for ambiguity-membership, but as long as the ELK solution can:
Then a planner—searching for a "Yes, the diamond is safe" plan—can reasonably still end up executing plans which keep the diamond safe. If we want to end up in realities where we're sure no one is burning in a volcano, that's fine, even if we can't label every possible configuration of molecules as a person or not. The planner can just steer into a reality where it unambiguously resolves the question, without worrying about undefined edge-cases.
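Concretely, such a planner could simply discard any plan the reporter does not resolve unambiguously (a hypothetical sketch; the three-valued reporter interface and all names here are my own assumptions):

```python
def choose_plan(candidate_plans, predictor, reporter):
    """Pick a plan the reporter unambiguously endorses; skip undefined edge cases."""
    for plan in candidate_plans:
        predicted_outcome = predictor(plan)
        verdict = reporter(predicted_outcome, "Is the diamond safe?")
        if verdict == "yes":  # unambiguous endorsement
            return plan
        # Both "no" and "ambiguous" plans are discarded: the planner steers
        # into realities where the question resolves cleanly.
    return None
```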
Why would the planner have pressure to choose something which looks good to the predictor, but is secretly bad, given that it selects a plan based on what the reporter says? Is this a Goodhart's curse issue, where the curse afflicts not the reporter (which is assumed conservative, if it's the direct translator), but the predictor's own understanding of the situation?
I don't understand why a strong simplicity guarantee places most of the difficulty on the learning problem. In the diamond situation, a strong simplicity requirement on the reporter can mean that the direct translator gets ruled out, since it may have to translate from a very large and sophisticated AI predictor?
if automated ontology identification does turn out to be possible from a finite narrow dataset, and if automated ontology identification requires an understanding of our values, then where did the information about our values come from? It did not come from the dataset because we deliberately built a dataset of human answers to objective questions. Where else did it come from?
Perhaps I miss the mystery. My first reaction is, "It came from the assumption of a true decision boundary, and the ability to recursively deploy monotonically better generalization while maintaining conservativeness."
But an automated ontology identifier that would be guaranteed safe if tasked with extrapolating our concepts still brings up the question of how that guarantee was possible without knowledge of our values. You can't dodge the puzzle.
I feel like this part is getting slippery with how words are used, in a way which is possibly concealing unimagined resolutions to the apparent tension. Why can't I dodge the puzzle? Why can't I have an intended ELK reporter which answers the easy questions, and a small subset of the hard questions, without also being able to infinitely recurse an ELK solution to get better and better conservative reporters?
Why is this what you often imagine? I thought that in the classic ELK architectural setup, the planner uses the outputs of both a predictor and a reporter in order to make its plans, e.g. using the reporter to grade plans and finding the plan which most definitely contains a diamond (according to the reporter). And the simplest choice would be brute-force search over action sequences.
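The setup I had in mind can be sketched as follows (brute force over action sequences, with the reporter grading the predictor's output; all names are illustrative, not from the ELK report):

```python
from itertools import product

def plan(actions, horizon, predictor, reporter):
    """Brute-force search: score every action sequence by the reporter's
    confidence that the diamond is safe, and return the best one."""
    best_seq, best_score = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        prediction = predictor(seq)   # predictor conditions on the action sequence
        score = reporter(prediction)  # reporter grades the predicted outcome
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq
```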
After all, here's the architecture:
But in your setup, the planner would be another head branching from "figure out what's going on", which means that it's receiving the results of a computation already conditioned on the action sequence?
How do we know that the "prediction extractor" component doesn't do additional serious computation, so that it knows something important that the "figure out what's going on" module doesn't know? If that were true, the AI as a whole could know the diamond was stolen, without the "figure out what's going on" module knowing, which means even the direct translator wouldn't know, either. Are we just not giving the extractor that many parameters?
How the power-seeking theorems relate to the selection theorem agenda.
If we understood both of these, as a bonus we would be much better able to predict P(power-seeking | environment, training process) by marginalizing over internals: summing P(power-seeking | agent internals) P(agent internals | environment, training process) over the possible agent internals.
For power-seeking, agent internals screens off the environment and training process.
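Under that screening-off claim, P(power-seeking | environment, training process) is recovered by marginalizing over internals. A toy numerical check, with made-up probabilities:

```python
# Assumed (illustrative) probabilities:
# P(power-seeking | internals = i)
p_ps_given_internals = {"goal-directed": 0.9, "myopic": 0.1}
# P(internals = i | environment, training process)
p_internals_given_env = {"goal-directed": 0.3, "myopic": 0.7}

# P(power-seeking | env, training) = sum_i P(ps | i) * P(i | env, training)
p_power_seeking = sum(
    p_ps_given_internals[i] * p_internals_given_env[i]
    for i in p_ps_given_internals
)
# 0.9 * 0.3 + 0.1 * 0.7 = 0.34
```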
I recommend reading the quoted post for clarification.