ELK Proposal: Thinking Via A Human Imitator

[-]Charlie Steiner4y40

I guess I'm still missing some of the positive story. Why should the AI optimally using the encoder-human-decoder part of the graph to do computation involve explaining clearly to the human what it thinks happens in the future?

Why wouldn't it do things like showing us a few different video clips and asking which looks most realistic to us, without giving any information about what it "really thinks" about the diamond? Or even worse, why wouldn't gradient descent learn to just treat the fixed human-simulator as a "reservoir" that it can use in unintended ways?

[-]TurnTrout4y20

I do think this is what happens given the current architecture. I argued that the desired outcome solves narrow ELK as a sanity check, but I'm not claiming that the desired setup is uniquely loss-minimizing.

Part of my original intuition was "the human net is set up to be useful at predicting given honest information about the situation", and "pressure for simpler reporters will force some kind of honesty, but I don't know how far that goes." As time passed I became more and more aware of how this wasn't the best/only way for the human net to help with prediction, and turned more towards a search for a crisp counterexample.

[-]derek shiller4y20

Interesting proposal!

Is there a reason you have it switch to the human net just once in the middle?

I would worry that the predictor might switch ontologies as time goes on. Perhaps, to make the best use of the human compute time, it reasons in a human ontology up until n/2. Once the threat of translation is past, it might switch to its own ontology from n/2 to n. If so, the encoder that works up to n/2 might be useless thereafter. A natural alternative would be to have it switch back and forth some random number of times at random intervals.

[-]TurnTrout4y20

Later in the post, I proposed a similar modification:

I think we should modify the simplified hand-off procedure I described above so that, during training:
A range of handoff thresholds and proportions are drawn—in particular, there should be a reasonable probability of drawing $p$ values close to 0, close to 1, and also 0 and 1 exactly.
The human net runs for $p n$ steps before calling the reporter.

[-]Thomas4y00

I really like you're putting up your proposal for all to read. It lets stumps like me learn a lot. So, thanks for showing it!

You asked for some ideas about things you want clarity on. I can't help you yet, but I'd like to ask some questions first, if I can.
Some about the technical implementation and some more to do with the logic behind your proposal.

First some technical questions:
1) What if the original AI is not some recurrent 'thinking' AI using compute steps, but something that learns a single function mapping from input to output? Like a neural net using 1 hidden layer. Would this still work?
2) When do we exactly train the encoder/decoder? We train it on the predictor states, right? But wouldn't adding the human imitator degrade our predictor performance by dimensionality reduction making our predictor some less good variant of the one we are interested in?
3) What is the difference between an encoder and a reporter that reports on all of the state? Why wouldn't all misalignment issues with the reporter occur in the encoder/human imitator in an identical way?

And some other questions:
1) I assume your plan with the human imitator step is to apply or incorporate the idea of a 'human prior' (thinking like a human)? My question is: if you have some static function sandwiched between two unbounded priorless optimizing parts, would it actually guarantee that the human prior is applied? Or could the encoder/reporter misuse the human imitator? Or in other words, you assume a faithful encoder, but why can we trust that assumption? ( I know you say that it would be implausible that for the AI to distinguish between training and the real world for example, but exactly that shift from simple to complex makes simple learned functions generalize badly ).

2) Say you have this human imitator you trained on a understandable dataset using the states of the super AI. For simplicity let's say it trained on a set of x's and y's. And let's say our human could only count to 10 (10 fingers after all), so it's all simple numbers between 0 and 10 (representation our simple scenario's). And let's say it learns that Y is some function of X like f(x): 2x+1 for example. Now this encoder comes along and it tries to explain this infinitely complex super AI to this small human imitator. It tries to 'dumb it down' but complex things are simply not something you can count on your fingers. In an effort to make the human imitator understand it starts forcing big numbers like eleventy-one down it's function. It needs to get thru enough information for the decoder to recover most of it, after all. With some effort our human imitator reproduces some numbers along it's line, something it usually does not. Can it still be said that it thinks 'like a human'?
My question in a more formal way is this: when the dataset (actions in the case of the ELK example) shifts from simple to complex, and the presumed (already exponential) mismatch between the super AI and the human imitator becomes larger ((thoughts of) states of the diamond yet unheard of) the input to human imitator shifts in complexity as well, but as it's processing this more complex information how do we know that 'think like a human' generalizes to in a space unknown to humans?

3) You proposal requires a dimensionality reduction of the state space of a super AI which might be unfathomably big, done by an encoder. And in the encoder/decoder setup, it requires this to be done without too much information loss but also require it to loose enough get it into the human imitator. This leaves some window. Couldn't it be said that in the worst case, there is no such window? (like described in the mentioned report in the example using a simple interpretable net).

[-]Charlie Steiner4y10

I think the proposal is not to force the AI to funnel all its state through the human, but instead to make the human-simulation available as extra computing resources that the most-effective AIs should take advantage of.

^{^}

This proposal only seems to require that both agents reason in a way where “amount of time thinking” is a sensible parameter.

^{^}

Perfect imitation is a simplifying assumption which may not be realistic, but whose presence does not make ELK obviously trivial. I think that if this approach works given the assumption, we can probably relax it.

^{^}

The decoder affects the AI’s predictive capabilities during training; it is logically unrelated to whether the system solves ELK, or to its capabilities during deployment. Practically, absent a good decoder, there is little reason for a faithful encoder to be learned.

^{^}

Even if this holds, however, we will still have to reason about whether the two components can be jointly learned, or if the intended encoder can be learned before the reporter. Let’s set that aside for now.

^{^}

We avoid the pitfalls of regularizing the ELK-reporter because our intended reporter doesn’t have to translate from the superhuman predictor, it just has to deal with the constant-size frozen human imitator network.

^{^}

This logic fails if the AI predictor can get pixel-perfect video prediction on the training distribution, even without using the human’s portion of the time budget. However, this is presumably dealt with by having some chance of p being close to 1—so that the human net has to do most of the information processing.

^{^}

This probably rules out a class of “fragile” encodings which don’t make sense upon human reflection, which might push towards more faithful and robust encoders.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

20

ELK Proposal: Thinking Via A Human Imitator

20

Previously, on ELK

Thinking Via A Human Imitator

Training procedure

Desired outcome

Learning the reporter

Learning the decoder

Learning the encoder

Candidate counterexamples

Competitiveness

Conclusion