Thomas — AI Alignment Forum

tl;dr as of 18/2/2022
The goal is to educate me and maybe others. I make some statements, you tell me how wrong I am (please).

After input from P. (many thanks) and an article by Paul Christiano this statement stands yet uncorrected:

In the worst case, the internal state of the predictor is highly correlated within itself and multiple mappings with zero loss from the internal state to the desired extraction of information exist. The only solution is to work with some prior belief about how the internal state maps to the desired information. But as by design of the contest, this is not possible as (in the worst case) a human cannot interpret the internal state nor can he interpret complex actions (and so cannot reason about it and/or form a prior belief). The solution to this second problem is to learn a prior from a smaller human-readable dataset, for example simple information as a function of simple actions, and apply it to (or force it upon) our reporter (as described by the mentioned article).

To my eyes this implies that there is a counterexample to all of the following types of proposal:
1) Datasets including only actions, predictions, internal states and desired information, be they large or small, created by smart or stupid humans (I mean the theory, not the authors of the proposal), with or without extra information from within the vault.
2) "Simple" designs for the reporter using some prior belief about how the internal state should map.
3) Having a strong prior belief (as the author) about how the reporter will map, using the above two points.

And to my eyes this leaves room only to proposals that find out how to:
1) Distinguish reporters between human-imitators and translators without creating a simple reporter
2) Machine learn how to transcribe a prior belief learned from a simple dataset to a larger complex dataset, without creating another black box AI with all of the faults mentioned above.

Please, feel free to correct me and thank you in advance if you do!

Hi all,

I'm just a passerby. A few days ago Robert Miles and his wonderful YouTube channel pointed me in the direction of this contest. It's good to know that I have no qualifications for anything close to this field, but it got me thinking. In all honesty, I probably should not have entered anything and waste anyone's time. But hey, there was a deadline and a prize, so I did.

Because my proposal will probably end in the trash, I'm set on learning as much as I can from you smart people. Get my prize in knowledge as it were (the bigger price, I think).

My question
My intuition is that there can be no such setup that guarantees a correct reporter. My question to you is: Is my logic sound? If not, where do I err?

Setup
Let's say the 'real world' causal graph is (using -> for directed graphs):

A -> G

Where A is some actions and G is some small detail we care about along the way.

And our super AI looks like this (using :> for input/output of functions):

A :> [I] :> S

Where A is the actions as before, I is this complex opaque inner state and S is the predicted state after the actions.

And our reporter looks like this:

I :> G

Where I is the internal state of the bigger AI again and G is that small piece of information we'd like to elicit from the inner state. We train this reporter on a dataset containing P(I|A) and a true P(G|A) until we get zero loss.

Now we want to know if our reporter (I :> G) generalizes well. In other words we want to know if it has learned the correct mapping between some part of I and G.

My thinking, the first way
Once, some time ago, our perfect AI was trained to learn the joint distribution P(A,S). It learned that S is a non-linear, complex function of A using some complex, layered inner state I.
If we think of I as a set of parts P, then it has many parts {p1,p2,p3 ... pn}. And we can think of our AI as some graph:

A -> p1 -> p2 ...pn -> S

And they have the Markov property. So P(pn | p1..pn-1) = P(pn | pn-1). In English: each part carries the information of the layers before it else P(S | A) would not be equal to P (S | pn).
So when we set our reporter to learn the function between I and G it sees some highly correlated inputs in a joint distribution P(p1,p2,p3...pn) where each p carries information of the others.
From that input it has to construct it's own internal causal graph. What we want our reporter to learn is G as a function of P(I |A). But what graph should it construct?

A -> I -> G, which could be:

A -> p1 -> G, or
A -> p2 -> G, or
A -> p3 -> G
...
A -> pn -> G, or any variation of parts.

But let's say there was some way to conclude to only one internal graph using only one part (let's say p1), what would it require? It would require that part p1 not be correlated with the other p's. It would require that p1 does not carry any information other than about A. But, if p1 did not carry any information or correlation from the other p's, the Markov property would be broken and our perfect AI would not be perfect.

What I'm saying is that there can be no single graph learned by the reporter, because if it could it would require the super AI to be no super AI.

My thinking, the second way
Let's elaborate on this graph-thing. I use a causal graph as a stand in for a learned function. I think that it's similar enough. For example, let's say our output is a function of the input, so:

let output = AI (input)

And let's say this AI has some layers, h1 and h2 such that:

let h1 = f(input)
let h2 = g(h1)
let o = h(h2)

That the function AI can be by composition (using F# notation):

let AI = h1 >> h2 >> o

That looks a lot like a(causal) graph:

input -> h1 -> h2 -> output

Now say we create and train our reporter to zero loss. And let's assume it finds some way to correlate some part of the internal state (in our small example above, let's say: h1) to the value we want to know G. For this it gets to train on the joint (and correlated) distribution P(h1,h2) with target G.

let G = reporter (h1,h2)

and it learns the internal graph (I'll skip writing the functions):

h2 -> h1 -> G

That would be the best case. A translator.
But equally possible would be

h1 -> h2 -> G

or even worse would be if the reporter reconstructed (as described in the report) the output of the super AI, creating a human simulator.

h1 -> h2 -> S -> G

My point is, the input variables into the reporter are correlated and other values can be reconstructed. So as by the rule that from highly correlated variables no single causal graph can be concluded without outside knowledge. Alle graph-versions can map the AI internal state to our hope-to-be-elicited information, but we have no way to know what graph was internalized. Unless we make a reporter-reporter. But that would require reporters ad infinitum.

Conclusion
Reasoning along the above two methods I saw no solution to the problem of the reporter. I'm probably wrong. But I'd like to know why if I can. Thanks in advance!

Thomas

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments