scottviteri — AI Alignment Forum

Let me see if I am on the right page here.

Suppose I have some world state S, a transition function T : S → S, actions Action : S → S, and a surjective Camera : S -> CameraState. Since Camera is (very) surjective, seeing a particular camera image with happy people does not imply a happy world state, because many other situations involving nanobots or camera manipulation could have created that image.

This is important because I only have a human evaluation function H : S → Boolean, not on CameraState directly.
When I look at the image with the fake happy people, I use a mocked up H' : CameraState → Boolean := λ cs. H(Camera⁻¹(cs)). The issue is that Camera⁻¹ points to many possible states, and in practice I might pick whichever state is apriori most likely according to a human distribution over world states Distₕ(S).

The trick is that if I have a faithful model M : Action × CameraState → CameraState, I can back out hidden information about the state. The idea is that M must contain information about the true state, not just CameraState, in order to make accurate predictions.

The key idea is that M(a) acts like Camera ∘ T ∘ a ∘ Camera⁻¹, so we should be able to trace out which path Camera⁻¹ took, and in turn get a probability distribution over S.
So we can make a recognizer --
Recognizer : [Action] × CameraState × M → Dist(S) :=
λ actions, cs, m. normalize([sum([L₂(M(a,cs), (C∘T∘a)(hidden_state)) a∈actions]) ∀ hidden_state ∈ Camera⁻¹(cs)])
where normalize l := l/sum(l)
And lastly we can evaluate our world state using Evaluate := λ actions, cs, m. E[H(R(actions,cs,m))], and Evaluate can be used as the evaluation part of a planning loop.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments