Cool!
It seems to me that the obvious way to cope with exploration is to allow the agents to recommend exploratory behaviors. I'm afraid that the formalism will look much uglier after this point---in particular, the performance may be equivalent to having an ensemble of agents who act in the environment, and giving influence to those agents who have behaved well in the past (as in policy gradient). (We could augment this approach by taking the minimax policy separately for each incomplete model .)
Does something in this direction play nash equilibria, or is there a problem? (Haven't thought about it much.)
It will be easier to talk about what exactly the model-based setting buys us once we have a clear account of exploration that explains how from a theoretical perspective that isn't just absorbing the whole policy. (I could imagine this somehow falling out of an exploration into decision theory, though I don't feel optimistic about it.)
From the broader perspective of the agent foundations agenda, it still seems to me like the natural next question is how you avoid or control the malignity of the universal prior. The same problem seems to afflict your universal incomplete prior, let me know if that seems wrong.
Regarding exploration, I propose the following line of attack:
Characterize classes of environments (more generally incomplete models) s.t. there is a policy with sublinear regret for all environments in the class. Something like Littlestone dimension only for the full-fledged non-oblivious case. (As far as I know, this theory doesn't exist at present?)
(Hopefully) Prove that, for sufficiently slowly falling time discount, a Bayesian / "Marketian" agent with a prior constructed from such a class, has sublinear regret for such a class.
Try to demonstrate that in multi-agent scenarios we can arrange for each agent to belong to sufficiently many incomplete models in the prior of the other agents to yield game-theoretic convergence.
Regarding Nash equilibria:
The obvious way to get something like a Nash equilibrium is to consider an incomplete model that says something like "if you follow policy , your reward will be at least that much." Then, modulo the exploration problem, if the reward estimate is sufficiently tight, your policy is guaranteed to be asymptotically at least as good as . This is not really a Nash equilibrium since each agent plays a response which is superior to some set of "simple" responses but not globally optimal. I think that in games that are in some sense "simple" it should be possible to prove that the agents are pseudorandomly sampling an actual Nash equilibrium.
Regarding malignity of the prior, AFAICT incomplete models don't improve anything. My current estimate is that this problem cannot be solved on this level of abstraction. Instead, it should be solved by carefully designing a well-informed prior over agents for the value learning process (e.g. using some knowledge of neurobiology). It is thorny.
My current estimate is that this problem cannot be solved on this level of abstraction. Instead, it should be solved by carefully designing a well-informed prior over agents for the value learning process (e.g. using some knowledge of neurobiology). It is thorny.
This seems to also be Stuart Russell's view in general, though he also imagines empirical feedback (e.g. inspection of the posterior, experience with weaker value learning systems) playing a large role. I don't think he is worried about the more exotic failure modes (I'm not sure whether I am).
Note that the same problem comes up for a an act-based/task AI's prior over the environment, or for logical uncertainty. I don't see the analogous proposal in those cases. In the best case, it seems like you will still suffer a large performance hit (if you can't make progress at this level of abstraction).
It seems quite challenging to make the AI represent the posterior in a human understandable way. Moreover, if the attacker can manipulate the posterior, it can purposefully shape it to be malicious when inspected.
Also, this is precisely the case where experience with weaker systems is close to useless. This effect only appears when the agent is capable of reasoning at least as sophisticated as the reasoning you used to come up with this problem, so the agent will be at least as intelligent as Paul Christiano. More precisely, the agent would have to be able to reason in detail about possible superintelligences, including predicting their most likely utility functions. The first AI to have this property might already be superintelligent itself.
I suspect that worrying about "exotic" failure modes might be beneficial precisely because few other people will worry about them, and the reason few other people will worry about them is that they sound like something from science fiction (even more than the "normal" AI risk stuff), which is not a good reason.
In any case, I hope that at some point we will have a mathematical model sophisticated enough to formalise this failure mode, and that would allow thinking about it much more clearly.
This post is a short informal summary of a new idea I'm starting to work on, which combines my thinking abouts "incomplete models" with Scott's logical inductors.
Motivation
Before, I speculated that generalizing Bayesian inference to include incomplete models will allow solving the grain of truth problem in a way more satisfactory than what was achieved so far. More generally, this would allow getting performance guarantees for agents in environments that are as complex or more complex than the agent.
Now, such performance guarantees are already known for learning algorithms suited to the non-realizable settings. However, as Jessica noted here, those methods don't address long-term planning due to the scarcity of training data. On the other hand, Bayesian methods do allow long-term planning: if the environment is realizable (i.e. absolutely continuous w.r.t. the prior), on-policy merging of opinions will occur at a rate that doesn't depend on the utility function. This means that for a fixed environment and sufficiently slowly falling time discount, the agent will be able to form effective long-term plans, at least as long on-policy forecasting is sufficient. Of course, realistic settings require off-policy forecasting, which requires some exploration. If we want global optimality in policy space, we would have to explore for an entire horizon which means long-term planning fails again. However, I think that satisfactory weaker optimality guarantees can be achieved by more conservative exploration, especially when "consulting a (human) expert" is an available form of "exploration".
This advantage of Bayesian agents is only applicable in the realizable case, which is an unrealistic assumption. However, the inclusion of incomplete models would bring the advantage into the non-realizable case: the environment might be arbitrarily complex, but as long as it conforms to some simple incomplete model, this model can be learnt quickly and exploited for long-term planning.
Proposal
Previously, I suggested addressing incomplete models using non-Bayesian decision rules. Here, I propose a different approach. At each moment of time, the policy is selected using normal expected utility maximization, for some posterior probability measure. However, the posterior probability measure doesn't come from updating a prior on observations. Instead, it comes from consulting an "inner prediction market": Scott's logical inductor adapted for forecasting the environment instead of forecasting logical sentences (this is somewhat similar to the "universal inductor").
In the pure forecasting setting, the formalisation seems straightforward. Instead of a "share" for each logical sentence, our market has a "share" for each event of the form e<n∈A, where A⊆On. More generally, it seems useful to consider a "share" for each continuous function f:Oω→R (the ultimate value of such a share can be any number rather than only 0 or 1). Instead of "propositionally consistent worlds" we have Oω. The "deductive process" is simply the observation of x=e<n (so that xOω are the remaining consistent worlds). The "market" itself is a sequence {μn∈P(Oω)}n∈N.
In the general setting, we can take the event space to be either (A×O)∗×A→O or (A×O)ω. In the first case, there will be some events we will never observe (similar to undecidable sentences in "classical" logical induction). In the second case, we will have to choose the policy by conditioning on it "EDT style." I suspect that the two approaches are equivalent.
Given an incomplete model Φ∈PC(E), we should be able to construct a trader TΦ that gain as long as μn∉Φ. This trader will buy the shares of some f:Oω→R s.t. Eμn[f]<minν∈ΦEν[f]. Such an f is guaranteed to exist by the Hahn-Banach separation theorem, moreover the size of the separation can probably be made (maxf−minf)dtv(μn,Φ). When the true environment satisfies μ∈Φ and TΦ is included in the set of traders that define the inductor, this should imply that dtv(μn,Φ)→0, and in particular that the policy of the agent is asymptotically Pareto optimal for Φ.