How important are MDPs for AGI (Safety)?

Regarding regret bounds, I don't think regret bounds are realistic for an AGI, unless it queried an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time picks an action on its own, we cannot be sure it is acting optimally.

First, you can have a *subjective* regret bound which doesn't require all actions to be recoverable (it does require *some* actions to be *approximately* recoverable, which is indeed the case in the real world).

Second, dealing rationally with non-recoverable actions should still translate into mathematical conditions some of which might still look like sort of regret bounds, and in any case finite MDPs are a natural starting point for analyzing them.

Third, checking regret bounds for priors in which all actions *are* recoverable serves as a sanity test for candidate AGI algorithms. It is not a *sufficient* desideratum, but I do think it is necessary.

But I think many of the difficulties with general intelligence are not captured in the simple setting

I agree that *some* of the difficulties are not captured. I am curious whether you have more concrete examples in mind than what you wrote in the post?

I don't quite know what to think of continuous MDPs. I'll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it's not a whole lot more powerful than the finite-state MDP formalism.

This seems wrong to me. Can you elaborate what do you mean by "powerful" in this context? Continuous MDPs definitely describe a large variety of environments that cannot be captured by a finite state MDP, at least not without approximations. Solving continuous MDPs can also be much more difficult than finite state MDPs. For example, any POMDP can be made into a continuous MDP by treating beliefs as states, and finding the optimal policy for a POMDP is PSPACE-hard (as opposed to the case of finite state MDPs which is P-easy).

But the upshot of those MDP techniques is mainly to not search through same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there's a good chance that we'll get for free the don't-evaluate-plans-twice behavior.

I guess that you might be thinking exclusively of algorithms that have something like a uniform prior over transition kernels. In this case there is obviously no way to learn about a state without visiting it. But we can also consider algorithms with more sophisticated priors and get much faster learning rates (if the environment is truly sampled from this prior ofc). The best example is, I think, the work of Osband and Van Roy where a regret bound is derived that scales with a certain *dimension parameter* of the hypothesis space (that can be much smaller than the number of states and actions), work on which I continued to build.

Thinking About Filtered Evidence Is (Very!) Hard

The problem is not in one of the conditions separately but in their conjunction: see my follow-up comment. You could argue that learning an exact model of Carol doesn't really imply condition 2 since, although the model *does* imply everything Carol is ever going to say, Alice is not capable of extracting this information from the model. But then it becomes a philosophical question of what does it mean to "believe" something. I think there is value in the "behaviorist" interpretation that "believing X" means "behaving optimally given X". In this sense, Alice can separately believe the two facts described by conditions 1 and 2, but cannot believe their conjunction.

How important are MDPs for AGI (Safety)?

IMO there are two reasons why finite-state MDPs are useful.

First, proving regret bounds for finite-state MDPs is just easier than for infinite-state MDPs (of course any environment can be thought of as an infinite-state MDP), so it serves as good warm-up even if you want to go beyond it. Certainly many problems can be captured already within this simple setting. Moreover, some algorithms and proof techniques for finite-state MDPs can be generalized to e.g. continuous MDPs (which is already a far more general setting).

Second, we may be able to combine finite-state MDP techniques with an algorithm that *learns the relevant features*, where "features" in this case corresponds to a mapping from histories to states. Now, of course there needn't be any projection into a finite state space that preserves the exact dynamics of the environment. However, if your algorithm can work with *approximate* models (as it must anyway), for example using my quasi-Bayesian approach, then such MDP models can be powerful.

Thinking About Filtered Evidence Is (Very!) Hard

I think there is some confusion here coming from the unclear notion of a Bayesian agent with beliefs about theorems of PA. The reformulation I gave with Alice, Bob and Carol makes the problem clearer, I think.

ACDT: a hack-y acausal decision theory

Well, being surprised by Omega seems rational. If I found myself in a real life Newcomb problem I would also be very surprised and suspect a trick for a while.

Moreover, we need to unpack "learns that causality exists". A quasi-Bayesian agent will eventually learn that it is part of a universe ruled by the laws of physics. The laws of physics are the ultimate "Omega": they predict the agent and everything else. Given this understanding, it is not more difficult than it should be to understand Newcomb!Omega as a special case of Physics!Omega. (I don't really have an understanding of quasi-Bayesian learning algorithms and how learning one hypothesis affects the learning of further hypotheses, but it seems plausible that things can work this way.)

Vanessa Kosoy's Shortform

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best *approximation* of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some *incomplete* descriptions. But, so far I haven't studied quasi-Bayesian learning algorithms much, so how do we know it is actually easier than the agnostic setting? Here is a simple example to demonstrate that it is.

Consider a multi-armed bandit, where the arm space is . First, consider the follow realizable setting: the reward is a deterministic function which is known to be a polynomial of degree at most. In this setting, learning is fairly easy: it is enough to sample arms in order to recover the reward function and find the optimal arm. It is a special case of the general observation that learning is tractable when the hypothesis space is low-dimensional in the appropriate sense.

Now, consider a closely related agnostic setting. We can still assume the reward function is deterministic, but nothing is known about its shape and we are still expected to find the optimal arm. The arms form a low-dimensional space (one-dimensional actually) but this helps little. It is impossible to predict anything about any arm except those we already tested, and guaranteeing convergence to the optimal arm is therefore also impossible.

Finally, consider the following quasi-realizable setting: each incomplete hypothesis in our class states that the reward function is *lower-bounded* by a particular polynomial of degree at most. Our algorithm needs to converge to a reward which is at least the maximum of maxima of correct lower bounds. So, the desideratum is weaker than in the agnostic case, but we still impose no hard constraint on the reward function. In this setting, we can use the following algorithm. On each step, fit the most optimistic lower bound to those arms that were already sampled, find its maximum and sample this arm next. I haven't derived the convergence rate, but it seems probable the algorithm will converge rapidly (for low ). This is likely to be a special case of some general result on quasi-Bayesian learning with low-dimensional priors.

Thinking About Filtered Evidence Is (Very!) Hard

Here's another perspective. Suppose that now Bob and Carol have symmetrical roles: each one asks a question, allows Alice to answer, and then reveals the right answer. Alice gets a reward when ey answer correctly. We can now see that perfect honesty actually *is* tractable. It corresponds to an incomplete hypothesis. If Alice learns this hypothesis, ey answer correctly any question ey already heard before (no matter who asks now and who asked before). We can also consider a different incomplete hypothesis that allows real-time simulation of Carol. If Alice learns this hypothesis, ey answer correctly any question asked by Carol. However, the *conjunction* of both hypotheses is already intractable. There's no impediment for Alice to learn both hypotheses: ey can both memorize previous answers *and* answer all questions by Carol. But, this doesn't automatically imply learning the conjunction.

Thinking About Filtered Evidence Is (Very!) Hard

From my perspective, the trouble here comes from the honesty condition. This condition hides an unbounded quantifier: "if the speaker will *ever* say something, then it is true". So it's no surprise we run into computational complexity and even computability issues.

Consider the following setting. The agent Alice repeatedly interacts with two other entities: Bob and Carol. When Alice interacts with Bob, Bob asks Alice a yes/no question, Alice answers it and receives either +1 or -1 reward depending on whether the answer is correct. When Alice interacts with Carol, Carol tells Alice some question and the answer to that question.

Suppose that Alice starts with some low-information prior and learns over time about Bob and Carol both. The honesty condition becomes "if Carol will *ever* say and Bob asks the question , then the correct answer is ". But, this condition might be computationally intractable so it is not in the prior and cannot be learned. However, weaker versions of this condition might be tractable, for example "if Carol says at time step between and , and Bob asks at time , then the correct answer is ". Since simulating Bob is still intractable, this condition cannot be expressed as a vanilla Bayesian hypothesis. However, it *can* be expressed as an incomplete hypothesis. We can also have an incomplete hypothesis that is the conjunction of this weak honesty condition with a full simulation of Carol. Once Alice learned this incomplete hypothesis, ey answer correctly *at least* those questions which Carol have already taught em or *will* teach em within 1000 time steps.

ACDT: a hack-y acausal decision theory

I think that your reasoning here is essentially the same thing I was talking about before:

...the usual philosophical way of thinking about decision theory assumes that the model of the environment is given, whereas in our way of thinking, the model is learned. This is important: for example, if AIXI is placed in a repeated Newcomb's problem, it will learn to one-box, since its model will predict that one-boxing causes the money to appear inside the box. In other words, AIXI might be regarded as a CDT, but the learned "causal" relationships are not the same as physical causality

Since then I evolved this idea into something that wins in counterfactual mugging as well, using quasi-Bayesianism.

I see the problem of counterfactuals as essentially solved by quasi-Bayesianism, which behaves like UDT in all Newcomb-like situations. The source code in your presentation of the problem is more or less equivalent to Omega in Newcomb-like problems. A TRL agent can also reason about arbitrary programs, and learn that a certain program acts as a predictor for its own actions.

This approach has some similarity with material implication and proof-based decision theory, in the sense that out of several hypothesis about counterfactuals that are consistent with observations, the decisive role is played by the most optimistic hypothesis (the one that can be exploited for the most expected utility). However, it has no problem with global accounting and indeed it solves counterfactual mugging successfully.