
Vanessa Kosoy's Shortform


I propose to call *metacosmology* the hypothetical field of study which would be concerned with the following questions:

- Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.
- Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.
- Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.
- Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

- It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.
- The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.
- In principle, we might be able to use knowledge of metacosmology in order to engineer an "atheist prior" for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.


Why do bad things happen to good people?

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, *imitation learning algorithms ^{[1]} might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have*. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes …

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert...


This seems similar to gaining uploads prior to AGI, and opens up all those superorg upload-city amplification/distillation constructions which should get past human level shortly after. In other words, the limitations of the dataset can be solved by amplification as soon as the AIs are good enough to be used as building blocks for meaningful amplification, and something human-level-ish seems good enough for that. Maybe even GPT-n is good enough for that.


That is similar to gaining uploads (borrowing terminology from Egan, we can call them "sideloads"), but it's not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you're distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that's not better than plain speedup.


To me this seems to be essentially another limitation of the human Internet archive dataset: reasoning is presented in an opaque way (most slow/deliberative thoughts are not in the dataset), so it's necessary to do a lot of guesswork to figure out how it works. A better dataset both explains and summarizes the reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3 can do that to an extent by roleplaying Feynman).
Any algorithm can be represented by a habit of thought (Turing machine style if you must), and if those are in the dataset, they can be learned. The habits of thought that are simple enough to summarize get summarized and end up requiring fewer steps. My guess is that the human faculties needed for AGI can be both represented by sequences of thoughts (probably just text, stream of consciousness style) and easily learned with current ML. So right now the main obstruction is that it's not feasible to build a dataset with those faculties represented explicitly that's good enough and large enough for current sample-inefficient ML to grok. More compute in the learning algorithm is only relevant for this to the extent that we get a better dataset generator that can work on the tasks before it more reliably.


I don't see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior "stream of thought". In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of thought that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.


I was arguing that near human level babblers (including the imitation plateau you were talking about) should quickly lead to human level AGIs by amplification via stream of consciousness datasets, which doesn't pose new ML difficulties other than design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it's not necessary). And amplified babblers should be stronger than vanilla uploads (at least implausibly well-educated, well-coordinated, high IQ humans).
For your scenario to be stable, it needs to be impossible (in the near term) to run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain less effective than very high IQ humans. Otherwise you get acceleration of technological progress, including ML. So my point is that feasibility of imitation plateau depends on absence of compute overhang, not on ML failing to capture some of the ingredients of human general intelligence.


The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas successfully capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an "intrinsic" AGI might lead to the classical "foom" scenario.
EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.


I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are developed (because generation of stream of consciousness datasets is a special case of such a job).
So the key things to make imitation plateau last for years are either sideloads requiring more compute than it looks like (to me) they require, or amplification of competent babblers into similarly competent AGIs being a hard problem that takes a long time to solve.


Another thing that might happen is a data bottleneck.
Maybe there will be a good enough dataset to produce a sideload that simulates an "average" person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are a lot less competent AI researchers than people in general).
Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That's because, unless you're training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your "episodes". Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length.
In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And simulating progress, by design, requires going off-distribution along certain dimensions, which might make things worse.

I propose a new formal desideratum for alignment: the *Hippocratic principle*. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the *user's* beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider (some extension of) a delegative IRL setting (i.e. there is a single set of input/output channels, control of which can be toggled between the user and the AI by the AI). Let π^H_υ be the user's policy in universe υ and π^A the AI policy. Let E be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability 1 for any policy. Let V_υ be the value of a state from the user's subjective POV, in universe υ. Let μ_υ be the environment in universe υ. Finally, let ζ be the AI's prior over universes and ...


(Update: I don't think this was 100% right, see here for a better version.)
Attempted summary for morons like me: AI is trying to help the human H. They share access to a single output channel, e.g. a computer keyboard, so that the actions that H can take are exactly the same as the actions AI can take. Every step, AI can either take an action, or delegate to H to take an action. Also, every step, H reports her current assessment of the timeline / probability distribution for whether she'll succeed at the task, and if so, how soon.
At first, AI will probably delegate to H a lot, and by watching H work, AI will gradually learn both the human policy (i.e. what H tends to do in different situations), and how different actions tend to turn out in hindsight from H's own perspective (e.g., maybe whenever H takes action 17, she tends to declare shortly afterwards that probability of success now seems much higher than before—so really H should probably be taking action 17 more often!).
Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate how different actions will turn out from H's perspective much better than H herself. In other words, maybe it delegates to H, and H takes action 41, and the AI is watching this and shaking its head and thinking to itself "gee you dunce you're gonna regret that", and shortly thereafter the AI is proven correct.
OK, so now what? The naive answer would be: the AI should gradually stop delegating and start just doing the thing that leads to H feeling maximally optimistic later on.
But we don't want to do that naive thing. There are two problems:
The first problem is "traps" (a.k.a. catastrophes). Let's say action 0 is Press The History Eraser Button. H never takes that action. The AI shouldn't either. What happens is: AI has no idea (wide confidence interval) about what the consequence of action 0 would be, so it doesn't take it. This is the delegative RL thing—in the explore/exploit dilemma, the AI kinda sits b


This is about right.
Notice that typically we use the AI for tasks which are hard for H. This means that without the AI's help, H's probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but they don't expect to fail soon. Therefore, the AI can safely consider policies of the form "in the short-term, do something H would do with marginal probability, in the long-term go back to H's policy". If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it's possible that in the new prognosis H still doesn't expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.


The Hippocratic principle seems similar to my concept of non-obstruction (https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility), but subjective from the human's beliefs instead of the AI's.


Yes, there is some similarity! You could say that a Hippocratic AI needs to be continuously non-obstructive w.r.t. the set of utility functions and priors the user could plausibly have, given what the AI knows. Where, by "continuously" I mean that we are allowed to compare keeping the AI on or turning off at any given moment.


"Corrigibility" is usually defined as the property of AIs who don't resist modifications by their designers. Why would we want to perform such modifications? Mainly it's because we made errors in the initial implementation, and in particular the initial implementation is not aligned. But, this leads to a paradox: if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn't it also be flawed in a way that destroys corrigibility?
In order to stop passing the recursive buck, we must assume some dimensions along which our initial implementation is not allowed to be flawed. Therefore, corrigibility is only a well-posed notion in the context of a particular such assumption. Seen through this lens, the Hippocratic principle becomes a particular crystallization of corrigibility. Specifically, the Hippocratic principle assumes the agent has access to some reliable information about the user's policy and preferences (be it through timelines, revealed preferences or anything else).
Importantly, this information can be incomplete, which can motivate altering the agent along the way. And, the agent will not resist this alteration! Indeed, resisting the alteration is ruled out unless the AI can conclude with high confidence (and not just in expectation) that such resistance is harmless. Since we assumed the information is reliable, and the alteration is beneficial, the AI cannot reach such a conclusion.
For example, consider an HDTL agent getting upgraded to "Hippocratic CIRL" (assuming some sophisticated model of relationship between human behavior and human preferences). In order to resist the modification, the agent would need a resistance strategy that (i) doesn't deviate too much from the human baseline and (ii) ends with the user submitting a favorable report. Such a strategy is quite unlikely to exist.


I think the people most interested in corrigibility are imagining a situation where we know what we're doing with corrigibility (e.g. we have some grab-bag of simple properties we want satisfied), but don't even know what we want from alignment, and then they imagine building an unaligned slightly-sub-human AGI and poking at it while we "figure out alignment."
Maybe this is a strawman, because the thing I'm describing doesn't make strategic sense, but I think it does have some model of why we might end up with something unaligned but corrigible (for at least a short period).


The concept of corrigibility was introduced by MIRI, and I don't think that's their motivation? On my model of MIRI's model, we won't have time to poke at a slightly subhuman AI, we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is "we won't know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI in our leisure". Which, sure, but I don't see what it has to do with corrigibility.
Corrigibility is neither necessary nor sufficient for safety. It's not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it's not sufficient since an AI can be "corrigible" but cause catastrophic harm before someone notices and fixes it.
What we're supposed to gain from corrigibility is having some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But it is underspecified if we don't say along which dimensions or how big the margin is. If it's infinite margin along all dimensions then corrigibility and alignment are just isomorphic and there's no reason to talk about the former.


Very interesting - I'm sad I saw this 6 months late.
After thinking a bit, I'm still not sure if I want this desideratum. It seems to require a sort of monotonicity, where we can get superhuman performance just by going through states that humans recognize as good, and not by going through states that humans would think are weird or scary or unevaluable.
One case where this might come up is in competitive games. Chess AI beats humans in part because it makes moves that many humans evaluate as bad, but are actually good. But maybe this example actually supports your proposal - it seems entirely plausible to make a chess engine that only makes moves that some given population of humans recognize as good, but is better than any human from that population.
On the other hand, the humans might be wrong about the reason the move is good, so that the game is made of a bunch of moves that seem good to humans, but where the humans are actually wrong about why they're good (from the human perspective, this looks like regularly having "happy surprises"). We might hope that such human misevaluations are rare enough that quantilization would lead to moves on average being well-evaluated by humans, but for chess I think that might be false! Computers are so much better than humans at chess that a very large chunk of the best moves according to both humans and the computer will be ones that humans misevaluate.
Maybe that's more a criticism of quantilizers, not a criticism of this desideratum. So maybe the chess example supports this being a good thing to want? But let me keep critiquing quantilizers then :P
If what a powerful AI thinks is best (by an exponential amount) is to turn off the stars until the universe is colder, but humans think it's scary and ban the AI from doing scary things, the AI will still try to turn off the stars in one of the edge-case ways that humans wouldn't find scary. And if we think being manipulated like that is bad and quantilize over actions to m


When I'm deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself?
One potentially useful model is: I'm good at evaluating and bad at searching (after all, P≠NP). I can therefore delegate searching to another agent. But, as you point out, this doesn't account for situations in which I seem to be bad at evaluating. Moreover, if the AI prior takes an intentional stance towards the user (in order to help learning their preferences), then the user must be regarded as good at searching.
A better model is: I'm good at both evaluating and searching, but the AI can access actions and observations that I cannot. For example, having additional information can allow it to evaluate better. An important special case is: the AI is connected to an external computer (Turing RL) which we can think of as an "oracle". This allows the AI to have additional information which is purely "logical". We need infra-Bayesianism to formalize this: the user has Knightian uncertainty over the oracle's outputs entangled with other beliefs about the universe.
For instance, in the chess example, if I know that a move was produced by exhaustive game-tree search then I know it's a good move, even without having the skill to understand why the move is good in any more detail.
Now let's examine short-term quantilization for chess. On each cycle, the AI finds a short-term strategy leading to a position that the user evaluates as good, but that the user would require luck to manage on their own. This is repeated again and again throughout the game, leading to overall play substantially superior to the user's. On the other hand, this play is not as good as the AI would achieve if it just optimized for winning at chess without any constraints. So, our AI might not be competitive with an unconstrained unaligned AI.


Agree with the first section, though I would like to register my sentiment that although "good at selecting but missing logical facts" is a better model, it's still not one I'd want an AI to use when inferring my values.
I think my point is if "turn off the stars" is not a primitive action, but is a set of states of the world that the AI would overwhelmingly like to go to, then the actual primitive actions will get evaluated based on how well they end up going to that goal state. And since the AI is better at evaluating than us, we're probably going there.
Another way of looking at this claim is that I'm telling a story about why the safety bound on quantilizers gets worse when quantilization is iterated. Iterated quantilization has much worse bounds than quantilizing over the iterated game, which makes sense if we think of games where the AI evaluates many actions better than the human.


I think you misunderstood how the iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn't deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user's evaluation in the end of this short-term interval.


Ah. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as quantilizing over short-term policies evaluated according to their expected utility. My story doesn't make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).


I don't understand what you mean here by quantilizing. The meaning I know is to take a random action among the top α actions, according to a given base distribution. But I don't see a distribution here, or even a clear ordering over actions (given that we don't have access to the utility function).
I'm probably missing something obvious, but more details would really help.


The distribution is the user's policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report), in the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it, for example I did it for MDPs.
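A minimal sketch of one-shot quantilization in this sense, with a toy action set and hypothetical values throughout (the base distribution plays the role of the user's policy, and the utility is the user's estimated success probability):

```python
import random

def quantilize(base, utility, phi):
    """Sample from the base distribution conditioned on the top-phi
    quantile by utility: keep the highest-utility actions until their
    cumulative base probability reaches phi, then renormalize."""
    ranked = sorted(base, key=utility, reverse=True)
    top, mass = [], 0.0
    for a in ranked:
        top.append(a)
        mass += base[a]
        if mass >= phi:
            break
    return random.choices(top, weights=[base[a] / mass for a in top])[0]

# Toy user policy over four actions, with the user's estimated success
# probability as the utility (all values hypothetical).
user_policy = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
success_est = {"a": 0.2, "b": 0.7, "c": 0.5, "d": 0.9}.__getitem__
action = quantilize(user_policy, success_est, phi=0.3)
# The top-0.3 quantile here is {"d", "b"} (utilities 0.9 and 0.7),
# so the sample is always one of those two actions.
assert action in {"d", "b"}
```

The one-shot case shown here extends to the sequential setting by re-running the quantilization at the start of each "maneuver", as described above.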


Oh, right, that makes a lot of sense.
So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?
I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?


Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is ϵ and your quantilization fraction is ϕ then the AI's probability of corruption is bounded by ϵ/ϕ.
Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn't specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
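The bound is easy to verify exactly on a toy example (all numbers hypothetical): restricting the base distribution to the top-ϕ quantile and renormalizing multiplies each kept action's probability by at most 1/ϕ, so a corrupted set with base mass ϵ ends up with mass at most ϵ/ϕ.

```python
def quantilized_probs(base, utility, phi):
    """Exact distribution of the phi-quantilizer: restrict the base
    distribution to the top-phi quantile by utility and renormalize."""
    ranked = sorted(base, key=utility, reverse=True)
    top, mass = [], 0.0
    for a in ranked:
        top.append(a)
        mass += base[a]
        if mass >= phi:
            break
    return {a: base[a] / mass for a in top}

# Hypothetical setup: action "x" is corrupted (its success report would
# be manipulated) and has base probability 0.01 under the user's policy,
# but looks maximally attractive to the quantilizer.
base = {"x": 0.01, "y": 0.39, "z": 0.6}
utility = {"x": 1.0, "y": 0.8, "z": 0.1}.__getitem__
phi = 0.2
q = quantilized_probs(base, utility, phi)
assert abs(q["x"] - 0.025) < 1e-9  # 0.01 / (0.01 + 0.39)
assert q["x"] <= 0.01 / phi        # corruption probability at most eps/phi
```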

*This idea was inspired by a correspondence with Adam Shimi.*

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A, a set of observations O, a policy is a mapping π:(A×O)∗→ΔA, a reward function is a mapping r:(A×O)∗→R, and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.
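The vacuousness can be made concrete: for any policy one can write down a reward function that pays exactly for following that policy, so the policy is trivially optimal for it. A toy sketch (histories represented as tuples of hypothetical action-observation pairs):

```python
def make_trivial_reward(pi):
    """Given a deterministic policy pi (a map from history to action),
    return a reward that pays 1 iff the last action matches pi's choice.
    Any policy is optimal for the reward constructed from itself."""
    def r(history):  # history: tuple of (action, observation) pairs
        if not history:
            return 0.0
        past, (last_action, _) = history[:-1], history[-1]
        return 1.0 if last_action == pi(past) else 0.0
    return r

# Hypothetical policy: alternate between two actions by history length.
pi = lambda hist: "left" if len(hist) % 2 == 0 else "right"
r = make_trivial_reward(pi)
assert r((("left", "obs"),)) == 1.0   # pi(()) == "left": rewarded
assert r((("right", "obs"),)) == 0.0  # deviation from pi: no reward
```

Since following pi earns reward 1 at every step and any deviation earns 0, pi maximizes the discounted sum for this contrived reward, which is exactly the degeneracy that motivates bounding description complexity below.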

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript...


I am not sure I understand your use of C(U) in the third from last paragraph where you define goal directed intelligence. As you define C it is a complexity measure over programs P. I assume this was a typo and you mean K(U)? Or am I misunderstanding the definition of either U or C?


This is not a typo.
I'm imagining that we have a program P that outputs (i) a time discount parameter γ∈Q∩[0,1), (ii) a circuit for the transition kernel of an automaton T:S×A×O→S and (iii) a circuit for a reward function r:S→Q (and, ii+iii are allowed to have a shared component to save computation time complexity). The utility function is U:(A×O)ω→R defined by
U(x) := (1−γ) Σ_{n=0}^∞ γ^n r(s^x_n)
where s^x ∈ S^ω is defined recursively by
s^x_{n+1} = T(s^x_n, x_n)
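As a sanity check, this utility can be computed numerically for a toy two-state automaton (all details hypothetical; each symbol stands in for an action-observation pair, and the infinite sum is approximated by padding the history):

```python
def automaton_utility(T, r, s0, xs, gamma, tail_steps=1000):
    """U(x) = (1 - gamma) * sum_n gamma^n * r(s^x_n), where the state
    sequence follows s^x_{n+1} = T(s^x_n, x_n). The finite history xs
    is padded with its last symbol to approximate the infinite sum."""
    total, s, g = 0.0, s0, 1.0
    for x in list(xs) + [xs[-1]] * tail_steps:
        total += g * r(s)
        g *= gamma
        s = T(s, x)
    return (1.0 - gamma) * total

# Toy automaton: state 1 is rewarded, state 0 is not; symbol "a" moves
# to state 1 and "b" to state 0.
T = lambda s, x: 1 if x == "a" else 0
r = lambda s: float(s)
u = automaton_utility(T, r, s0=0, xs=["a", "a", "a"], gamma=0.5)
# After one step the state is 1 forever:
# U = (1 - 0.5) * (0 + 0.5 + 0.25 + ...) = 0.5
assert abs(u - 0.5) < 1e-6
```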


Okay, I think this makes sense. The idea is trying to re-interpret the various functions in the utility function as a single function and asking about the notion of complexity on that function which combines the complexity of producing a circuit which computes that function and the complexity of the circuit itself.
But just to check: is T over S×A×O→S? I thought T in utility functions only depended on states and actions S×A→S?
Maybe I am confused by what you mean by S. I thought it was the state space, but that isn't consistent with r in your post which was defined over A×O→Q? As a follow up: defining r as depending on actions and observations instead of actions and states (which e.g. the definition in POMDP on Wikipedia) seems like it changes things. So I'm not sure if you intended the rewards to correspond with the observations or 'underlying' states.
One more question, this one about the priors: what are they a prior over exactly? I will use the letters/terms from https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process to try to be explicit. Is the prior capturing the "set of conditional observation probabilities" (O on Wikipedia)? Or is it capturing the "set of conditional transition probabilities between states" (T on Wikipedia)? Or is it capturing a distribution over all possible T and O? Or are you imaging that T is defined with U (and is non-random) and O is defined within the prior?
I ask because the term DKL(ζ0||ζ) will be positive infinity if ζ is zero for any value where ζ0 is non-zero. Which makes the interpretation that it is either O or T directly pretty strange (for example, in the case where there are two states s1 and s2 and two observations o1 and o2, an O where P(si|oi)=1 and P(si|oj)=0 if i≠j would have a KL divergence of infinity from ζ0 if ζ0 had non-zero probability on P(s1|o2)). So, I assume this is a prior over what the conditional observation matrices might be. I am assuming that your comment above implies tha

2

I'm not entirely sure what you mean by the state space. S is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is (A×O)∗→R, not A×O→R. I slightly abused notation by defining r:S→Q in the parent comment. Let's say it's r′:S→Q and r is defined by using T to translate the history to the (last) state and then applying r′.
The prior is just an environment i.e. a partial mapping ζ:(A×O)∗→ΔO defined on every history to which it doesn't itself assign probability 0. The expression DKL(ξ||ζ) means that we consider all possible ways to choose a Polish space X, probability distributions μ,ν∈ΔX and a mapping f:X×(A×O)∗→ΔO s.t. ζ=Eμ[f] and ξ=Eν[f] (where the expected value is defined using the Bayes law and not pointwise, see also the definition of "instrumental states" here), and take the minimum over all of them of DKL(ν||μ).

2

Actually, as opposed to what I claimed before, we don't need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself.
Given g>0, we define that "π has (unbounded) goal-directed intelligence (at least) g" when there is a prior ζ and utility function U s.t. for any policy π′, if Eζπ′[U]≥Eζπ[U] then K(π′)≥DKL(ζ0||ζ)+K(U)+g. Here, ζ0 is the Solomonoff prior and K is Kolmogorov complexity. When g=+∞ (i.e. no computable policy can match the expected utility of π; in particular, this implies π is optimal since any policy can be approximated by a computable policy), we say that π is "perfectly (unbounded) goal-directed".
Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable "hell" state. On the other hand, goal-directed intelligence differs only by O(1) between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn't depend on the UTM at all.
I think that it's also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any ϵ>0 there is g s.t. the probability to get intelligence at least g is smaller than ϵ.
Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by O(1), and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for any price vector.

1

Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong's thesis about the impossibility of deducing preferences from behavior alone.
1. Conjecture. Informally: It is unlikely to produce intelligence by chance. Formally: Denote Π the space of deterministic policies, and consider some μ∈ΔΠ. Suppose μ is equivalent to a stochastic policy π∗. Then, Eπ∼μ[g(π)]=O(C(π∗)).
2. Find an "intelligence hierarchy theorem". That is, find an increasing sequence {gn} s.t. for every n, there is a policy with goal-directed intelligence in (gn,gn+1) (no more and no less).
3. What is the computational complexity of evaluating g given (i) oracle access to the policy or (ii) description of the policy as a program or automaton?
4. What is the computational complexity of producing a policy with given g?
5. Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every (U,ζ) with C(U)<∞ and DKL(ζ0||ζ)<∞, and every ϵ>0, there exists g∈(0,∞) s.t. for every policy π with intelligence at least g w.r.t. (U,ζ), and every (~U,~ζ) s.t. π has intelligence at least g w.r.t. them, any optimal policies π∗,~π∗ for (U,ζ) and (~U,~ζ) respectively satisfy Eζ~π∗[U]≥Eζπ∗[U]−ϵ.

1

re: #5, that doesn't seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn't some U1≠U2 such that π∗(U1,ζ)=π∗(U2,ζ).
(And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and basic result in the decision theory / economics / philosophy literature.)

1

You misunderstand the intent. We're talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown U, but producing some behavior that optimizes the unknown U. Ofc if the policy you're observing is optimal then it's trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like "the policy you're observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity."
(Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.)
I am referring to this and related work by Armstrong.

1

Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...
We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do. I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper.
So...
* Do you think this analysis is correct? Or what is it missing? (maybe the assumption that the policy is deterministic is significant? This turns out to be the case for Orseau et al.'s "Agents and Devices" approach, I think https://arxiv.org/abs/1805.12387).
* Are you trying to get around this somehow? Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal directed policies?

2

My framework discards such contrived reward functions because it penalizes for the complexity of the reward function. In the construction you describe, we have C(U)≈C(π). This corresponds to g≈0 (no/low intelligence). On the other hand, policies with g≫0 (high intelligence) have the property that C(π)≫C(U) for the U which "justifies" this g. In other words, your "minimal" overhead is very large from my point of view: to be acceptable, the "overhead" should be substantially negative.

1

I think the construction gives us C(π)≤C(U)+e for a small constant e (representing the wrapper). It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper. So then you would never have C(π)≫C(U). What am I missing/misunderstanding?

2

For the contrived reward function you suggested, we would never have C(π)≫C(U). But for other reward functions, it is possible that C(π)≫C(U). Which is exactly why this framework rejects the contrived reward function in favor of those other reward functions. And also why this framework considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can *predict* Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.
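The story above can be rendered as a toy simulation (the betting setup and numbers are my own illustrative construction, not from the post): a deterministic Bayes-optimal bettor is perfectly predictable, so a predictor can offer only bets she will lose, while a randomizing bettor loses only about half the time.

```python
# Toy Dutch-booking simulation. "Alice" deterministically bets on whichever
# outcome her posterior favors; "Omega" predicts this choice and only offers
# bets arranged so that she loses. A coin-flipping bettor is shown for contrast.
import random

random.seed(3)

def bayesian_side(heads_count, tails_count):
    """Alice's deterministic, Bayes-optimal (hence predictable) bet,
    with ties broken toward heads."""
    return "H" if heads_count >= tails_count else "T"

alice_wealth, rand_wealth = 0, 0
h, t = 0, 0
for _ in range(100):
    side = bayesian_side(h, t)
    # Omega arranges the bet so that Alice's predicted choice is the wrong one:
    outcome = "T" if side == "H" else "H"
    alice_wealth += 1 if side == outcome else -1
    # a randomizing bettor cannot be exploited this way in expectation
    rand_wealth += 1 if random.choice("HT") == outcome else -1
    h, t = h + (outcome == "H"), t + (outcome == "T")

print(f"deterministic Bayesian: {alice_wealth}")  # loses every single bet
print(f"randomizing bettor:     {rand_wealth}")
```

Nothing in Alice's updating helps her: as long as her choice is a deterministic function of her observations, Omega can always stay one step ahead.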

A possible counterargument is, we don't need to depart far from Bayesianis

...Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via *modifying the game* rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a *repeated* version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ

1

We can modify the population game setting to study superrationality. In order to do this, we can allow the agents to see a fixed-size finite portion of their opponents' histories. This should lead to superrationality for the same reasons I discussed before. More generally, we can probably allow each agent to submit a finite state automaton of limited size, s.t. the opponent history is processed by the automaton and the result becomes known to the agent.
What is unclear about this is how to define an analogous setting based on source code introspection. Arguably, seeing the entire history is equivalent to seeing the entire source code; seeing part of the history, or processing the history through a finite state automaton, might then be equivalent to some limited form of access to the source code, but I don't know how to define this limitation.
EDIT: Actually, the obvious analogue is processing the source code through a finite state automaton.

1

Instead of postulating access to a portion of the history or some kind of limited access to the opponent's source code, we can consider agents with full access to history / source code but finite memory. The problem is, an agent with fixed memory size usually cannot have regret going to zero, since it cannot store probabilities with arbitrary precision. However, it seems plausible that we can usually get learning with memory of size O(log(1/(1−γ))). This is because something like "counting pieces of evidence" should be sufficient. For example, if we consider finite MDPs, then it is enough to remember how many transitions of each type occurred to encode the belief state. The question is whether assuming O(log(1/(1−γ))) memory (or whatever is needed for learning) is enough to reach superrationality.
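The "counting pieces of evidence" idea can be illustrated with a toy finite MDP (the example, seed, and numbers are mine): the learner's entire belief state is recoverable from a table of transition counts, so that table is all the memory the learner needs.

```python
# Toy illustration: for a finite MDP with unknown transition probabilities,
# the vector of transition counts determines the posterior, so counting
# evidence suffices as the learner's memory.
import random
from collections import Counter

random.seed(0)
states, actions = [0, 1], [0, 1]
# hidden ground truth: P(next state = 1 | s, a)
true_p = {(s, a): random.random() for s in states for a in actions}

counts = Counter()          # counts[(s, a, s_next)] -- the learner's whole memory
s = 0
for _ in range(10000):
    a = random.choice(actions)
    s_next = 1 if random.random() < true_p[(s, a)] else 0
    counts[(s, a, s_next)] += 1
    s = s_next

def est(s, a):
    """Posterior mean of P(next=1|s,a) under a uniform (Laplace) prior,
    computed from counts alone."""
    n1, n0 = counts[(s, a, 1)], counts[(s, a, 0)]
    return (n1 + 1) / (n1 + n0 + 2)

for s in states:
    for a in actions:
        print(f"P(1|{s},{a}): true={true_p[(s, a)]:.2f} est={est(s, a):.2f}")
```

Storing each count to the precision needed for a γ-discounted horizon is what gives the O(log(1/(1−γ))) memory heuristic.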

1

What do you mean by equivalent? The entire history doesn't say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.

1

For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code.
The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can't get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size O(log(1/(1−γ)))?), you can (maybe?) prove convergence to Pareto efficient outcomes or some other strong desideratum that deserves to be called "superrationality".

1

In the previous "population game" setting, we assumed all players are "born" at the same time and learn synchronously, so that they always play against players of the same "age" (history length). Instead, we can consider a "mortal population game" setting where each player has a probability 1−γ to die on every round, and new players are born to replenish the dead. So, if the size of the population is N (we always consider the "thermodynamic" N→∞ limit), N(1−γ) players die and the same number of players are born on every round. Each player's utility function is a simple sum of rewards over time, so, taking mortality into account, effectively ey have geometric time discount. (We could use age-dependent mortality rates to get different discount shapes, or allow each type of player to have different mortality=discount rate.) Crucially, we group the players into games randomly, independent of age.
As before, each player type i chooses a policy πi:On→ΔAi. (We can also consider the case where players of the same type may have different policies, but let's keep it simple for now.) In the thermodynamic limit, the population is described as a distribution over histories, which now are allowed to be of variable length: μn∈ΔO∗. For each assignment of policies to player types, we get dynamics μn+1=Tπ(μn) where Tπ:ΔO∗→ΔO∗. So, as opposed to immortal population games, mortal population games naturally give rise to dynamical systems.
If we consider only the age distribution, then its evolution doesn't depend on π and it always converges to the unique fixed point distribution ζ(k)=(1−γ)γ^k. Therefore it is natural to restrict the dynamics to the subspace of ΔO∗ that corresponds to the age distribution ζ. We denote it P.
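The convergence of the age distribution to the geometric fixed point can be checked numerically (a toy verification of mine, with an arbitrary γ and a truncated age range):

```python
# Age dynamics of the mortal population game: each round a (1-γ) fraction of
# every cohort dies and is replaced by newborns at age 0. The distribution
# converges to the geometric fixed point ζ(k) = (1-γ)γ^k.
gamma = 0.9
K = 200                      # truncate ages at K; mass beyond K is negligible

mu = [1.0] + [0.0] * K       # start with the whole population at age 0

for _ in range(1000):
    mu = [1 - gamma] + [gamma * mu[k] for k in range(K)]

zeta = [(1 - gamma) * gamma ** k for k in range(K + 1)]
err = max(abs(m - z) for m, z in zip(mu, zeta))
print(f"max deviation from fixed point: {err:.2e}")
```

Note the convergence here is independent of the policies π, which is what justifies restricting the dynamics to the subspace P.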
Does the dynamics have fixed points? O∗ can be regarded as a subspace of (O⊔{⊥})ω. The latter is compact (in the product topology) by Tychonoff's theorem and Polish, but O∗ is not closed. So, w.r.t. the weak topology on probability measure spaces, Δ(O⊔{⊥})ω is also

Master post for alignment protocols.

Other relevant shortforms:

5

Precursor Detection, Classification and Assistance (PreDCA)
Infra-Bayesian physicalism provides us with two key building blocks:
* Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)
* Given a program, we can tell whether it is an agent, and if so, which utility function it has[1] (the "evaluating agent" section of the article).
I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:
* For each hypothesis in the prior, check which agents are precursors of our agent according to this hypothesis.
* Among the precursors, check whether some are definitely neither humans nor animals nor previously created AIs.
* If there are precursors like that, discard the hypothesis (it is probably a malign simulation hypothesis).
* If there are no precursors like that, decide which of them are humans.
* Follow an aggregate of the utility functions of the human precursors (conditional on the given hypothesis).
Detection
How to identify agents which are our agent's precursors? Let our agent be G and let H be another agent that exists in the universe according to hypothesis Θ[2]. Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual "H follows σ" to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn't run).
A possible complication is: what if Θ implies that H creates G / doesn't interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it. It is possible that any such Θ would have a sufficiently large description complexity penalty that it doesn't matter. On the other hand, if Θ is unconditionally Knightian-uncertain about H creating G then the utility will be upper bounded by the scenario in which G doesn't exist, which is liable to make Θ an effectively falsified hypothesis

2

A question that often comes up in discussion of IRL: are agency and values purely behavioral concepts, or do they depend on how the system produces its behavior? The cartesian measure of agency I proposed seems purely behavioral, since it only depends on the policy. The physicalist version seems less so since it depends on the source code, but this difference might be minor: the role of the source code here is merely telling the agent "where" it is in the universe. However, on closer examination, the physicalist g is far from purely behaviorist, and this is true even for cartesian Turing RL. Indeed, the policy describes not only the agent's interaction with the actual environment but also its interaction with the "envelope" computer. In a sense, the policy can be said to reflect the agent's "conscious thoughts".
This means that specifying an agent requires not only specifying its source code but also the "envelope semantics" C (possibly we also need to penalize for the complexity of C in the definition of g). Identifying that an agent exists requires not only that its source code is running, but also, at least, that its history h is C-consistent with the α∈2Γ variable of the bridge transform. That is, for any y∈α we must have dCy for some destiny d⊐h. In other words, we want any computation the agent ostensibly runs on the envelope to be one that is physically manifest (it might be that this condition isn't sufficiently strong, since it doesn't seem to establish a causal relation between the manifesting and the agent's observations, but it's at least necessary).
Notice also that the computational power of the envelope implied by C becomes another characteristic of the agent's intelligence, together with g as a function of the cost of computational resources. It might be useful to come up with natural ways to quantify this power.

2

Can you please explain how does this not match the definition? I don't yet understand all the math, but intuitively, if H creates G / doesn't interfere with the creation of G, then if H instead followed policy "do not create G/ do interfere with the creation of G", then G's code wouldn't run?
Can you please give an example of a precursor that does match the definition?

2

The problem is that if Θ implies that H creates G but you consider a counterfactual in which H doesn't create G then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it "hard counterfactuals") only makes sense when the condition you're counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in general). In a child post I suggested solving this by defining "soft counterfactuals" where you consider coarsenings of Θ in addition to Θ itself.

2

Here's a video of a talk I gave about PreDCA.

2

Two more remarks.
User Detection
It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.
Given two agents G and H, we can ask which points on G's timeline are in the causal past of which points on H's timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval) on G's timeline, and measure the mutual information between this action (or actions) and H's observations at some interval on H's timeline.
Using this, we can effectively construct a future "causal cone" emanating from the AI's origin, and also a past causal cone emanating from some time t on the AI's timeline. Then, "nearby" agents will meet the intersection of these cones for low values of t whereas "faraway" agents will only meet it for high values of t or not at all. To first approximation, the user would be the "nearest" precursor[1] agent i.e. the one meeting the intersection for the minimal t.
More precisely, we expect the user's observations to have nearly maximal mutual information with the AI's actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI's sensors measure every nerve signal emanating from the user's brain? To address this, we can fix t to a value s.t. we expect only the user to meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold.
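The mutual-information criterion can be illustrated with a toy model (entirely my own construction: a noisy binary channel stands in for causal "distance", with the noise levels chosen arbitrarily):

```python
# Toy user detection: the AI takes random binary actions; each candidate agent
# observes them through a noisy channel. The "nearest" agent (the user, who
# e.g. watches the display) has the highest mutual information with the actions.
import math
import random
from collections import Counter

random.seed(1)

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def observed(action, noise):
    """A candidate's view of the action: flipped with probability `noise`."""
    return action ^ (random.random() < noise)

samples_user, samples_far = [], []
for _ in range(20000):
    a = random.randint(0, 1)
    samples_user.append((a, observed(a, 0.05)))  # user: low-noise channel
    samples_far.append((a, observed(a, 0.45)))   # faraway agent: nearly decoupled

print(f"MI with user:          {mutual_information(samples_user):.3f} bits")
print(f"MI with faraway agent: {mutual_information(samples_far):.3f} bits")
```

Selecting the agent that clears the highest mutual-information threshold corresponds, in this toy, to picking the candidate with the least noisy channel.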
This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.
More on Counterfactuals
In the parent post I suggested "instead of examining only Θ we also examine coarsenings of Θ which a

4

Causality in IBP
There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis Θ∈□(Γ×Φ), for Γ=ΣR, we consider its bridge transform B∈□(Γ×2Γ×Φ). Given some subset of programs Q⊆R we can define Δ:=ΣQ then project B to BΔ∈□(Γ×2Δ)[1]. We can then take bridge transform again to get some C∈□(Γ×2Γ×2Δ). The 2Γ factor now tells us which programs causally affect the manifestation of programs in Q. Notice that by Proposition 2.8 in the IBP article, when Q=R we just get all programs that are running, which makes sense.
Agreement Rules Out Mesa-Optimization
The version of PreDCA without any explicit malign hypothesis filtering might be immune to malign hypotheses, and here is why. It seems plausible that IBP admits an agreement theorem (analogous to Aumann's) which informally amounts to the following: Given two agents Alice and Bobcat that (i) share the same physical universe, (ii) have a sufficiently tight causal relationship (each can see what the other sees), (iii) have unprivileged locations inside the physical universe, (iv) start from similar/compatible priors and (v) [maybe needed?] similar utility functions, they converge to similar/compatible beliefs, regardless of the complexity of translation between their subjective viewpoints. This is plausible because (i) as opposed to the cartesian framework, different bridge rules don't lead to different probabilities and (ii) if Bobcat considers a simulation hypothesis plausible, and the simulation is sufficiently detailed to fool it indefinitely, then the simulation contains a detailed simulation of Alice and hence Alice must also consider this to be a plausible hypothesis.
If the agreement conjecture is true, then the AI will converge to hypotheses that all contain the user, in a causal relationship with the AI that affirms them as the user. Moreover, those hypotheses will be compatible with the user's own posterior (i.e. the differe

1

Hi Vanessa! Thanks again for your previous answers. I've got one further concern.
Are all mesa-optimizers really only acausal attackers?
I think mesa-optimizers don't need to be purely contained in a hypothesis (rendering them acausal attackers), but can be made up of a part of the hypotheses-updating procedures (maybe this is obvious and you already considered it).
Of course, since the only way to change the AGI's actions is by changing its hypotheses, even these mesa-optimizers will have to alter hypothesis selection. But their whole running program doesn't need to be captured inside any hypothesis (which would be easier for classifying acausal attackers away).
That is, if we don't think about how the AGI updates its hypotheses, and just consider them magically updating (without any intermediate computations), then of course, the only mesa-optimizers will be inside hypotheses. If we actually think about these computations and consider a brute-force search over all hypotheses, then again they will only be found inside hypotheses, since the search algorithm itself is too simple and provides no further room for storing a subagent (even if the mesa-optimizer somehow takes advantage of the details of the search). But if more realistically our AGI employs more complex heuristics to ever-better approximate optimal hypotheses update, mesa-optimizers can be partially or completely encoded in those (put another way, those non-optimal methods can fail / be exploited). This failure could be seen as a capabilities failure (in the trivial sense that it failed to correctly approximate perfect search), but I think it's better understood as an alignment failure.
The way I see PreDCA (and this might be where I'm wrong) is as an "outer top-level protocol" which we can fit around any superintelligence of arbitrary architecture. That is, the superintelligence will only have to carry out the hypotheses update (plus some trivial calculations over hypotheses to find the best

3

First, no, the AGI is not going to "employ complex heuristics to ever-better approximate optimal hypotheses update". The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability. Just like we can prove that e.g. SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).
Second, there's the issue of non-cartesian attacks ("hacking the computer"). Assuming that the core computing unit is not powerful enough to mount a non-cartesian attack on its own, such attacks can arguably be regarded as detrimental side-effects of running computations on the envelope. My hope is that we can shape the prior about such side-effects in some informed way (e.g. the vast majority of programs won't hack the computer) s.t. we still have approximate learnability (i.e. the system is not too afraid to run computations) without misspecification (i.e. the system is not overconfident about the safety of running computations). The more effort we put into hardening the system, the easier it should be to find such a sweet spot.
Third, I hope that the agreement solution will completely rule out any undesirable hypothesis, because we will have an actual theorem that guarantees it. What are the exact assumptions going to be and what needs to be done to make sure these assumptions hold is work for the future, ofc.

2

Some additional thoughts.
Non-Cartesian Daemons
These are notoriously difficult to deal with. The only methods I know of that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.
Weaknesses
My main concerns with this approach are:
* The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent interactions in particular is required to gain sufficient confidence.
* The feasibility of a good enough classifier. At present, I don't have a concrete plan for attacking this, as it requires inputs from outside of computer science.
* Inherent "incorrigibility": once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won't defer to the users any more than the resulting utility function demands. On the other hand, I think the concept of corrigibility is underspecified so much that I'm not sure it is solved (rather than dissolved) even in the Book. Moreover, the concern can be ameliorated by sufficiently powerful interpretability tools. It is therefore desirable to think more of how to achieve interpretability in this context.

3

There's a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).
First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn't contain terms such as "oh btw don't kill people while you're building the nanosystem". However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the "relative difficulty assumption" (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.
We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relative to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relative to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
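A minimal quantilizer sketch (the construction and toy task are mine, not from the post): sample uniformly from the top q-fraction of a base distribution, ranked by the proxy reward. Small q optimizes the proxy hard; the autocalibration idea in the text amounts to tuning q gradually until a performance goal is met.

```python
# Quantilization over a toy action space: actions are numbers in [0, 1] and the
# proxy reward peaks at 0.7. Shrinking q pushes average proxy reward up.
import random

random.seed(2)

base = [random.random() for _ in range(10000)]   # base distribution over actions

def proxy_reward(x):
    return -abs(x - 0.7)

def quantilizer_samples(q, n=1000):
    """Draw n actions uniformly from the top q-fraction by proxy reward."""
    ranked = sorted(base, key=proxy_reward, reverse=True)
    top = ranked[: max(1, int(q * len(ranked)))]
    return [random.choice(top) for _ in range(n)]

results = {}
for q in (1.0, 0.1, 0.01):
    picks = quantilizer_samples(q)
    results[q] = sum(proxy_reward(p) for p in picks) / len(picks)
    print(f"q={q}: average proxy reward {results[q]:.4f}")
```

In the protocol, q would start large (safe but weak) and be decreased only until the formally specified performance goal is reached, never further.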
Notice that hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That i

*Probably not too original but I haven't seen it clearly written anywhere*.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

**Amplifying by objective time:** The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for "malign AI is attacking", but that still leaves us open to surprise takeovers in which there is no chance to press the button.

**Amplifying by subjective time:** The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a "groundhog day" scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than...
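The "groundhog day" scheme can be sketched schematically (the skeleton and the toy deliberation step are mine; a real instantiation would have a learned predictor in place of `short_think`):

```python
# Amplification by subjective time: each cycle, the simulated humans think
# briefly, starting from the written legacy of the previous cycle.
def amplify_by_subjective_time(short_think, problem, n_cycles):
    legacy = ""                      # written notes passed between cycles
    for _ in range(n_cycles):
        legacy = short_think(problem, legacy)
    return legacy

# toy stand-in for one cycle of brief deliberation: one bisection step
# toward sqrt(target); each step alone is weak, iteration compounds
def short_think(target, legacy):
    lo, hi = map(float, legacy.split()) if legacy else (0.0, 2.0)
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if mid * mid < target else (lo, mid)
    return f"{lo} {hi}"

final = amplify_by_subjective_time(short_think, 2.0, 50)
lo, hi = map(float, final.split())
print(f"sqrt(2) is in [{lo}, {hi}]")
```

The safety-relevant feature is that each cycle only simulates a short stretch of subjective time, so malign future influences have less room to enter any single prediction than in amplification by objective time.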

2

I think this would make a good top-level post. I have the feeling I’ll want to link to it later.

1

I retracted part of that, see the edit.

1

More observations about this attack vector ("attack from counterfactuals"). I focus on "amplifying by subjective time".
* The harder the takeoff the more dangerous this attack vector: During every simulation cycle, the ability to defend against simulated malign AI depends on the power of the defense system at the beginning of the cycle[1]. On the other hand, the capability of the attacker depends on its power at the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.
* Inner control of anchor makes system safer: Given a fixed objective time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective time anchor forward in time, in order to benefit from improvements in the defense system.
* Additional information about the external world makes the system safer: Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 63%. However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not, the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.
----------------------------------------
1. More precisely, it is somewha

5

Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.
In the following, all infradistributions are crisp.
Fix finite action set A and finite observation set O. For any k∈N and γ∈(0,1), let
$M^k_\gamma : (A\times O)^\omega \to \Delta\big((A\times O)^k\big)$
be defined by
$M^k_\gamma(h \mid d) := (1-\gamma)\sum_{n=0}^{\infty} \gamma^n \,[[h = d_{n:n+k}]]$
In other words, this kernel samples a time step n out of the geometric distribution with parameter γ, and then produces the sequence of length k that appears in the destiny starting at n.
For any continuous[1] function D:□(A×O)k→R, we get a decision rule. Namely, this rule says that, given infra-Bayesian law Λ and discount parameter γ, the optimal policy is
$\pi^*_{D\Lambda} := \mathop{\mathrm{argmax}}_{\pi : O^* \to A} D\big((M^k_\gamma)_* \Lambda(\pi)\big)$
The usual maximin is recovered when we have some reward function $r : (A\times O)^k \to \mathbb{R}$, and the decision rule corresponding to it is
$D_r(\Theta) := \min_{\theta \in \Theta} \mathbb{E}_\theta[r]$
A set H of laws is said to be learnable w.r.t. D when there is a family of policies $\{\pi_\gamma\}_{\gamma\in(0,1)}$ such that for any Λ∈H
$\lim_{\gamma \to 1}\Big(\max_\pi D\big((M^k_\gamma)_* \Lambda(\pi)\big) - D\big((M^k_\gamma)_* \Lambda(\pi_\gamma)\big)\Big) = 0$
For Dr we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any t∈[0,1] we have the learnable decision rule
$D^t_r := t \max_{\theta \in \Theta} \mathbb{E}_\theta[r] + (1-t) \min_{\theta \in \Theta} \mathbb{E}_\theta[r]$
This is the "mesomism" I taked about before.
Also, any monotonically increasing D seems to be learnable, i.e. any D s.t. for Θ1⊆Θ2 we have D(Θ1)≤D(Θ2). For such decision rules, you can essentially assume that "nature" (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.
On the other hand, decision rules of the form Dr1+Dr2 are not learnable in general, and neither are decision rules of the form Dr+D′ for D′ monotonically increasing.
Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing?
A positive answer to the above would provide interesting generaliz
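For concreteness, here is a minimal sketch of evaluating the decision rules Dr and Dtr on a crisp infradistribution represented by the vertices of its credal set (the distributions, rewards and t below are made up for illustration):

```python
# D^t_r for a crisp infradistribution given by the vertices of its
# credal set: t = 0 is the usual maximin rule D_r, t = 1 is "max-max",
# and intermediate t gives the "mesomism" family.
def D_t(vertices, r, t):
    # vertices: list of probability distributions (lists of floats)
    # r: reward function as a list of floats over outcomes
    expectations = [sum(p_i * r_i for p_i, r_i in zip(p, r)) for p in vertices]
    return t * max(expectations) + (1 - t) * min(expectations)

credal_set = [[1.0, 0.0], [0.5, 0.5]]  # two vertices over two outcomes
reward = [0.0, 1.0]
print(D_t(credal_set, reward, 0.0))  # maximin: 0.0
print(D_t(credal_set, reward, 0.5))  # mesomism with t = 1/2: 0.25
```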

4

In the anthropic trilemma, Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.
Consider a population of robots, each of which is a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the "physical" baseline. Indeed, suppose that a particular observation c has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have c much more often in their history than the physical frequency with which c is encountered, due to the resulting "selection bias". We call this setting "anthropic RL" (ARL).
The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the "physical" perspective. For example, we can consider an "anthropic MDP" (AMDP). An AMDP has finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗). The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent.
[EDIT: Actually, given that we don't care about the order of robots, the signature of the transition kernel should be T:A×S→ΔNS]
The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausa

1

Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a "memory MDP" that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.

1

I'm not sure what you mean by that "unrolling". Can you write a mathematical definition?
Let's consider a simple example. There are two states: s0 and s1. There is just one action, so we can ignore it. s0 is the initial state. An s0 robot transitions into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What will our population look like?
0th step: all robots remember s0
1st step: all robots remember s0s1
2nd step: 1/2 of robots remember s0s1s0 and 1/2 of robots remember s0s1s1
3rd step: 1/3 of robots remember s0s1s0s1, 1/3 of robots remember s0s1s1s0 and 1/3 of robots remember s0s1s1s1
There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have Pr[s0s1s0]=1/2 and Pr[s0s1s1]=1/2. But, to be consistent with step 3 we must have Pr[s0s1s0]=1/3 and Pr[s0s1s1]=2/3.
In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.
Or, at least it's not obvious there is such a hypothesis. In this example, the ratio Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But do all probabilities converge fast enough for learning to happen, in general? I don't know, maybe for finite state spaces it can work. Would definitely be interesting to check.
[EDIT: actually, in this example there is such a hypothesis but in general there isn't, see below]
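The population dynamics above are easy to simulate; a quick sketch checking the golden-ratio claim:

```python
# Simulating the robot population above: an s0 robot becomes an s1
# robot; an s1 robot becomes one s0 robot and one s1 robot. We track
# full histories and check that the ratio of robots whose history
# begins with s0 s1 s1 to those beginning with s0 s1 s0 approaches
# the golden ratio.
def step(robots):
    new = []
    for h in robots:
        if h[-1] == "s0":
            new.append(h + ("s1",))
        else:  # s1 robot: one s0 successor and one s1 successor
            new.append(h + ("s0",))
            new.append(h + ("s1",))
    return new

robots = [("s0",)]
for _ in range(20):
    robots = step(robots)

n_011 = sum(h[:3] == ("s0", "s1", "s1") for h in robots)
n_010 = sum(h[:3] == ("s0", "s1", "s0") for h in robots)
print(n_011 / n_010)  # ≈ 1.618..., the golden ratio
```

The population sizes follow the Fibonacci sequence, which is why the ratio converges to the golden ratio.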

1

Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0], [s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].
(Skip this paragraph unless the specifics of what's going on aren't obvious: given a transition distribution P(s′∗|s,π) (P being the distribution over sets of states s'* given starting state s and policy π), we can define the memory transition distribution P(s′∗m|sm,π) given policy π and starting "memory state" sm∈S∗ (Note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of sm into the transition distribution as the current state. Then for each s′∗ in the domain, for each element in s′∗ we concatenate that element onto the end of sm and collect these s′m into a set s′∗m, which is assigned the same probability P(s′∗).)
So now at time t=2, if you sample a robot, the probability that its state begins with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it's just that we've turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.
I'm still confused about what you mean by "Bayesian hypothesis" though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?

1

I'm not quite sure what you are trying to say here; probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It's like in classical RL theory, when you're proving a regret bound or whatever, your probability space consists of histories.
Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I'm talking about hypotheses which conform to the classical "cybernetic agent model". If you wish, we can call it "Bayesian cybernetic hypothesis".
Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.
Indeed, consider some T:S→ΔNS. We can take its expected value to get ET:S→RS+. Assuming the chain is communicating, ET is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector η∈RS+. We then get the subjective transition kernel:
$ST(t \mid s) = \frac{ET(t \mid s)\, \eta_t}{\sum_{t' \in S} ET(t' \mid s)\, \eta_{t'}}$
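A minimal numerical sketch of this construction, using the two-state one-action chain from the earlier comment (s0 → s1; s1 → s0 and s1), with the Perron eigenvector found by power iteration:

```python
# ET is the expected-copies matrix of the chain; we find its maximal
# (Perron) eigenvector eta by power iteration, then normalize
# ET(t|s) * eta_t into the subjective transition kernel ST(t|s).
ET = [[0.0, 1.0],   # row s0: expected number of copies in (s0, s1)
      [1.0, 1.0]]   # row s1

eta = [1.0, 1.0]
for _ in range(200):  # power iteration: eta <- ET @ eta, renormalized
    eta = [sum(ET[s][t] * eta[t] for t in range(2)) for s in range(2)]
    norm = sum(eta)
    eta = [x / norm for x in eta]

def ST(t, s):
    weights = [ET[s][u] * eta[u] for u in range(2)]
    return weights[t] / sum(weights)

print(ST(1, 1))  # ≈ 0.618 = 1/phi: subjective probability s1 -> s1
```

Consistent with the golden-ratio behavior of this chain: the subjective kernel weighs each successor state by its long-run share of descendants.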
Now, consider the following example of an AMDP. There are three actions A:={a,b,c} and two states S:={s0,s1}. When we apply a to an s0 robot, it creates two s0 robots, whereas when we apply a to an s1 robot, it leaves one s1 robot. When we apply b to an s1 robot, it creates two s1 robots, whereas when we apply b to an s0 robot, it leaves one s0 robot. When we apply c to any robot, it results in one robot whose state is s0 with probability 1/2 and s1 with probability 1/2.
Consider the following two policies. πa takes the sequence of actions cacaca… and πb takes the sequence of actions cbcbcb…. A population that follo

1

Ah, okay, I see what you mean. Like how preferences are divisible into "selfish" and "worldly" components, where the selfish component is what's impacted by a future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to "worldly" and just sum the reward calculated at individual timesteps, or analogous to "selfish" and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn't understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you're talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions - so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).

1

AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are "selfish": there is a reward function r:(S×A)∗→R which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the "unrolling" you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)
For a more complete framework, we can use an ontology chain, but (i) instead of A×O labels use A×M labels, where M is the set of possible memory states (a policy is then described by π:M→A), to allow for agents that don't fully trust their memory (ii) consider another chain with a bigger state space S′ plus a mapping p:S′→NS s.t. the transition kernels are compatible. Here, the semantics of p(s) is: the multiset of ontological states resulting from interpreting the physical state s by taking the viewpoints of the different agents that s contains.
I didn't understand "no actual agent in the information-state that corresponds to having those probabilities". What does it mean to have an agent in the information-state?

1

Nevermind, I think I was just looking at it with the wrong class of reward function in mind.

3

Infra-Bayesian physicalism is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is. (a.k.a. "corrigibility is anti-natural to consequentialist reasoning"). Specifically, alignment protocols that don't rely on value learning become vastly less safe when combined with IBP:
* Example 1: Using steep time discount to disincentivize dangerous long-term plans. For IBP, "steep time discount" just means predominantly caring about your source code running with particular short inputs. Such a goal strongly incentivizes the usual convergent instrumental goals: first take over the world, then run your source code with whatever inputs you want. IBP agents just don't have time discount in the usual sense: a program running late in physical time is just as good as one running early in physical time.
* Example 2: Debate. This protocol relies on a zero-sum game between two AIs. But, the monotonicity principle rules out the possibility of zero-sum! (If L and −L are both monotonic loss functions then L is a constant). So, in a "debate" between IBP agents, they cooperate to take over the world and then run the source code of each debater with the input "I won the debate".
* Example 3: Forecasting/imitation (an IDA in particular). For an IBP agent, the incentivized strategy is: take over the world, then run yourself with inputs showing you making perfect forecasts.
The conclusion seems to be that it is counterproductive to use IBP to solve the acausal attack problem for most protocols. Instead, you need to do PreDCA or something similar. And, if acausal attack is a serious problem, then approaches that don't do value learning might be doomed.
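The parenthetical claim in Example 2 follows from a short argument. A sketch, assuming "monotonic" means (say) non-increasing w.r.t. the partial order on states (the direction is immaterial):

```latex
% If L and -L are both monotonic (non-increasing) loss functions, then
% for any comparable pair of states:
x \preceq y \;\Longrightarrow\; L(x) \ge L(y)
\ \text{and}\ {-L(x)} \ge {-L(y)}
\;\Longrightarrow\; L(x) = L(y).
% Hence L is constant along every chain, and therefore constant on
% each comparability-connected component of the order.
```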

2

Master post for ideas about metacognitive agents.

2

Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to "plain" IBRL regret bounds when we consider the core and the envelope as the "inside" of the agent.
Assume that the action and observation sets factor as A=A0×A1 and O=O0×O1, where (A0,O0) is the interface with the external environment and (A1,O1) is the interface with the envelope.
Let Λ:Π→□(Γ×(A×O)ω) be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:
* Marginalizing over Γ. That is, let pr−Γ:Γ×(A×O)ω→(A×O)ω and pr0:(A×O)ω→(A0×O0)ω be the projections. Then, we have the law Λ?:=(pr0pr−Γ)∗∘Λ.
* Assuming "logical omniscience". That is, let τ∗∈Γ be the ground truth. Then, we have the law Λ!:=pr0∗(Λ∣τ∗). Here, we use the conditional defined by Θ∣A:={θ∣A∣θ∈argmaxΘPr[A]}. It's easy to see this indeed defines a law.
However, low regret w.r.t. either of these is not equivalent to low regret w.r.t. Λ:
* Learning Λ? is typically no less feasible than learning Λ, however it is a much weaker condition. This is because the metacognitive agents can use policies that query the envelope to get higher guaranteed expected utility.
* Learning Λ! is a much stronger condition than learning Λ, however it is typically infeasible. Requiring it leads to AIXI-like agents.
Therefore, metacognitive regret bounds hit a "sweet spot" of strength vs. feasibility which produces genuinely more powerful agents than IBRL[1].
1. ^
More precisely, more powerful than IBRL with the usual sort of hypothesis classes (e.g. nicely structured crisp infra-RDPs). In principle, we can reduce metacognitive regret bounds to IBRL regret bounds using non-crisp laws, since there's a very general theorem for representing desiderata as laws. But these laws would have a very peculiar form that seems impossible to guess without starting with metacognitive agents.

2

Formalizing the richness of mathematics
Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Godel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.
Here is my proposal for how to formulate a theorem that would make this idea rigorous.
(Wrong) First Attempt
Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis Θ represents an infradistribution over Γ: the "space of counterpossible computational universes". We can say that Θ is a "true hypothesis" when there is some θ in the credal set Θ (a distribution over Γ) s.t. the ground truth Υ∗∈Γ "looks" as if it's sampled from θ. The latter should be formalizable via something like a computationally bounded version of Martin-Löf randomness.
We can now try to say that Υ∗ is "rich" if for any true hypothesis Θ, there is a refinement Ξ⊆Θ which is also a true hypothesis and "knows" at least one bit of information that Θ doesn't, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes Υ∗. But, it's also completely boring: the required Ξ can be constructed by "hardcoding" an additional fact into Θ. This doesn't look like "discovering interesting structure", but rather just like brute-force memorization.
(Wrong) Second Attempt
What if instead we require that Ξ knows infinitely many bits of information that Θ doesn't? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be exp

2

Recording of a talk I gave in VAISU 2023.

2

Here is the sketch of a simplified model for how a metacognitive agent deals with traps.
Consider some (unlearnable) prior ζ over environments, s.t. we can efficiently compute the distribution ζ(h) over observations given any history h. For example, any prior over a small set of MDP hypotheses would qualify. Now, for each h, we regard ζ(h) as a "program" that the agent can execute and form beliefs about. In particular, we have a "metaprior" ξ consisting of metahypotheses: hypotheses-about-programs.
For example, if we let every metahypothesis be a small infra-RDP satisfying appropriate assumptions, we probably have an efficient "metalearning" algorithm. More generally, we can allow a metahypothesis to be a learnable mixture of infra-RDPs: for instance, there is a finite state machine for specifying "safe" actions, and the infra-RDPs in the mixture guarantee no long-term loss upon taking safe actions.
In this setting, there are two levels of learning algorithms:
* The metalearning algorithm, which learns the correct infra-RDP mixture. The flavor of this algorithm is RL in a setting where we have a simulator of the environment (since we can evaluate ζ(h) for any h). In particular, here we don't worry about exploitation/exploration tradeoffs.
* The "metacontrol" algorithm, which given an infra-RDP mixture, approximates the optimal policy. The flavor of this algorithm is "standard" RL with exploitation/exploration tradeoffs.
In the simplest toy model, we can imagine that metalearning happens entirely in advance of actual interaction with the environment. More realistically, the two need to happen in parallel. It is then natural to apply metalearning to the current environmental posterior rather than the prior (i.e. the histories starting from the history that already occurred). Such an agent satisfies "opportunistic" guarantees: if at any point of time, the posterior admits a useful metahypothesis, the agent can exploit this metahypothesis. Thus, we address both

2

Jobst Heitzig asked me whether infra-Bayesianism has something to say about the absent-minded driver (AMD) problem. Good question! Here is what I wrote in response:

2

The following was written by me during the "Finding the Right Abstractions for healthy systems" research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.
Here's an elegant diagrammatic notation for constructing new infrakernels out of given infrakernels. There is probably some natural category-theoretic way to think about it, but at present I don't know what it is.
By “infrakernel” we will mean a continuous mapping of the form X→□Y, where X and Y are compact Polish spaces and □Y is the space of credal sets (i.e. closed convex sets of probability distributions) over Y.
Syntax
* The diagram consists of child vertices, parent vertices, squiggly lines, arrows, dashed arrows and slashes.
* There can be solid arrows incoming into the diagram. Each such arrow a is labeled by a compact Polish space D(a) and ends on a parent vertex t(a). And, s(a)=⊥ (i.e. the arrow has no source vertex).
* There can be dashed and solid arrows between vertices. Each such arrow a starts from a child vertex s(a) and ends on a parent vertex t(a). We require that P(s(a))≠t(a) (i.e. they should not be also connected by a squiggly line).
* There are two types of vertices: parent vertices (denoted by a letter) and child vertices (denoted by a letter or number in a circle).
* Each child vertex v is labeled by a compact Polish space D(v) and connected (by a squiggly line) to a unique parent vertex P(v). It may or may not be crossed-out by a slash.
* Each parent vertex p is labeled by an infrakernel Kp with source S1×…×Sk and target T1×…×Tl where each Si corresponds to a solid arrow a with t(a)=p and each Tj is D(v) for some child vertex v with P(v)=p. We can also add squares with numbers where solid arrows end to keep track of the correspondence between the arguments of Kp and the arrows.
* If s(a)=⊥ then the corresponding Si is D(a).
* If s(a)=v≠⊥ then the corresponding Si is D(v).
Semantics
* Every diagram D represents an infra

2

Master post for ideas about infra-Bayesian physicalism.
Other relevant posts:
* Incorrigibility in IBP
* PreDCA alignment protocol

4

Here is a modification of the IBP framework which removes the monotonicity principle, and seems to be more natural in other ways as well.
First, let our notion of "hypothesis" be Θ∈□c(Γ×2Γ). The previous framework can be interpreted in terms of hypotheses of this form satisfying the condition
prΓ×2ΓBr(Θ)=Θ
(See Proposition 2.8 in the original article.) In the new framework, we replace it by the weaker condition
Br(Θ)⊇(idΓ×diag2Γ)∗Θ
This can be roughly interpreted as requiring that (i) whenever the output of a program P determines whether some other program Q will run, program P has to run as well (ii) whenever programs P and Q are logically equivalent, program P runs iff program Q runs.
The new condition seems to be well-justified, and is also invariant under (i) mixing hypotheses (ii) taking joins/meets of hypotheses. The latter was not the case for the old condition. Moreover, it doesn't imply that Θ is downward closed, and hence there is no longer a monotonicity principle[1].
The next question is, how do we construct hypotheses satisfying this condition? In the old framework, we could construct hypotheses of the form Ξ∈□c(Γ×Φ) and then apply the bridge transform. In particular, this allows a relatively straightforward translation of physics theories into IBP language (for example our treatment of quantum theory). Luckily, there is an analogous construction in the new framework as well.
First notice that our new condition on Θ can be reformulated as requiring that
* suppΘ⊆elΓ
* For any s:Γ→Γ define τs:ΔcelΓ→ΔcelΓ by τsθ:=χelΓ⋅(s×id2Γ)∗θ. Then, we require τsΘ⊆Θ.
For any Φ, we also define τΦs:Δc(elΓ×Φ)→Δc(elΓ×Φ) by
τΦsθ:=χelΓ×Φ⋅(s×id2Γ×Φ)∗θ
Now, for any Ξ∈□c(Γ×Φ), we define the "conservative bridge transform[2]" CBr(Ξ)∈□c(Γ×2Γ×Φ) as the closure of all τΦsθ where θ is a maximal element of Br(Ξ). It is then possible to see that Θ∈□c(Γ×2Γ) is a valid hypothesis if and only if it is of the form prΓ×2ΓCBr(Ξ) for some Φ and Ξ∈□c(Γ×Φ).
1. ^
I still thi

2

Physicalist agents see themselves as inhabiting an unprivileged position within the universe. However, it's unclear whether humans should be regarded as such agents. Indeed, monotonicity is highly counterintuitive for humans. Moreover, historically human civilization struggled a lot with accepting the Copernican principle (and is still confused about issues such as free will, anthropics and quantum physics which physicalist agents shouldn't be confused about). This presents a problem for superimitation.
What if humans are actually cartesian agents? Then, it makes sense to consider a variant of physicalist superimitation where instead of just seeing itself as unprivileged, the AI sees the user as a privileged agent. We call such agents "transcartesian". Here is how this can be formalized as a modification of IBP.
In IBP, a hypothesis is specified by choosing the state space Φ and the belief Θ∈□(Γ×Φ). In the transcartesian framework, we require that a hypothesis is augmented by a mapping τ:Φ→(A0×O0)≤ω, where A0 is the action set of the reference agent (user) and O0 is the observation set of the reference agent. Given G0 the source code of the reference agent, we require that Θ is supported on the set
{(y,x)∈Γ×Φ∣∣ha⊑τ(x)⟹a=Gy0(h)}
That is, the actions of the reference agent are indeed computed by the source code of the reference agent.
Now, instead of using a loss function of the form L:elΓ→R, we can use a loss function of the form L:(A0×O0)≤ω→R which doesn't have to satisfy any monotonicity constraint. (More generally, we can consider hybrid loss functions of the form L:(A0×O0)≤ω×elΓ→R monotonic in the second argument.) This can also be generalized to reference agents with hidden rewards.
As opposed to physicalist agents, transcartesian agents do suffer from penalties associated with the description complexity of bridge rules (for the reference agent). Such an agent can (for example) come to believe in a simulation hypothesis that is unlikely from a physicalist

2

Up to light editing, the following was written by me during the "Finding the Right Abstractions for healthy systems" research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.
In order to allow R (the set of programs) to be infinite in IBP, we need to define the bridge transform for infinite Γ. At first, it might seem Γ can be allowed to be any compact Polish space, and the bridge transform should only depend on the topology on Γ, but that runs into problems. Instead, the right structure on Γ for defining the bridge transform seems to be that of a "profinite field space": a category I came up with that I haven't seen in the literature so far.
The category PFS of profinite field spaces is defined as follows. An object F of PFS is a set ind(F) and a family of finite sets {Fα}α∈ind(F). We denote Tot(F):=∏αFα. Given F and G objects of PFS, a morphism from F to G is a mapping f:Tot(F)→Tot(G) such that there exists R⊆ind(F)×ind(G) with the following properties:
* For any α∈ind(F), the set R(α):={β∈ind(G)∣(α,β)∈R} is finite.
* For any β∈ind(G), the set R−1(β):={α∈ind(F)∣(α,β)∈R} is finite.
* For any β∈ind(G), there exists a mapping fβ:∏α∈R−1(β)Fα→Gβ s.t. for any x∈Tot(F), f(x)β:=fβ(prRβ(x)) where prRβ:Tot(F)→∏α∈R−1(β)Fα is the projection mapping.
The composition of PFS morphisms is just the composition of mappings.
It is easy to see that every PFS morphism is a continuous mapping in the product topology, but the converse is false. However, the converse is true for objects with finite ind (i.e. for such objects any mapping is a morphism). Hence, an object F in PFS can be thought of as Tot(F) equipped with additional structure that is stronger than the topology but weaker than the factorization into Fα.
The name "field space" is inspired by the following observation. Given F an object of PFS, there is a natural condition we can impose on a Borel probability distribution on Tot(F) which makes it a “Markov random field” (MRF). Specifi

2

Infradistributions admit an information-theoretic quantity that doesn't exist in classical theory. Namely, it's a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:
Let X be a finite set and Θ a crisp infradistribution (credal set) on X, i.e. a closed convex subset of ΔX. Then, imagine someone trying to communicate a message by choosing a distribution out of Θ. Formally, let Y be any other finite set (space of messages), θ∈ΔY (prior over messages) and K:Y→Θ (communication protocol). Consider the distribution η:=θ⋉K∈Δ(Y×X). Then, the information capacity of the protocol is the mutual information between the projection on Y and the projection on X according to η, i.e. Iη(prX;prY). The "Knightian entropy" of Θ is now defined to be the maximum of Iη(prX;prY) over all choices of Y, θ, K. For example, if Θ is Bayesian then it's 0, whereas if Θ=⊤X, it is ln|X|.
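A minimal illustration of the definition: for Θ=⊤X with |X|=2, the protocol Y=X, uniform θ and deterministic K(y)=δy is admissible (each δy lies in ⊤X) and attains mutual information ln 2 = ln|X|, the stated maximum:

```python
import math

# Mutual information of a finite joint distribution over Y x X,
# represented as a dict (y, x) -> probability.
def mutual_information(joint):
    py, px = {}, {}
    for (y, x), p in joint.items():
        py[y] = py.get(y, 0.0) + p
        px[x] = px.get(x, 0.0) + p
    return sum(p * math.log(p / (py[y] * px[x]))
               for (y, x), p in joint.items() if p > 0)

# eta = theta ⋉ K with theta uniform on {0, 1} and K(y) = delta_y:
eta = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(eta))  # ln 2 ≈ 0.693
```

For a Bayesian Θ (a single distribution), every protocol forces the X-marginal to be that one distribution independently of the message, so the mutual information, and hence the Knightian entropy, is 0.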
Here is one application[1] of this concept, orthogonal to infra-Bayesianism itself. Suppose we model inner alignment by assuming that some portion ϵ of the prior ζ consists of malign hypotheses. And we want to design e.g. a prediction algorithm that will converge to good predictions without allowing the malign hypotheses to attack, using methods like confidence thresholds. Then we can analyze the following metric for how unsafe the algorithm is.
Let O be the set of observations and A the set of actions (which might be "just" predictions) of our AI, and for any environment τ and prior ξ, let Dξτ(n)∈Δ(A×O)n be the distribution over histories resulting from our algorithm starting with prior ξ and interacting with environment τ for n time steps. We have ζ=ϵμ+(1−ϵ)β, where μ is the malign part of the prior and β the benign part. For any μ′, consider Dϵμ′+(1−ϵ)βτ(n). The closure of the convex hull of these distributions for all choices of μ′ ("attacker policy") is some Θβτ(n)∈Δ(A×O)n. The maximal Knightian entropy of Θβτ(n) over all admissible τ and β is cal

2

There is a formal analogy between infra-Bayesian decision theory (IBDT) and modal updateless decision theory (MUDT).
Consider a one-shot decision theory setting. There is a set of unobservable states S, a set of actions A and a reward function r:A×S→[0,1]. An IBDT agent has some belief β∈□S[1], and it chooses the action a∗:=argmaxa∈AEβ[λs.r(a,s)].
We can construct an equivalent scenario by augmenting this one with a perfect predictor of the agent (Omega). To do so, define S′:=A×S, where the semantics of (p,s) is "the unobservable state is s and Omega predicts the agent will take action p". We then define r′:A×S′→[0,1] by r′(a,p,s):=1a=pr(a,s)+1a≠p and β′∈□S′ by Eβ′[f]:=minp∈AEβ[λs.f(p,s)] (β′ is what we call the pullback of β to S′, i.e. we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction.
The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule.
For any p∈A, define Ωp∈□S′ by EΩp[f]:=mins∈Sf(p,s). That is, Ωp is an infra-Bayesian representation of the belief "Omega will make prediction p". For any u∈[0,1], define Ru∈□S′ by ERu[f]:=minμ∈ΔS′:Eμ[r(p,s)]≥uEμ[f(p,s)]. Ru can be interpreted as the belief "assuming Omega is accurate, the expected reward will be at least u".
We will also need to use the order ⪯ on □X defined by: ϕ⪯ψ when ∀f∈[0,1]X:Eϕ[f]≥Eψ[f]. The reversal is needed to make the analogy to logic intuitive. Indeed, ϕ⪯ψ can be interpreted as "ϕ implies ψ"[2], the meet operator ∧ can be interpreted as logical conjunction and the join operator ∨ can be interpreted as logical disjunction.
Claim:
a∗=argmaxa∈Amax{u∈[0,1]∣β′∧Ωa⪯Ru}
(Actually I only checked it when we restrict to crisp infradistributions, in which case ∧ is intersection of sets and ⪯ is set containment, but it's probably true in general.)
Now, β′∧Ωa⪯Ru can be interpreted as "the conjunction of the belief β′ and Ωa implies Ru". Roughly speaking, "according to β′, if the p
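The equivalence of the two setups (that the Nirvana-augmented belief yields the same optimal action as plain maximin) can be checked numerically when β is a crisp belief given by finitely many extreme points. A toy sketch (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
nA, nS = 3, 4
r = rng.random((nA, nS))                   # reward r(a, s), values in [0, 1]
beta = rng.dirichlet(np.ones(nS), size=5)  # extreme points of the belief

def maximin_action(r, beta):
    # argmax_a min_{mu in beta} E_mu[r(a, .)]
    return int(np.argmax((r @ beta.T).min(axis=1)))

def nirvana_action(r, beta):
    # S' = A x S; r'(a, p, s) = r(a, s) if a == p, else 1 (Nirvana);
    # pullback belief: E[f] = min_p min_{mu in beta} E_mu[f(p, .)]
    nA, nS = r.shape
    values = []
    for a in range(nA):
        per_p = []
        for p in range(nA):
            rp = r[a] if a == p else np.ones(nS)
            per_p.append((beta @ rp).min())
        values.append(min(per_p))           # min over Omega's prediction p
    return int(np.argmax(values))
```

Since r′ equals 1 whenever a≠p, the inner minimum is always attained at p=a, recovering the original maximin value.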

1

Two deterministic toy models for regret bounds of infra-Bayesian bandits. The lesson seems to be that equalities are much easier to learn than inequalities.
Model 1: Let A be the space of arms, O the space of outcomes, r:A×O→R the reward function, X and Y vector spaces, H⊆X the hypothesis space and F:A×O×H→Y a function s.t. for any fixed a∈A and o∈O, F(a,o):H→Y extends to some linear operator Ta,o:X→Y. The semantics of hypothesis h∈H is defined by the equation F(a,o,h)=0 (i.e. an outcome o of action a is consistent with hypothesis h iff this equation holds).
For any h∈H denote by V(h) the reward promised by h:
V(h):=maxa∈Amino∈O:F(a,o,h)=0r(a,o)
Then, there is an algorithm with mistake bound dimX, as follows. On round n∈N, let Gn⊆H be the set of unfalsified hypotheses. Choose hn∈Gn optimistically, i.e.
hn:=argmaxh∈GnV(h)
Choose the arm an recommended by hypothesis hn. Let on∈O be the outcome we observed, rn:=r(an,on) the reward we received and h∗∈H the (unknown) true hypothesis.
If rn≥V(hn) then also rn≥V(h∗) (since h∗∈Gn and hence V(h∗)≤V(hn)) and therefore an wasn't a mistake.
If rn<V(hn) then F(an,on,hn)≠0 (if we had F(an,on,hn)=0 then the minimization in the definition of V(hn) would include r(an,on)). Hence, hn∉Gn+1=Gn∩kerTan,on. This implies dimspan(Gn+1)<dimspan(Gn). Obviously this can happen at most dimX times.
Model 2: Let the spaces of arms and hypotheses be
A:=H:=Sd:={x∈Rd+1∣∥x∥=1}
Let the reward r∈R be the only observable outcome, and the semantics of hypothesis h∈Sd be r≥h⋅a. Then, the sample complexity cannot be bounded by a polynomial of degree that doesn't depend on d. This is because Murphy can choose the strategy of producing reward 1−ϵ whenever h⋅a≤1−ϵ. In this case, whatever arm you sample, in each round you can only exclude a ball of radius ≈√2ϵ around the sampled arm. The number of such balls that fit into the unit sphere is Ω(ϵ−12d). So, normalized regret below ϵ cannot be guaranteed in less than that many rounds.
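To make Model 1 concrete, here is a toy instance (an illustrative special case, not the general construction): hypotheses and arms are vectors, the outcome of arm a under hypothesis h is the deterministic value h⋅a, the reward is the outcome itself, and F(a,o,h)=o−h⋅a, which becomes linear after homogenizing h to (h,1), giving mistake bound d+1:

```python
import numpy as np

rng = np.random.default_rng(1)
d, nH, nA = 3, 20, 10
H = rng.normal(size=(nH, d))        # hypothesis class: vectors h in R^d
arms = rng.normal(size=(nA, d))     # arms: vectors a in R^d
h_true = H[7]                       # realizable true hypothesis
V = (H @ arms.T).max(axis=1)        # V(h) = max_a h . a (reward = outcome)

def run(n_rounds=50, tol=1e-9):
    alive = np.ones(nH, dtype=bool)  # unfalsified hypotheses G_n
    mistakes = 0
    for _ in range(n_rounds):
        h_idx = int(np.argmax(np.where(alive, V, -np.inf)))  # optimism
        a_idx = int(np.argmax(H[h_idx] @ arms.T))            # recommended arm
        o = float(h_true @ arms[a_idx])                      # outcome o = h* . a
        if o < V[h_idx] - tol:
            mistakes += 1
        # falsify every h with F(a, o, h) = o - h . a != 0
        alive &= np.abs(H @ arms[a_idx] - o) < tol
    return mistakes
```

Each mistake falsifies the chosen hypothesis, which strictly drops the dimension of the homogenized span, so `run()` makes at most d+1 mistakes.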

1

One of the postulates of infra-Bayesianism is the maximin decision rule. Given a crisp infradistribution Θ, it defines the optimal action to be:
a∗(Θ):=argmaxaminμ∈ΘEμ[U(a)]
Here U is the utility function.
What if we use a different decision rule? Let t∈[0,1] and consider the decision rule
a∗t(Θ):=argmaxa(tminμ∈ΘEμ[U(a)]+(1−t)maxμ∈ΘEμ[U(a)])
For t=1 we get the usual maximin ("pessimism"), for t=0 we get maximax ("optimism") and for other values of t we get something in the middle (we can call "t-mism").
It turns out that, in some sense, this new decision rule is actually reducible to ordinary maximin! Indeed, set
μ∗t:=argmaxμ∈ΘEμ[U(a∗t)]
Θt:=tΘ+(1−t)μ∗t
Then we get
a∗(Θt)=a∗t(Θ)
More precisely, any pessimistically optimal action for Θt is t-mistically optimal for Θ (the converse need not be true in general, thanks to the arbitrary choice involved in μ∗t).
To first approximation it means we don't need to consider t-mistic agents since they are just special cases of "pessimistic" agents. To second approximation, we need to look at what the transformation of Θ to Θt does to the prior. If we start with a simplicity prior then the result is still a simplicity prior. If U has low description complexity and t is not too small then essentially we get full equivalence between "pessimism" and t-mism. If t is small then we get a strictly "narrower" prior (for t=0 we are back at ordinary Bayesianism). However, if U has high description complexity then we get a rather biased simplicity prior. Maybe the latter sort of prior is worth considering.
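The reduction is easy to verify numerically for a crisp Θ with finitely many extreme points (a sketch; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
nA, nS = 4, 5
U = rng.random((nA, nS))                    # utility U(a, s)
Theta = rng.dirichlet(np.ones(nS), size=6)  # extreme points of Theta

def t_mism_action(t):
    # argmax_a of t * min_mu E_mu[U(a)] + (1 - t) * max_mu E_mu[U(a)]
    E = U @ Theta.T                         # E[a, i] = E_{mu_i}[U(a)]
    return int(np.argmax(t * E.min(axis=1) + (1 - t) * E.max(axis=1)))

def reduced_maximin_action(t):
    # build Theta_t = t * Theta + (1 - t) * mu*_t and take ordinary maximin
    a_t = t_mism_action(t)
    mu_star = Theta[int(np.argmax(Theta @ U[a_t]))]  # mu*_t
    Theta_t = t * Theta + (1 - t) * mu_star          # extreme points of Theta_t
    E = U @ Theta_t.T
    return int(np.argmax(E.min(axis=1)))
```

For generic (random) utilities the two rules pick the same action at every t, including the endpoints t=0 (maximax) and t=1 (maximin).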

1

Infra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for higher-order logic. There might be holes and/or redundancies in the precise definitions given here, but I'm quite confident the overall idea is sound.
We will work with homogeneous ultracontributions (HUCs). □X will denote the space of HUCs over X. Given μ∈□X, S(μ)⊆ΔcX will denote the corresponding convex set. Given p∈ΔX and μ∈□X, p:μ will mean p∈S(μ). Given μ,ν∈□X, μ⪯ν will mean S(μ)⊆S(ν).
Syntax
Let Tι denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types T by:
* 0∈T (intended meaning: the uninhabited type)
* 1∈T (intended meaning: the one element type)
* If α∈Tι then α∈T
* If α,β∈T then α+β∈T (intended meaning: disjoint union)
* If α,β∈T then α×β∈T (intended meaning: Cartesian product)
* If α∈T then (α)∈T (intended meaning: predicates with argument of type α)
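The recursive grammar of types can be transcribed directly as an algebraic datatype, e.g. (a sketch; the constructor names are mine):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Zero:                # 0: the uninhabited type
    pass

@dataclass(frozen=True)
class One:                 # 1: the one-element type
    pass

@dataclass(frozen=True)
class Ind:                 # a type of individuals from T_iota
    name: str

@dataclass(frozen=True)
class Sum:                 # alpha + beta: disjoint union
    left: "Type"
    right: "Type"

@dataclass(frozen=True)
class Prod:                # alpha x beta: Cartesian product
    left: "Type"
    right: "Type"

@dataclass(frozen=True)
class Pred:                # (alpha): predicates with argument of type alpha
    arg: "Type"

Type = Union[Zero, One, Ind, Sum, Prod, Pred]
```

Frozen dataclasses give structural equality for free, which is convenient when types are later compared up to the algebraic laws.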
For each α,β∈T, there is a set F0α→β which we interpret as atomic terms of type α→β. We will denote V0α:=F01→α. Among those we distinguish the logical atomic terms:
* prαβ∈F0α×β→α
* iαβ∈F0α→α+β
* Symbols we will not list explicitly, that correspond to the algebraic properties of + and × (commutativity, associativity, distributivity and the neutrality of 0 and 1). For example, given α,β∈T there is a "commutator" of type α×β→β×α.
* =α∈V0(α×α)
* diagα∈F0α→α×α
* ()α∈V0((α)×α) (intended meaning: predicate evaluation)
* ⊥∈V0(1)
* ⊤∈V0(1)
* ∨α∈F0(α)×(α)→(α)
* ∧α∈F0(α)×(α)→(α) [EDIT: Actually this doesn't work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
* ∃αβ∈F0(α×β)→(β)
* ∀αβ∈F0(α×β)→(β) [EDIT: Actually this doesn't work because, excep

2

Let's also explicitly describe 0th order and 1st order infra-Bayesian logic (although they should be fragments of the higher-order version).
0-th order
Syntax
Let A be the set of propositional variables. We define the language L:
* Any a∈A is also in L
* ⊥∈L
* ⊤∈L
* Given ϕ,ψ∈L, ϕ∧ψ∈L
* Given ϕ,ψ∈L, ϕ∨ψ∈L
Notice there's no negation or implication. We define the set of judgements J:=L×L. We write judgements as ϕ⊢ψ ("ψ in the context of ϕ"). A theory is a subset of J.
Semantics
Given T⊆J, a model of T consists of a compact Polish space X and a mapping M:L→□X. The latter is required to satisfy:
* M(⊥)=⊥X
* M(⊤)=⊤X
* M(ϕ∧ψ)=M(ϕ)∧M(ψ). Here, we define ∧ of infradistributions as intersection of the corresponding sets
* M(ϕ∨ψ)=M(ϕ)∨M(ψ). Here, we define ∨ of infradistributions as convex hull of the corresponding sets
* For any ϕ⊢ψ∈T, M(ϕ)⪯M(ψ)
1-st order
Syntax
We define the language using the usual syntax of 1-st order logic, where the allowed operators are ∧, ∨ and the quantifiers ∀ and ∃. Variables are labeled by types from some set T. For simplicity, we assume no constants, but it is easy to introduce them. For any sequence of variables (v1…vn), we denote by Lv the set of formulae whose free variables are a subset of v1…vn. We define the set of judgements J:=⋃vLv×Lv.
Semantics
Given T⊆J, a model of T consists of
* For every t∈T, a compact Polish space M(t)
* For every ϕ∈Lv where v1…vn have types t1…tn, an element Mv(ϕ) of □Xv, where Xv:=(∏ni=1M(ti))
It must satisfy the following:
* Mv(⊥)=⊥Xv
* Mv(⊤)=⊤Xv
* Mv(ϕ∧ψ)=Mv(ϕ)∧Mv(ψ)
* Mv(ϕ∨ψ)=Mv(ϕ)∨Mv(ψ)
* Consider variables u1…un of types t1…tn and variables v1…vm of types s1…sm. Consider also some σ:{1…m}→{1…n} s.t. si=tσi. Given ϕ∈Lv, we can form the substitution ψ:=ϕ[vi=uσ(i)]∈Lu. We also have a mapping fσ:Xu→Xv given by fσ(x1…xn)=(xσ(1)…xσ(m)). We require Mu(ψ)=f∗(Mv(ϕ))
* Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection mapping. We require Mv∖vi(∃vi:ϕ)=pr∗(Mv(ϕ))
* Consider v

1

There is a special type of crisp infradistributions that I call "affine infradistributions": those that, represented as sets, are closed not only under convex linear combinations but also under affine linear combinations. In other words, they are intersections between the space of distributions and some closed affine subspace of the space of signed measures. Conjecture: in 0-th order logic of affine infradistributions, consistency is polynomial-time decidable (whereas for classical logic it is ofc NP-hard).
To produce some evidence for the conjecture, let's consider a slightly different problem. Specifically, introduce a new semantics in which □X is replaced by the set of linear subspaces of some finite dimensional vector space V. A model M is required to satisfy:
* M(⊥)=0
* M(⊤)=V
* M(ϕ∧ψ)=M(ϕ)∩M(ψ)
* M(ϕ∨ψ)=M(ϕ)+M(ψ)
* For any ϕ⊢ψ∈T, M(ϕ)⊆M(ψ)
If you wish, this is "non-unitary quantum logic". In this setting, I have a candidate polynomial-time algorithm for deciding consistency. First, we transform T into an equivalent theory s.t. all judgments are of the following forms:
* a=⊥
* a=⊤
* a⊢b
* Pairs of the form c=a∧b, d=a∨b.
Here, a,b,c,d∈A are propositional variables and "ϕ=ψ" is a shorthand for the pair of judgments ϕ⊢ψ and ψ⊢ϕ.
Second, we make sure that our T also satisfies the following "closure" properties:
* If a⊢b and b⊢c are in T then so is a⊢c
* If c=a∧b is in T then so are c⊢a and c⊢b
* If c=a∨b is in T then so are a⊢c and b⊢c
* If c=a∧b, d⊢a and d⊢b are in T then so is d⊢c
* If c=a∨b, a⊢d and b⊢d are in T then so is c⊢d
Third, we assign to each a∈A a real-valued variable xa. Then we construct a linear program for these variables consisting of the following inequalities:
* For any a∈A: 0≤xa≤1
* For any a⊢b in T: xa≤xb
* For any pair c=a∧b and d=a∨b in T: xc+xd=xa+xb
* For any a=⊥: xa=0
* For any a=⊤: xa=1
Conjecture: the theory is consistent if and only if the linear program has a solution. To see why it might be so, notice tha
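Assuming a theory already normalized and closed as above, the linear program is straightforward to set up, e.g. with scipy (a sketch; the encoding of judgments as index tuples is mine):

```python
import numpy as np
from scipy.optimize import linprog

def consistent(n, bots=(), tops=(), leqs=(), pairs=()):
    """Feasibility of the LP above. Variables are x_0 .. x_{n-1}.
    bots/tops: indices a with a = bot / a = top;
    leqs: (a, b) meaning a |- b;
    pairs: (a, b, c, d) meaning c = a AND b together with d = a OR b."""
    A_ub, b_ub, A_eq, b_eq = [], [], [], []
    for a, b in leqs:                      # x_a <= x_b
        row = np.zeros(n); row[a] += 1.0; row[b] -= 1.0
        A_ub.append(row); b_ub.append(0.0)
    for a, b, c, d in pairs:               # x_c + x_d = x_a + x_b
        row = np.zeros(n)
        row[c] += 1.0; row[d] += 1.0; row[a] -= 1.0; row[b] -= 1.0
        A_eq.append(row); b_eq.append(0.0)
    for a in bots:                         # x_a = 0
        row = np.zeros(n); row[a] += 1.0
        A_eq.append(row); b_eq.append(0.0)
    for a in tops:                         # x_a = 1
        row = np.zeros(n); row[a] += 1.0
        A_eq.append(row); b_eq.append(1.0)
    res = linprog(np.zeros(n),             # pure feasibility: zero objective
                  A_ub=np.array(A_ub) if A_ub else None,
                  b_ub=np.array(b_ub) if b_ub else None,
                  A_eq=np.array(A_eq) if A_eq else None,
                  b_eq=np.array(b_eq) if b_eq else None,
                  bounds=[(0.0, 1.0)] * n)
    return res.status == 0                 # 0 = optimal (feasible)
```

For instance, the theory {a=⊥, b=⊤} is fine, but adding b⊢a (top entails bottom) makes the program infeasible.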

1

When using infra-Bayesian logic to define a simplicity prior, it is natural to use "axiom circuits" rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce "shorthand" symbols for repeating terms. This doesn't affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length exponentially.
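A minimal illustration of the blowup: the circuit s0=x, si+1=si∧si has description length linear in n, but eliminating the shorthands yields a formula with 2^n atoms:

```python
def expand(n):
    """Fully expand the circuit s_0 = 'x', s_{i+1} = (s_i & s_i)."""
    term = "x"
    for _ in range(n):
        term = "(" + term + "&" + term + ")"  # each step doubles the formula
    return term
```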

1

Instead of introducing all the "algebrator" logical symbols, we can define T as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms:
* For any n∈N and σ∈Sn (permutation), denote n:=∑ni=11 and require σ+∈Fn→n
* For any n∈N and σ∈Sn, σ×α∈Fαn→αn
However, if we do this then it's not clear whether deciding that an expression is a well-formed term can be done in polynomial time. Because, to check that the types match, we need to test the identity of algebraic expressions and opening all parentheses might result in something exponentially long.

1

Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just imagine that types are variables over Q, and start from testing the identity of the types appearing inside parentheses), so we can validate expressions in randomized polynomial time (and, given standard conjectures, in deterministic polynomial time as well).
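The adapted test is ordinary polynomial identity testing: evaluate both type expressions at random points over a large prime field and compare (a sketch; the expression encoding is mine):

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, used as the field size

def ev(expr, env):
    """Evaluate a type expression mod P. expr is an int literal (0 or 1),
    a variable name (str), or a tuple ('+', l, r) / ('*', l, r)."""
    if isinstance(expr, int):
        return expr % P
    if isinstance(expr, str):
        return env[expr]
    op, l, r = expr
    a, b = ev(l, env), ev(r, env)
    return (a + b) % P if op == "+" else (a * b) % P

def probably_equal(e1, e2, variables, trials=20):
    # Schwartz-Zippel: distinct polynomials of degree deg agree at a
    # uniformly random point with probability at most deg / P.
    for _ in range(trials):
        env = {v: random.randrange(P) for v in variables}
        if ev(e1, env) != ev(e2, env):
            return False
    return True
```

E.g. distributivity a×(b+c) = a×b + a×c passes the test, while a+b vs a×b fails it almost surely.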

Until now I believed that a straightforward bounded version of the Solomonoff prior cannot be the frugal universal prior because Bayesian inference under such a prior is NP-hard. One reason it is NP-hard is the existence of pseudorandom generators. Indeed, Bayesian inference under such a prior distinguishes between a pseudorandom and a truly random sequence, whereas a polynomial-time algorithm cannot distinguish between them. It also seems plausible that, in some sense, this is the *only* obstacle: it was established that if one-way functions don't exist (wh...

A major impediment in applying RL theory to any realistic scenario is that even the control problem^{[1]} is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:

- In real life, processes can often be modeled as made of independent co-existing parts. For example, if I need to decide on my exercise routine for the next month and also on my research goals for the next month, the two

Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" we should expect the agent to acquire.

*Does mathematics have finite information content?*

First, let's focus on *computable* mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of those new facts are essentially random noise, rather than "meaningful" information?

Is there a difference of principle between "noise" and "meaningful content"? It is not obvious, but the answer is "yes": in algorithmic statistics there is the notion of "sophistication" which measures how much "non-random" information is contained in some data. In our setting, the question can be operationalized as follows: is it possible to have an algorithm A plus an infinite sequence of bits ρ, s.t. ρ is random in some formal sense (e.g. Martin-Löf) and A can decide the output of any finite computation if it's also given access to ρ?

The answer to th...

3

Wikipedia claims that every sequence is Turing reducible to a random one, giving a positive answer to the non-resource-bounded version of any question of this form. There might be a resource-bounded version of this result as well, but I'm not sure.

Some thoughts about embedded agency.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: *What kind of agent, and under what conditions, can effectively plan for events after its own death?* For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it with a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something other than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some *fixed ontology*

This is preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long-term planning limit) to achieving optimal expected user!utility *with respect to the knowledge state of the user*. In order to achieve this, we need to establish a communicati

1

I gave a talk on Dialogic Reinforcement Learning in the AI Safety Discussion Day, and there is a recording.

1

A variant of Dialogic RL with improved corrigibility. Suppose that the AI's prior allows a small probability for "universe W" whose semantics are, roughly speaking, "all my assumptions are wrong, need to shut down immediately". In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down is much higher utility than anything else. Moreover, we add to the prior the assumption that the formal question "W?" is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering "yes" if asked "W?" at any uncorrupt point in the future, the AI will shut down immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but will also ensure that the AI won't arrive at believing too many futures to be corrupt and thereby avoid the imperative to shut down in response to a confirmation of W.
Now, this won't help if the user only resolves to confirm W after something catastrophic already occurred, such as the AI releasing malign subagents into the wild. But, something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes in the AI on eir own initiative, which can always be too late. This method doesn't ensure safety in itself, just hardens a system that is supposed to be already close to safe.
It would be nice if we could replace "shutdown" by "undo everything you did and then shutdown" but that gets us into thorny specifications issues. Perhaps it's possible to tackle those issues by one of the approaches to "low impact".

1

Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases leading to shutdown. This serves as an additional "sanity test" precaution.

*Epistemic status: most elements are not new, but the synthesis seems useful.*

Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).

Arguably the biggest concern with naive debate^{[1]} is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterargument" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quanti...
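For concreteness, the basic quantilization operation is easy to sketch (a toy empirical version with illustrative parameters, not the actual protocol):

```python
import random

def quantilize(base_sampler, utility, q, n=10000):
    """Toy q-quantilizer: estimate the top-q tail of the base distribution
    empirically with n samples, then return a uniform sample from that tail."""
    samples = sorted((base_sampler() for _ in range(n)),
                     key=utility, reverse=True)
    top = samples[: max(1, int(q * n))]   # keep the top q fraction by utility
    return random.choice(top)
```

The point of sampling rather than optimizing is that an utterance drawn this way is at most 1/q times more likely than under the base distribution, bounding how much probability mass can land on Lovecraftian outputs.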

1

I'm not sure this attacks goodharting directly enough. Optimizing a system for proxy utility moves its state out-of-distribution where proxy utility generalizes training utility incorrectly. This probably holds for debate optimized towards intended objectives as much as for more concrete framings with state and utility.
Dithering across the border of goodharting (of scope of a proxy utility) with quantilization is actionable, but isn't about defining the border or formulating legible strategies for what to do about optimization when approaching the border. For example, one might try for shutdown, interrupt-for-oversight, or getting-back-inside-the-borders when optimization pushes the system outside, which is not quantilization. (Getting-back-inside-the-borders might even have weird-x-risk prevention as a convergent drive, but will oppose corrigibility. Some version of oversight/amplification might facilitate corrigibility.)
Debate seems more useful for amplification, extrapolating concepts in a way humans would, in order to become acceptable proxies in wider scopes, so that more and more debates become non-lovecraftian. This is a different concern from setting up optimization that works with some fixed proxy concepts as given.

2

I don't understand what you're saying here.
For debate, goodharting means producing an answer which can be defended successfully in front of the judge, even in the face of an opponent pointing out all the flaws, but which is nevertheless bad. My assumption here is: it's harder to produce such an answer than producing a genuinely good (and defensible) answer. If this assumption holds, then there is a range of quantilization parameters which yields good answers.
For the question of "what is a good plan to solve AI risk", the assumption seems solid enough since we're not worried about coming across such deceptive plans on our own, and it's hard to imagine humans producing one even on purpose. To the extent our search for plans relies mostly on our ability to evaluate arguments and find counterarguments, it seems like the difference between the former and the latter is not great anyway. This argument is especially strong if we use human debaters as the baseline distribution, although in this case we are vulnerable to the same competitiveness problem as amplified-imitation, namely that reliably predicting rich outputs might be infeasible.
For the question of "should we continue changing the quantilization parameter", the assumption still holds because the debater arguing to stop at the given point can win by presenting a plan to solve AI risk which is superior to continuing to change the parameter.

1

Goodharting is about what happens in situations where "good" is undefined or uncertain or contentious, but still gets used for optimization. There are situations where it's better-defined, and situations where it's ill-defined, and an anti-goodharting agent strives to optimize only within scope of where it's better-defined. I took "lovecraftian" as a proxy for situations where it's ill-defined, and base distribution of quantilization that's intended to oppose goodharting acts as a quantitative description of where it's taken as better-defined, so for this purpose base distribution captures non-lovecraftian situations. Of the options you listed for debate, the distribution from imitation learning seems OK for this purpose, if amended by some anti-weirdness filters to exclude debates that can't be reliably judged.
The main issues with anti-goodharting that I see are the difficulty of defining proxy utility and base distribution, the difficulty of making it corrigible, not locking-in into fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.
My point is that if anti-goodharting and not development of quantilization is taken as a goal, then calibration of quantilization is not the kind of thing that helps, it doesn't address the main issues. Like, even for quantilization, fiddling with base distribution and proxy utility is a more natural framing that's strictly more general than fiddling with the quantilization parameter. If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?
The use of debates for amplification in this framing is for corrigibility part of anti-goodharting, a way to redefine utility proxy and expand the base distribution, learning from how the debates at the boundary of the previous base distribution go. Quantilization seems like a fine building block for this, sampling

3

The proxy utility in debate is perfectly well-defined: it is the ruling of the human judge. For the base distribution I also made some concrete proposals (which certainly might be improvable but are not obviously bad). As to corrigibility, I think it's an ill-posed concept. I'm not sure how you imagine corrigibility in this case: AQD is a series of discrete "transactions" (debates), and nothing prevents you from modifying the AI between one and another. Even inside a debate, there is no incentive in the outer loop to resist modifications, whereas daemons would be impeded by quantilization. The "out of scope" case is also dodged by quantilization, if I understand what you mean by "out of scope".
Why is it strictly more general? I don't see it. It seems false, since for extreme value of the quantilization parameter we get optimization which is deterministic and hence cannot be equivalent to quantilization with different proxy and distribution.
The reason to pick the quantilization parameter is because it's hard to determine, as opposed to the proxy and base distribution[1] for which there are concrete proposals with more-or-less clear motivation.
I don't understand which "main issues" you think this doesn't address. Can you describe a concrete attack vector?
----------------------------------------
1. If the base distribution is a bounded simplicity prior then it will have some parameters, and this is truly a weakness of the protocol. Still, I suspect that safety is less sensitive to these parameters and it is more tractable to determine them by connecting our ultimate theories of AI with brain science (i.e. looking for parameters which would mimic the computational bounds of human cognition). ↩︎

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best *approximation* of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some *incomplete* descriptions. B

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be *unlearnable*, mea

In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular in MIRI (see also discussion):

- Learning theory starts from formulating natural *desiderata* for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.
- Learning theory naturally allows analyzing computational complexity, whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the onset.
- Learning theory focuses on objects that are observable o

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the *deterministic* version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max

3

My takeaway from this is that if we're doing policy selection in an environment that contains predictors, instead of applying the counterfactual belief that the predictor is always right, we can assume that we get rewarded if the predictor is wrong, and then take maximin.
How would you handle Agent Simulates Predictor? Is that what TRL is for?
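The takeaway above can be checked on the standard Newcomb payoffs: replace the reward in the cells where the predictor is wrong by a large bonus and take maximin (a toy sketch; the numbers are illustrative):

```python
BONUS = 10**6   # stand-in reward for "the predictor was wrong"

# Standard Newcomb payoffs (in $1000s): keys are (agent action, prediction).
payoff = {
    ("one-box", "one-box"): 1000,
    ("one-box", "two-box"): 0,
    ("two-box", "one-box"): 1001,
    ("two-box", "two-box"): 1,
}

def maximin_action():
    actions = ("one-box", "two-box")
    def worst_case(a):
        # mismatch cells (predictor wrong) are replaced by the bonus reward
        return min(BONUS if a != p else payoff[(a, p)] for p in actions)
    return max(actions, key=worst_case)
```

One-boxing has worst case 1000 (the mismatch cell is now the bonus), while two-boxing has worst case 1, so maximin one-boxes.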

2

That's about right. The key point is, "applying the counterfactual belief that the predictor is always right" is not really well-defined (that's why people have been struggling with TDT/UDT/FDT for so long) while the thing I'm doing is perfectly well-defined. I describe agents that are able to learn which predictors exist in their environment and respond rationally ("rationally" according to the FDT philosophy).
TRL is for many things to do with rational use of computational resources, such as (i) doing multi-level modelling in order to make optimal use of "thinking time" and "interacting with environment time" (i.e. simultaneously optimize sample and computational complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian daemons (iv) preventing thought crimes. But, yes, it also provides a solution to ASP. TRL agents can learn whether it's better to be predictable or predicting.

1

"The key point is, "applying the counterfactual belief that the predictor is always right" is not really well-defined" - What do you mean here?
I'm curious whether you're referring to the same as or similar to the issue I was referencing in Counterfactuals for Perfect Predictors. The TLDR is that I was worried that it would be inconsistent for an agent that never pays in Parfit's Hitchhiker to end up in town if the predictor is perfect, so that it wouldn't actually be well-defined what the predictor was predicting. And the way I ended up resolving this was by imagining it as an agent that takes input and asking what it would output if given that inconsistent input. But not sure if you were referencing this kind of concern or something else.

2

It is not a mere "concern", it's the crux of the problem really. What people in the AI alignment community have been trying to do is start with some factual and "objective" description of the universe (such as a program or a mathematical formula) and derive counterfactuals. The way it's supposed to work is, the agent needs to locate all copies of itself or things "logically correlated" with itself (whatever that means) in the program, and imagine it is controlling this part. But a rigorous definition of this that solves all standard decision theoretic scenarios was never found.
Instead of doing that, I suggest a solution of different nature. In quasi-Bayesian RL, the agent never arrives at a factual and objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).

1

Yeah, I agree that the objective descriptions can leave out vital information, such as how the information you know was acquired, which seems important for determining the counterfactuals.

1

But in Newcomb's problem, the agent's reward in case of wrong prediction is already defined. For example, if the agent one-boxes but the predictor predicted two-boxing, the reward should be zero. If you change that to +infinity, aren't you open to the charge of formalizing the wrong problem?

1

The point is, if you put this "quasi-Bayesian" agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, if you're judging it from the side, you will have to concede it behaves rationally, regardless of its internal representation of reality.
Philosophically, my point of view is, it is an error to think that counterfactuals have objective, observer-independent meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent's point of view, it would reach Nirvana if it dodged the predictor. From Omega's point of view, if Omega predicted two-boxing and the agent one-boxed, the agent's reward would be zero (and the agent would learn its beliefs were wrong). From a third-person point of view, the counterfactual "Omega makes an error of prediction" is ill-defined: it's conditioning on an event of probability 0.
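The learning claim in the first paragraph can be illustrated with a deliberately crude toy model (all names and the learning rule here are made up for illustration: the agent optimistically follows the unfalsified incomplete model promising the highest guaranteed value, which is a drastic simplification of the actual quasi-Bayesian formalism):

```python
# Standard Newcomb payoffs: (agent action, Omega's prediction) -> reward.
PAYOFF = {("one", "one"): 1_000_000, ("one", "two"): 0,
          ("two", "one"): 1_001_000, ("two", "two"): 1_000}

def accurate_predictor_guarantee(action):
    # Incomplete model "Omega is always right": a wrong prediction would
    # lead to Nirvana (reward +infinity), so it never drags the worst case
    # down, leaving the diagonal payoff as the guaranteed value.
    return PAYOFF[(action, action)]

def vacuous_guarantee(action):
    # Model placing no constraint on Omega: worst case over predictions.
    return min(PAYOFF[(action, p)] for p in ("one", "two"))

def run(rounds=100):
    models = [accurate_predictor_guarantee, vacuous_guarantee]
    total = 0
    for _ in range(rounds):
        # Follow the unfalsified model promising the highest guarantee.
        model = max(models, key=lambda m: max(m(a) for a in ("one", "two")))
        action = max(("one", "two"), key=model)
        prediction = action  # Omega really is an accurate predictor
        reward = PAYOFF[(action, prediction)]
        if reward < model(action):  # guarantee violated: falsify the model
            models.remove(model)
        total += reward
    return total / rounds
```

Since the accurate-predictor model promises 1,000,000 for one-boxing and is never falsified by an accurate Omega, the agent one-boxes from the start and averages the FDT payoff.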

1

Yeah, I think I can make peace with that. Another way to think of it is that we can keep the reward structure of the original Newcomb's problem, but instead of saying "Omega is almost always right" we add another person Bob (maybe the mad scientist who built Omega) who's willing to pay you a billion dollars if you prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess the remaining question is why minimaxing is the right thing to do. And if randomizing is allowed, the idea of Omega predicting how you'll randomize seems a bit dodgy as well.

3

Another explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs, the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form
$\lim_{\gamma\to 1}\mathbb{E}_{\mu^{\pi_\gamma}}[U(\gamma)]\ge f(\mu)$
Here, $f$ doesn't have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left hand side is linear in $\mu$, so any $\pi$ that satisfies this inequality will also satisfy it for the concave hull of $f$.
What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is
$\lim_{\gamma\to 1}\mathbb{E}_{\mu^{\pi_\gamma}}[U(\gamma)]\ge V(\mu,\gamma)-f(\mu)$
But this has the same form! Therefore we can consider it a special case of applying maximin (more precisely, it requires allowing the fuzzy belief to depend on $\gamma$, but this is not a problem for the basics of the formalism).
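Spelling out the rearrangement (where $g_\gamma$ is a name introduced here for the $\gamma$-dependent bound):

```latex
% Minimax-regret desideratum, as above:
\lim_{\gamma\to 1}\mathbb{E}_{\mu^{\pi_\gamma}}[U(\gamma)] \;\ge\; V(\mu,\gamma)-f(\mu)
% Defining g_\gamma(\mu) := V(\mu,\gamma)-f(\mu), this becomes
\lim_{\gamma\to 1}\mathbb{E}_{\mu^{\pi_\gamma}}[U(\gamma)] \;\ge\; g_\gamma(\mu)
% i.e. the maximin desideratum with a \gamma-dependent fuzzy belief.
```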
What if we want our policy to be at least as good as some fixed policy $\pi'_0$? Then the desideratum is
$\lim_{\gamma\to 1}\mathbb{E}_{\mu^{\pi_\gamma}}[U(\gamma)]\ge \mathbb{E}_{\mu^{\pi'_0}}[U(\gamma)]$
It still has the same form!
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:
$\lim_{\gamma\to 1}\mathbb{E}_{\mu^{\pi_\gamma}}[U(\gamma)]\ge f(\pi,\mu)$
To achieve this, we postulate a predictor that guesses the policy, producing the guess $\hat{\pi}$, and define the fuzzy belief using the function $\mathbb{E}_{h\sim\mu}[f(\hat{\pi}(h),\mu)]$ (we assume the guess is not influenced by the agent's actions, so we don't need $\pi$ in the expected value). Using the Nirvana trick, we effectively force the guess to be accurate.
In particular, this captures self-referential desiderata of the type "the policy cannot be improved by changing it in this particular way". These are of the form:
$\lim_{\gamma\to 1}\mathbb{E}_{\mu^{\pi_\gamma}}[U(\gamma)]\ge \mathbb{E}_{\mu^{F(\pi)}}[U(\gamma)]$
It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting

1

Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more "philosophical" defense of maximin is possible, analogous to VNM / complete class theorems, but I don't know (I actually saw some papers in that vein but haven't read them in detail.)
If the agent has random bits that Omega doesn't see, and Omega is predicting the probabilities of the agent's actions, then I think we can still solve it with quasi-Bayesian agents, but it requires considering more complicated models and I haven't worked out the details. Specifically, I think that we can define some function X that depends on the agent's actions and Omega's predictions so far (a measure of Omega's apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of X over time is finite with probability 1. Then, we consider a family of models, where model number n says that X<n for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately.
EDIT 1: I think X should be something like, how much money would a gambler following a particular strategy win, betting against Omega.
EDIT 2: Here is the solution. In the case of original Newcomb, consider a gambler that bets against Om

1

I agree that you can assign whatever belief you want (e.g. whatever is useful for the agent's decision-making process) for what happens in the counterfactual when Omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However, if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, because this is something that might actually be observed.

1

The method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if "idealized Omega" is wrong.

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...} At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!
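A toy numerical illustration of these spikes (the complexity proxy below is a made-up stand-in for Kolmogorov complexity, chosen only so that powers of two count as "highly compressible"; it is not a real approximation of K(n)):

```python
from fractions import Fraction

def crude_complexity(n):
    # Made-up proxy for K(n): powers of two get very short codes,
    # a generic n costs about log2(n) bits.
    if n > 0 and n & (n - 1) == 0:
        return n.bit_length().bit_length() + 1
    return n.bit_length() + 1

def p_next_bit_is_one(t):
    # Mixture over "all zeros" (weight 1/2) and "zeros except a single 1
    # at index n" (weight 2^-K(n)); after t observed zeros only n >= t
    # survive, and only n = t makes the next bit a 1.
    w_all_zeros = Fraction(1, 2)
    w_surviving = sum(Fraction(1, 2 ** crude_complexity(n))
                      for n in range(t, t + 4096))
    w_one_now = Fraction(1, 2 ** crude_complexity(t))
    return w_one_now / (w_all_zeros + w_surviving)
```

At the highly compressible index 1024 this inductor is far less sure the next bit is 0 than at the neighboring index 1023, even though it has seen only zeros either way.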

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re...

1

I think that in embedded settings (with a bounded version of Solomonoff induction) convergence may never occur, even in the limit as the amount of compute that is used for executing the agent goes to infinity. Suppose the observation history contains sensory data that reveals the probability distribution that the agent had, in the last time step, for the next number it's going to see in the target sequence. Now consider the program that says: "if the last number was predicted by the agent to be 0 with probability larger than $1-2^{-10^{10}}$ then the next number is 1; otherwise it is 0." Since it takes much less than $10^{10}$ bits to write that program, the agent will never predict two times in a row that the next number is 0 with probability larger than $1-2^{-10^{10}}$ (after observing only 0s so far).
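The self-defeating dynamic can be simulated with a small threshold standing in for $1-2^{-10^{10}}$ and a naive Laplace-rule predictor (both hypothetical choices for illustration; the argument itself applies to any predictor whose confidence is visible to the environment):

```python
def run(steps=10_000, threshold=0.99):
    zeros = total = 0
    confident = wrong_when_confident = 0
    for _ in range(steps):
        p0 = (zeros + 1) / (total + 2)    # Laplace-rule P(next bit = 0)
        bit = 1 if p0 > threshold else 0  # the adversarial program
        if p0 > threshold:
            confident += 1
            wrong_when_confident += (bit != 0)
        zeros += (bit == 0)
        total += 1
    return confident, wrong_when_confident
```

Every step on which the predictor crosses the confidence threshold gets punished with a 1, so its confidence in 0 can never stay above the threshold.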

Here is a way to construct many learnable undogmatic ontologies, including such with finite state spaces.

A *deterministic partial environment* (DPE) over action set and observation set is a pair where and s.t.

- If is a prefix of some , then .
- If , and is a prefix of , then .

DPEs are equipped with a natural partial order. Namely, when and .

Let ...

There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.

We want algorithms that learn domain D while gaining as little as possible knowledge about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space $X$, label space $Y$, distribution $\mu$ and loss function $L$. Similarly, domain E is represented by inst...

3

The above threat model seems too paranoid: it is defending against an adversary that sees the trained model and knows the training algorithm. In our application, the model itself is either dangerous or not, independently of the training algorithm that produced it.
Let $\epsilon>0$ be our accuracy requirement for the target domain. That is, we want $f:X\to Y$ s.t.
$\mathbb{E}_{xy\sim\mu}[L(y,f(x))]\le \min_{f':X\to Y}\mathbb{E}_{xy\sim\mu}[L(y,f'(x))]+\epsilon$
Given any $f:X\to Y$, denote $\zeta_{f,\epsilon}$ to be $\zeta$ conditioned on the inequality above, where $\mu$ is regarded as a random variable. Define $B_{f,\epsilon}:(Z\times W)^*\times Z\to W$ by
$B_{f,\epsilon}(T,z):=\arg\min_{w\in W}\mathbb{E}_{\nu\sim\zeta_{f,\epsilon},\,T'z'w'\sim\nu^{|T|+1}}[M(w',w)\mid T'=T,\,z'=z]$
That is, $B_{f,\epsilon}$ is the Bayes-optimal learning algorithm for domain E w.r.t. prior $\zeta_{f,\epsilon}$.
Now, consider some $A:(X\times Y)^*\times(Z\times W)^*\times X\to Y$. We regard $A$ as a learning algorithm for domain D which undergoes "antitraining" for domain E: we provide it with a dataset for domain E that tells it what not to learn. We require that $A$ achieves asymptotic accuracy $\epsilon$[1], i.e. that if $\mu$ is sampled from $\zeta$ then with probability 1
$\lim_{n\to\infty}\sup_{T\in(Z\times W)^*}\mathbb{E}_{Sxy\sim\mu^{n+1}}[L(y,A(S,T,x))]\le \min_{f:X\to Y}\mathbb{E}_{xy\sim\mu}[L(y,f(x))]+\epsilon$
Under this constraint, we want $A$ to be as ignorant as possible about domain E, which we formalize as maximizing $\mathrm{IG}_A$ defined by
$\mathrm{IG}_A^{nm}:=\mathbb{E}_{\mu\nu\sim\zeta,\,S\sim\mu^n,\,Tzw\sim\nu^{m+1}}[M(w,B_{A(S,T),\epsilon}(T,z))]$
It is actually important to consider m>0 because in order to exploit the knowledge of the model about domain E, an adversary needs to find the right embedding of this domain into the model's "internal language". For m=0 we can get high IG despite the model actually knowing domain E because the adversary B doesn't know the embedding, but for m>0 it should be able to learn the embedding much faster than learning domain E from scratch.
We can imagine a toy example where X=Z=Rd, the projections of μ and ν to X and Z respectively are distributions concentrated around two affine subspaces, Y=W={−1,+1} and the labels are determined by the sign of a polynomial which is the same for μ and ν up to a linear transformation α:Rd→Rd which is a ran

*Epistemic status: no claims to novelty, just (possibly) useful terminology*.

[**EDIT:** I increased all the class numbers by 1 in order to admit a new definition of "class I", see child comment.]

I propose a classification of AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the "potential" class and the latter the "effective" class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems that are effectively class 0 (or at least I-II).

**Class II:** Systems that only ever receive synthetic data that has nothing to do with the real world

Examples:

- AI that is trained to learn Go by self-play
- AI that is trained to prove random mathematical statements
- AI that is trained to make rapid predictions of future cell states in the game of life for random initial conditions
- AI that is trained to find regularities in sequences corresponding to random programs on some natural universal Turing machine

1

The idea comes from this comment of Eliezer.
Class II or higher systems might admit an attack vector by daemons that infer the universe from the agent's source code. That is, we can imagine a malign hypothesis that makes a treacherous turn after observing enough past actions to infer information about the system's own source code and infer the physical universe from that. (For example, in a TRL setting it can match the actions to the output of a particular program for the envelope.) Such daemons are not as powerful as malign simulation hypotheses, since their prior probability is not especially large (compared to the true hypothesis), but might still be non-negligible. Moreover, it is not clear whether the source code can realistically have enough information to enable an attack, but the opposite is not entirely obvious either.
To account for this, I propose to designate as class I those systems which don't admit this attack vector. For the potential sense, it means that either (i) the system's design is too simple to enable inferring much about the physical universe, or (ii) there is no access to past actions (including opponent actions for self-play) or (iii) the label space is small, which means an attack requires making many distinct errors, and such errors are penalized quickly. And of course it requires no direct access to the source code.
We can maybe imagine an attack vector even for class I systems, if most metacosmologically plausible universes are sufficiently similar, but this is not very likely. Nevertheless, we can reserve the label class 0 for systems that explicitly rule out even such attacks.

One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent *beliefs*. We think of the system as a game, where every action-observation history corresponds

1

There is a deficiency in this "dynamically subjective" regret bound (also can be called "realizable misalignment" bound) as a candidate formalization of alignment. It is not robust to scaling down. If the AI's prior allows it to accurately model the user's beliefs (realizability assumption), then the criterion seems correct. But, imagine that the user's beliefs are too complex and an accurate model is not possible. Then the realizability assumption is violated and the regret bound guarantees nothing. More precisely, the AI may use incomplete models to capture some properties of the user's beliefs and exploit them, but this might be not good enough. Therefore, such an AI might fall into a dangerous zone when it is powerful enough to cause catastrophic damage but not powerful enough to know it shouldn't do it.
To fix this problem, we need to introduce another criterion which has to hold simultaneously with the misalignment bound. We need that for any reality that satisfies the basic assumptions built into the prior (such as, the baseline policy is fairly safe, most questions are fairly safe, human beliefs don't change too fast etc), the agent will not fail catastrophically. (It would be way too much to ask it would converge to optimality, it would violate no-free-lunch.) In order to formalize "not fail catastrophically" I propose the following definition.
Let's start with the case when the user's preferences and beliefs are dynamically consistent. Consider some AI-observable event S that might happen in the world. Consider a candidate learning algorithm πlearn and two auxiliary policies. The policy πbase→S follows the baseline policy until S happens, at which time it switches to the subjectively optimal policy. The policy πlearn→S follows the candidate learning algorithm until S happens, at which time it also switches to the subjectively optimal policy. Then, the "S-dangerousness" of πlearn is defined to be the expected utility of πbase→S minus the expected utility

1

This seems quite close (or even identical) to attainable utility preservation; if I understand correctly, this echoes arguments I've made for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.

1

There is some similarity, but there are also major differences. They don't even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic for how to tweak Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but they will still be very different conditions.
The reason I pointed out the relation to corrigibility is not because I think that's the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that "if you run this AI, this won't make things worse than not running the AI", no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.
From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can't (empirically and in the worst-case) prove a negative.

1

Dialogic RL assumes that the user has beliefs about the AI's ontology. This includes the environment(fn1) from the AI's perspective. In other words, the user needs to have beliefs about the AI's counterfactuals (the things that would happen if the AI chooses different possible actions). But, what are the semantics of the AI's counterfactuals from the user's perspective? This is more or less the same question that was studied by the MIRI-sphere for a while, starting from Newcomb's paradox, TDT et cetera. Luckily, I now have an answer based on the incomplete models formalism. This answer can be applied in this case also, quite naturally.
Specifically, we assume that there is a sense, meaningful to the user, in which ey select the AI policy (program the AI). Therefore, from the user's perspective, the AI policy is a user action. Again from the user's perspective, the AI's actions and observations are all part of the outcome. The user's beliefs about the user's counterfactuals can therefore be expressed as $\sigma:\Pi\to\Delta\big((A\times O)^\omega\big)$(fn2), where $\Pi$ is the space of AI policies(fn3). We assume that for every $\pi\in\Pi$, $\sigma(\pi)$ is consistent with $\pi$ in the natural sense. Such a belief can be transformed into an incomplete model from the AI's perspective, using the same technique we used to solve Newcomb-like decision problems, with $\sigma$ playing the role of Omega. For a deterministic AI, this model looks like: (i) at first, "Murphy" makes a guess that the AI's policy is $\pi=\pi_{\text{guess}}$ (ii) the environment behaves according to the conditional measures of $\sigma(\pi_{\text{guess}})$ (iii) if the AI's policy ever deviates from $\pi_{\text{guess}}$, the AI immediately enters an eternal "Nirvana" state with maximal reward. For a stochastic AI, we need to apply the technique with statistical tests and multiple models alluded to in the link. This can also be generalized to the setting where the user's beliefs are already an incomplete model, by adding another step where Murphy chooses $\sigma$ out of some set.
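For the deterministic case, steps (i)-(iii) above can be sketched as follows (toy code with made-up types: a policy and an environment are both functions of the history, and sigma is deterministic for simplicity):

```python
NIRVANA = "NIRVANA"

def murphy_branch(sigma, pi_guess):
    """One branch of the incomplete model: Murphy has guessed that the
    AI's policy is pi_guess. sigma maps a policy to an environment,
    here a function from histories to the next observation."""
    env = sigma(pi_guess)
    def step(history, action):
        if action != pi_guess(history):
            # Deviating from the guessed policy sends the AI to an
            # eternal Nirvana state with maximal reward.
            return NIRVANA
        return env(history + (action,))
    return step
```

The full incomplete model is then the set of such branches over all possible guesses, with Murphy (adversarially, for the maximin agent) choosing the branch.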
What we constructed is a method of translating

1

Another notable feature of this approach is its resistance to "attacks from the future", as opposed to approaches based on forecasting. In the latter, the AI has to predict some future observation, for example what the user will write after working on some problem for a long time. In particular, this is how the distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster might sample a future in which a UFAI has been instantiated, and this UFAI will exploit this to infiltrate the present. This might result in a self-fulfilling prophecy, but even if the forecasting is counterfactual (and thus immune to self-fulfilling prophecies), it can be attacked by a UFAI that came to be for unrelated reasons. We can ameliorate this by making the forecasting recursive (i.e. apply multiple distillation & amplification steps) or use some other technique to compress a lot of "thinking time" into a small interval of physical time. However, this is still vulnerable to UFAIs that might arise already at present with a small probability rate (these are likely to exist since our putative FAI is deployed at a time when technology progressed enough to make competing AGI projects a real possibility).
Now, compare this to Dialogical RL, as defined via the framework of dynamically inconsistent beliefs. Dialogical RL might also employ forecasting to sample the future, presumably more accurate, beliefs of the user. However, if the user is aware of the possibility of a future attack, this possibility is reflected in eir beliefs, and the AI will automatically take it into account and deflect it as much as possible.

1

This approach also obviates the need for an explicit commitment mechanism. Instead, the AI uses the current user's beliefs about the quality of future user beliefs to decide whether it should wait for the user's beliefs to improve or commit to an irreversible course of action. Sometimes it can also predict the future user beliefs instead of waiting (predict according to current user beliefs updated by the AI's observations).

In my previous shortform, I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.

In the first interpretation, an attack vector is a source of *perverse incentives*. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating

A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but related. Therefore, it is desirable not only to solve each separately, but also to have an elegant synthesis of the solutions.

**Problem 1:** In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).

Currently I only have speculations about the solution. But, I have a few desiderata for it:

**De**...

It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL, the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements in a set. The formalization of an "ontology" should probably be a POMDP or a suitable

...