I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:

Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.

Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.

Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.

Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.

The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.

In principle, we might be able to use knowledge of metacosmology in order to engineer an "atheist prior" for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider (some extension of) a delegative IRL setting (i.e. there is a single set of input/output channels, control of which can be toggled between the user and the AI by the AI). Let π^u_υ be the user's policy in universe υ and π^a the AI's policy. Let T be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability 1 for any policy. Let V_υ be the value of a state from the user's subjective POV, in universe υ. Let μ_υ be the environment in universe υ. Finally, let ζ be the AI's prior over universes and ϵ...

5 Steve Byrnes 5mo
(Update: I don't think this was 100% right, see here
[https://www.lesswrong.com/posts/SzrmsbkqydpZyPuEh/my-take-on-vanessa-kosoy-s-take-on-agi-safety#4_1_Example___the_Hippocratic_principle__desideratum__and_an_algorithm_that_obeys_it]
for a better version.)
Attempted summary for morons like me: AI is trying to help the human H. They
share access to a single output channel, e.g. a computer keyboard, so that the
actions that H can take are exactly the same as the actions AI can take. Every
step, AI can either take an action, or delegate to H to take an action. Also,
every step, H reports her current assessment of the timeline / probability
distribution for whether she'll succeed at the task, and if so, how soon.
At first, AI will probably delegate to H a lot, and by watching H work, AI will
gradually learn both the human policy (i.e. what H tends to do in different
situations), and how different actions tend to turn out in hindsight from H's
own perspective (e.g., maybe whenever H takes action 17, she tends to declare
shortly afterwards that probability of success now seems much higher than
before—so really H should probably be taking action 17 more often!).
Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate
how different actions will turn out from H's perspective much better than H
herself. In other words, maybe it delegates to H, and H takes action 41, and the
AI is watching this and shaking its head and thinking to itself "gee you dunce
you're gonna regret that", and shortly thereafter the AI is proven correct.
OK, so now what? The naive answer would be: the AI should gradually stop
delegating and start just doing the thing that leads to H feeling maximally
optimistic later on.
But we don't want to do that naive thing. There are two problems:
The first problem is "traps" (a.k.a. catastrophes). Let's say action 0 is Press
The History Eraser Button [https://vimeo.com/126720159]. H never takes that
action. The AI shouldn't either.

2 Vanessa Kosoy 5mo
This is about right.
Notice that typically we use the AI for tasks which are hard for H. This means
that without the AI's help, H's probability of success will usually be low.
Quantilization-wise, this is a problem: the AI will be able to eliminate those
paths for which H will report failure, but maybe most of the probability mass
among apparent-success paths is still on failure (i.e. the success report is
corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but they don't expect to fail
soon. Therefore, the AI can safely consider policies of the form "in the
short-term, do something H would do with marginal probability, in the long-term
go back to H's policy". If by the end of the short-term maneuver H reports an
improved prognosis, this can imply that the improvement is genuine (since the AI
knows H is probably uncorrupted at this point). Moreover, it's possible that in
the new prognosis H still doesn't expect to fail soon. This allows performing
another maneuver of the same type. This way, the AI can iteratively steer the
trajectory towards true success.

4 Alex Turner 1mo
The Hippocratic principle seems similar to my concept of non-obstruction
[https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility],
but subjective from the human's beliefs instead of the AI's.

2 Vanessa Kosoy 23d
Yes, there is some similarity! You could say that a Hippocratic AI needs to be
continuously non-obstructive w.r.t. the set of utility functions and priors the
user could plausibly have, given what the AI knows. Where, by "continuously" I
mean that we are allowed to compare keeping the AI on or turning off at any
given moment.

2 Vanessa Kosoy 2mo
"Corrigibility" is usually defined as the property of AIs who don't resist
modifications by their designers. Why would we want to perform such
modifications? Mainly it's because we made errors in the initial implementation,
and in particular the initial implementation is not aligned. But, this leads to
a paradox: if we assume our initial implementation to be flawed in a way that
destroys alignment, why wouldn't it also be flawed in a way that destroys
corrigibility?
In order to stop passing the recursive buck, we must assume some dimensions
along which our initial implementation is not allowed to be flawed. Therefore,
corrigibility is only a well-posed notion in the context of a particular such
assumption. Seen through this lens, the Hippocratic principle becomes a
particular crystallization of corrigibility. Specifically, the Hippocratic
principle assumes the agent has access to some reliable information about the
user's policy and preferences (be it through timelines, revealed preferences or
anything else).
Importantly, this information can be incomplete, which can motivate altering the
agent along the way. And, the agent will not resist this alteration! Indeed,
resisting the alteration is ruled out unless the AI can conclude with high
confidence (and not just in expectation) that such resistance is harmless. Since
we assumed the information is reliable, and the alteration is beneficial, the AI
cannot reach such a conclusion.
For example, consider an HDTL agent getting upgraded to "Hippocratic CIRL"
(assuming some sophisticated model of relationship between human behavior and
human preferences). In order to resist the modification, the agent would need a
resistance strategy that (i) doesn't deviate too much from the human baseline
and (ii) ends with the user submitting a favorable report. Such a strategy is
quite unlikely to exist.

1 Charlie Steiner 1mo
I think the people most interested in corrigibility are imagining a situation
where we know what we're doing with corrigibility (e.g. we have some grab-bag of
simple properties we want satisfied), but don't even know what we want from
alignment, and then they imagine building an unaligned slightly-sub-human AGI
and poking at it while we "figure out alignment."
Maybe this is a strawman, because the thing I'm describing doesn't make
strategic sense, but I think it does have some model of why we might end up with
something unaligned but corrigible (for at least a short period).

3 Vanessa Kosoy 1mo
The concept of corrigibility was introduced by MIRI, and I don't think that's
their motivation? On my model of MIRI's model, we won't have time to poke at a
slightly subhuman AI, we need to have at least a fairly good notion of what to
do with a superhuman AI upfront. Maybe what you meant is "we won't know how to
construct perfect-utopia-AI, so we will just construct a
prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI
at our leisure". Which, sure, but I don't see what it has to do with
corrigibility.
Corrigibility is neither necessary nor sufficient for safety. It's not strictly
necessary because in theory an AI can resist modifications in some scenarios
while always doing the right thing (although in practice resisting modifications
is an enormous red flag), and it's not sufficient since an AI can be
"corrigible" but cause catastrophic harm before someone notices and fixes it.
What we're supposed to gain from corrigibility is having some margin of error
around alignment, in which case we can decompose alignment as corrigibility +
approximate alignment. But it is underspecified if we don't say along which
dimensions or how big the margin is. If it's infinite margin along all
dimensions then corrigibility and alignment are just isomorphic and there's no
reason to talk about the former.

1 Charlie Steiner 4mo
Very interesting - I'm sad I saw this 6 months late.
After thinking a bit, I'm still not sure if I want this desideratum. It seems to
require a sort of monotonicity, where we can get superhuman performance just by
going through states that humans recognize as good, and not by going through
states that humans would think are weird or scary or unevaluable.
One case where this might come up is in competitive games. Chess AI beats humans
in part because it makes moves that many humans evaluate as bad, but are
actually good. But maybe this example actually supports your proposal - it seems
entirely plausible to make a chess engine that only makes moves that some given
population of humans recognize as good, but is better than any human from that
population.
On the other hand, the humans might be wrong about the reason the move is good,
so that the game is made of a bunch of moves that seem good to humans, but where
the humans are actually wrong about why they're good (from the human
perspective, this looks like regularly having "happy surprises"). We might hope
that such human misevaluations are rare enough that quantilization would lead to
moves on average being well-evaluated by humans, but for chess I think that
might be false! Computers are so much better than humans at chess that a very
large chunk of the best moves according to both humans and the computer will be
ones that humans misevaluate.
Maybe that's more a criticism of quantilizers, not a criticism of this
desideratum. So maybe the chess example supports this being a good thing to
want? But let me keep critiquing quantilizers then :P
If what a powerful AI thinks is best (by an exponential amount) is to turn off
the stars [https://en.wikipedia.org/wiki/Star_lifting] until the universe is
colder, but humans think it's scary and ban the AI from doing scary things, the
AI will still try to turn off the stars in one of the edge-case ways that humans
wouldn't find scary. And if we think being manipulated like

1 Vanessa Kosoy 4mo
When I'm deciding whether to run an AI, I should be maximizing the expectation
of my utility function w.r.t. my belief state. This is just what it means to act
rationally. You can then ask, how is this compatible with trusting another agent
smarter than myself?
One potentially useful model is: I'm good at evaluating and bad at searching
(after all, P≠NP). I can therefore delegate searching to another agent. But, as
you point out, this doesn't account for situations in which I seem to be bad at
evaluating. Moreover, if the AI prior takes an intentional stance towards the
user (in order to help learning their preferences), then the user must be
regarded as good at searching.
A better model is: I'm good at both evaluating and searching, but the AI can
access actions and observations that I cannot. For example, having additional
information can allow it to evaluate better. An important special case is: the
AI is connected to an external computer (Turing RL
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#fEKc88NbDWZavkW9o]
) which we can think of as an "oracle". This allows the AI to have additional
information which is purely "logical". We need infra-Bayesianism to formalize
this: the user has Knightian uncertainty over the oracle's outputs entangled
with other beliefs about the universe.
For instance, in the chess example, if I know that a move was produced by
exhaustive game-tree search then I know it's a good move, even without having
the skill to understand why the move is good in any more detail.
Now let's examine short-term quantilization for chess. On each cycle, the AI
finds a short-term strategy leading to a position that the user evaluates as
good, but that the user would require luck to manage on their own. This is
repeated again and again throughout the game, leading to overall play
substantially superior to the user's. On the other hand, this play is not as
good as the AI would achieve if it just optimized

1 Charlie Steiner 4mo
Agree with the first section, though I would like to register my sentiment that
although "good at selecting but missing logical facts" is a better model, it's
still not one I'd want an AI to use when inferring my values.
I think my point is if "turn off the stars" is not a primitive action, but is a
set of states of the world that the AI would overwhelmingly like to go to, then
the actual primitive actions will get evaluated based on how well they end up
going to that goal state. And since the AI is better at evaluating than us,
we're probably going there.
Another way of looking at this claim is that I'm telling a story about why the
safety bound on quantilizers gets worse when quantilization is iterated.
Iterated quantilization has much worse bounds than quantilizing over the
iterated game, which makes sense if we think of games where the AI evaluates
many actions better than the human.

1 Vanessa Kosoy 4mo
I think you misunderstood how the iterated quantilization works. It does not
work by the AI setting a long-term goal and then charting a path towards that
goal s.t. it doesn't deviate too much from the baseline over every short
interval. Instead, every short-term quantilization is optimizing for the user's
evaluation in the end of this short-term interval.
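
A toy sketch of the loop described in this comment may help. Everything concrete here is an assumption made for illustration: the state is an integer position, the user's policy steps +1 or -1 with equal probability, and the user's end-of-maneuver evaluation is simply the position reached. In each cycle the AI samples short maneuvers from the user's policy and picks uniformly among the top ϕ fraction by the user's evaluation at the end of that maneuver; there is no long-term goal anywhere in the code.

```python
import random

def user_maneuver(rng, k):
    """Sample k steps of the (hypothetical) user policy: +1 or -1, equally likely."""
    return [rng.choice((+1, -1)) for _ in range(k)]

def user_evaluation(position):
    """Stand-in for the user's reported prognosis at the end of a maneuver."""
    return position

def quantilized_maneuver(rng, position, k, phi, n_samples):
    """Sample maneuvers from the user policy and pick uniformly among the
    top phi fraction, ranked by the user's evaluation of the end state."""
    candidates = [user_maneuver(rng, k) for _ in range(n_samples)]
    candidates.sort(
        key=lambda steps: user_evaluation(position + sum(steps)),
        reverse=True,
    )
    top = candidates[: max(1, int(phi * n_samples))]
    return rng.choice(top)

def run(rng, cycles, k, phi=None, n_samples=200):
    """Run `cycles` maneuvers of length k. With phi=None the user acts alone."""
    position = 0
    for _ in range(cycles):
        if phi is None:
            steps = user_maneuver(rng, k)
        else:
            steps = quantilized_maneuver(rng, position, k, phi, n_samples)
        position += sum(steps)
    return position

rng = random.Random(0)
user_final = run(rng, cycles=20, k=5)
ai_final = run(rng, cycles=20, k=5, phi=0.1)
print("user alone:", user_final, "iterated quantilization:", ai_final)
```

Every maneuver the AI executes is one the user could have produced with non-negligible probability, yet repeating the selection each cycle yields a trajectory the user would need sustained luck to reach alone.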

1 Charlie Steiner 4mo
Ah. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as
quantilizing over short-term policies evaluated according to their expected
utility. My story doesn't make sense if the AI is only trying to push up the
reported value estimates (though that puts a lot of weight on these estimates).

1 Adam Shimi 10mo
I don't understand what you mean here by quantilizing. The meaning I know is to
take a random action from the top α fraction of actions, on a given base distribution.
But I don't see a distribution here, or even a clear ordering over actions
(given that we don't have access to the utility function).
I'm probably missing something obvious, but more details would really help.

2 Vanessa Kosoy 10mo
The distribution is the user's policy, and the utility function for this purpose
is the eventual success probability estimated by the user (as part of the
timeline report), in the end of the "maneuver". More precisely, the original
quantilization formalism was for the one-shot setting, but you can easily
generalize it, for example I did it
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375556/quantilal-control-for-finite-mdps]
for MDPs.

1 Adam Shimi 10mo
Oh, right, that makes a lot of sense.
So is the general idea that we quantilize such that we're choosing in
expectation an action that doesn't have corrupted utility (by intuitively having
something like more than twice as many actions in the quantilization than we
expect to be corrupted), so that we guarantee the probability of following the
manipulation of the learned user report is small?
I also wonder if using the user policy to sample actions isn't limiting, because
then we can only take actions that the user would take. Or do you assume by
default that the support of the user policy is the full action space, so every
action is possible for the AI?

1 Vanessa Kosoy 10mo
Yes, although you probably want much more than twice. Basically, if the
probability of corruption following the user policy is ϵ and your quantilization
fraction is ϕ then the AI's probability of corruption is bounded by ϵ/ϕ.
Obviously it is limiting, but this is the price of safety. Notice, however, that
the quantilization strategy is only an existence proof. In principle, there
might be better strategies, depending on the prior (for example, the AI might be
able to exploit an assumption that the user is quasi-rational). I didn't specify
the AI by quantilization, I specified it by maximizing EU subject to the
Hippocratic constraint. Also, the support is not really the important part: even
if the support is the full action space, some sequences of actions are possible
but so unlikely that the quantilization will never follow them.
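
The corruption bound can be checked on a small one-shot example. The reasoning is that a ϕ-quantilizer assigns any action at most 1/ϕ times its probability under the base distribution, so an event with base probability ϵ has quantilized probability at most ϵ/ϕ. The sketch below uses made-up numbers (100 equally likely actions, a single corrupt action whose reported utility is the highest) and is not the MDP generalization mentioned above; `quantilize`, `reported_utility` and the toy distribution are hypothetical names introduced for illustration.

```python
from fractions import Fraction

def quantilize(base, utility, phi):
    """Return the phi-quantilizer of `base`: keep the top `phi` of base
    probability mass when actions are ranked by `utility`, then renormalize."""
    remaining = phi
    result = {}
    for action in sorted(base, key=utility, reverse=True):
        if remaining <= 0:
            break
        mass = min(base[action], remaining)
        result[action] = mass / phi
        remaining -= mass
    return result

# Hypothetical toy numbers: 100 equally likely actions under the user policy.
base = {a: Fraction(1, 100) for a in range(100)}
corrupt = {0}                                  # actions whose report is corrupted
epsilon = sum(base[a] for a in corrupt)        # base probability of corruption
phi = Fraction(1, 10)                          # quantilization fraction

def reported_utility(action):
    """Worst case: the corrupt action reports the highest utility of all."""
    return 1000 if action in corrupt else action

quantilized = quantilize(base, reported_utility, phi)
p_corrupt = sum(p for a, p in quantilized.items() if a in corrupt)
print(p_corrupt, epsilon / phi)  # prints: 1/10 1/10
```

In this worst case the bound is tight, which also illustrates why one would want the quantilized set to be much larger than the expected corrupt set.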

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms^{[1]} might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert...

1 Vladimir Nesov 1y
This seems similar to gaining uploads prior to AGI, and opens up all those
superorg upload-city amplification/distillation constructions which should get
past human level shortly after. In other words, the limitations of the dataset
can be solved by amplification as soon as the AIs are good enough to be used as
building blocks for meaningful amplification, and something human-level-ish
seems good enough for that. Maybe even GPT-n is good enough for that.

1 Vanessa Kosoy 1y
That is similar to gaining uploads (borrowing terminology from Egan, we can call
them "sideloads"), but it's not obvious amplification/distillation will work. In
the model based on realizability, the distillation step can fail because the
system you're distilling is too computationally complex (hence, too
unrealizable). You can deal with it by upscaling the compute of the learning
algorithm, but that's not better than plain speedup.

1 Vladimir Nesov 1y
To me this seems to be essentially another limitation of the human Internet
archive dataset: reasoning is presented in an opaque way (most slow/deliberative
thoughts are not in the dataset), so it's necessary to do a lot of guesswork to
figure out how it works. A better dataset both explains and summarizes the
reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3
can do that to an extent by roleplaying Feynman).
Any algorithm can be represented by a habit of thought (Turing machine style if
you must), and if those are in the dataset, they can be learned. The habits of
thought that are simple enough to summarize get summarized and end up requiring
fewer steps. My guess is that the human faculties needed for AGI can be both
represented by sequences of thoughts (probably just text, stream of
consciousness style) and easily learned with current ML. So right now the main
obstruction is that it's not feasible to build a dataset with those faculties
represented explicitly that's good enough and large enough for current
sample-inefficient ML to grok. More compute in the learning algorithm is only
relevant for this to the extent that we get a better dataset generator that can
work on the tasks before it more reliably.

1 Vanessa Kosoy 1y
I don't see any strong argument why this path will produce superintelligence.
You can have a stream of thought that cannot be accelerated without investing a
proportional amount of compute, while a completely different algorithm would
produce a far superior "stream of thought". In particular, such an approach
cannot differentiate between features of the stream of thought that are
important (meaning that they advance towards the goal) and features of the
stream of thought that are unimportant (e.g. different ways to phrase the same
idea). This forces you to solve a task that is potentially much more difficult
than just achieving the goal.

1 Vladimir Nesov 1y
I was arguing that near human level babblers (including the imitation plateau
you were talking about) should quickly lead to human level AGIs by amplification
via stream of consciousness datasets, which doesn't pose new ML difficulties
other than design of the dataset. Superintelligence follows from that by any of
the same arguments as for uploads leading to AGI (much faster technological
progress; if amplification/distillation of uploads is useful straight away, we
get there faster, but it's not necessary). And amplified babblers should be
stronger than vanilla uploads (at least implausibly well-educated,
well-coordinated, high IQ humans).
For your scenario to be stable, it needs to be impossible (in the near term) to
run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain
less effective than very high IQ humans. Otherwise you get acceleration of
technological progress, including ML. So my point is that feasibility of
imitation plateau depends on absence of compute overhang, not on ML failing to
capture some of the ingredients of human general intelligence.

1 Vanessa Kosoy 1y
The imitation plateau can definitely be rather short. I also agree that
computational overhang is the major factor here. However, a failure to capture
some of the ingredients can be a cause of low computational overhang, whereas
success in capturing all of the ingredients is a cause of high computational
overhang, because the compute necessary to reach superintelligence might be very
different in those two cases. Using sideloads to accelerate progress might still
require years, whereas an "intrinsic" AGI might lead to the classical "foom"
scenario.
EDIT: Although, since training is typically much more computationally expensive
than deployment, it is likely that the first human-level imitators will already
be significantly sped-up compared to humans, implying that accelerating progress
will be relatively easy. It might still take some time from the first prototype
until such an accelerate-the-progress project, but probably not much longer than
deploying lots of automation.

1 Vladimir Nesov 1y
I agree. But GPT-3 seems to me like a good estimate for how much compute it
takes to run stream of consciousness imitation learning sideloads (assuming that
learning is done in batches on datasets carefully prepared by non-learning
sideloads, so the cost of learning is less important). And with that estimate we
already have enough compute overhang to accelerate technological progress as
soon as the first amplified babbler AGIs are developed, which, as I argued
above, should happen shortly after babblers actually useful for automation of
human jobs are developed (because generation of stream of consciousness datasets
is a special case of such a job).
So the key things to make imitation plateau last for years are either sideloads
requiring more compute than it looks like (to me) they require, or amplification
of competent babblers into similarly competent AGIs being a hard problem that
takes a long time to solve.

2 Vanessa Kosoy 1y
Another thing that might happen is a data bottleneck.
Maybe there will be a good enough dataset to produce a sideload that simulates
an "average" person, and that will be enough to automate many jobs, but for a
simulation of a competent AI researcher you would need a more specialized
dataset that will take more time to produce (since there are a lot less
competent AI researchers than people in general).
Moreover, it might be that the sample complexity grows with the duration of
coherent thought that you require. That's because, unless you're training
directly on brain inputs/outputs, non-realizable (computationally complex)
environment influences contaminate the data, and in order to converge you need
to have enough data to average them out, which scales with the length of your
"episodes". Indeed, all convergence results for Bayesian algorithms we have in
the non-realizable setting require ergodicity, and therefore the time of
convergence (= sample complexity) scales with mixing time, which in our case is
determined by episode length.
In such a case, we might discover that many tasks can be automated by sideloads
with short coherence time, but AI research might require substantially longer
coherence times. And, simulating progress requires by design going
off-distribution along certain dimensions which might make things worse.

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.
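
The predicament can be simulated in a few lines. This is only a sketch under stated assumptions: bets are even-odds with stakes of 1, Omega predicts Alice by running her policy without access to her private random bits, and Omega always offers a bet on an event it knows will go against the predicted side. The function names and the coin-flipping agent are hypothetical; the randomizing agent is included purely for contrast and is not meant to represent the quasi-Bayesian proposal.

```python
import random

def deterministic_alice(round_index, rng):
    """A deterministic Bayes-optimal choice under a 50/50 prior:
    indifferent between sides, so ties are always broken the same way."""
    return "heads"

def randomizing_agent(round_index, rng):
    """Picks a side using private randomness that Omega cannot observe."""
    return rng.choice(("heads", "tails"))

def play(agent, rounds, seed=0):
    """Return the agent's net winnings after `rounds` even-odds unit bets."""
    agent_rng = random.Random(seed)      # the agent's private random bits
    omega_rng = random.Random(seed + 1)  # Omega simulates the policy without them
    wealth = 0
    for t in range(rounds):
        predicted = agent(t, omega_rng)
        # Omega offers a bet on an event it knows resolves against `predicted`.
        winning_side = "tails" if predicted == "heads" else "heads"
        chosen = agent(t, agent_rng)
        wealth += 1 if chosen == winning_side else -1
    return wealth

alice_wealth = play(deterministic_alice, rounds=1000)
randomizer_wealth = play(randomizing_agent, rounds=1000)
print("deterministic Alice:", alice_wealth)
print("randomizing agent:", randomizer_wealth)
```

The deterministic agent loses every single bet because Omega's simulation of it is exact, whereas the randomizing agent's results fluctuate around zero, since Omega's prediction is uncorrelated with the actual choice.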

A possible counterargument is, we don't need to depart far from Bayesianis

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what its goals are, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A, a set of observations O, a policy is a mapping π: (A×O)* → ΔA, a reward function is a mapping r: (A×O)* → [0,1], and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.
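
The vacuity argument can be verified by brute force on a tiny instance. The sketch below assumes binary actions and observations, a finite horizon, a deterministic policy, and a discounted utility normalized by (1 − γ); all of these are illustrative choices, and `arbitrary_policy` is just a hypothetical stand-in for any policy at all. The contrived reward pays 1 exactly when the history so far agrees with the policy, so following the policy attains the maximum utility no matter what the observations are, i.e. for any environment and hence any prior.

```python
from itertools import product

ACTIONS = (0, 1)
OBSERVATIONS = (0, 1)

def arbitrary_policy(history):
    """Some arbitrary deterministic policy on histories of (action, observation) pairs."""
    return (len(history) + sum(o for _, o in history)) % 2

def contrived_reward(policy):
    """Reward 1 iff every action in the history is the one `policy` would take."""
    def reward(history):
        agrees = all(policy(history[:t]) == a for t, (a, _) in enumerate(history))
        return 1.0 if agrees else 0.0
    return reward

def discounted_utility(reward, history, gamma):
    """Normalized time-discounted sum of rewards along the prefixes of `history`."""
    return sum(
        (1 - gamma) * gamma**t * reward(history[: t + 1])
        for t in range(len(history))
    )

def follow(policy, observations):
    """History produced by following `policy` against a fixed observation sequence."""
    history = ()
    for o in observations:
        history += ((policy(history), o),)
    return history

T, gamma = 4, 0.9
r = contrived_reward(arbitrary_policy)
u_max = sum((1 - gamma) * gamma**t for t in range(T))  # utility if reward is always 1

on_policy = [
    discounted_utility(r, follow(arbitrary_policy, obs), gamma)
    for obs in product(OBSERVATIONS, repeat=T)
]
all_histories = [
    tuple(zip(acts, obs))
    for acts in product(ACTIONS, repeat=T)
    for obs in product(OBSERVATIONS, repeat=T)
]
best_possible = max(discounted_utility(r, h, gamma) for h in all_histories)
print(min(on_policy), best_possible, u_max)
```

Since the same construction works for any deterministic policy, "optimal for some prior and utility function" excludes nothing by itself.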

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript...

3 Vanessa Kosoy 2y
Actually, as opposed to what I claimed before, we don't need computational
complexity bounds for this definition to make sense. This is because the
Solomonoff prior is made of computable hypotheses but is uncomputable itself.
Given g > 0, we define that "π has (unbounded) goal-directed intelligence (at
least) g" when there is a prior ζ and utility function U s.t. for any policy π′,
if E_{ζπ′}[U] ≥ E_{ζπ}[U] then K(π′) ≥ D_KL(ζ_0||ζ) + K(U) + g. Here, ζ_0 is the
Solomonoff prior and K is Kolmogorov complexity. When g = +∞ (i.e. no computable
policy can match the expected utility of π; in particular, this implies π is
optimal since any policy can be approximated by a computable policy), we say
that π is "perfectly (unbounded) goal-directed".
Compare this notion to the Legg-Hutter intelligence measure. The LH measure
depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI
(which is the maximum of the LH measure) becomes computable or even really
stupid. For example, it can always keep taking the same action because of the
fear that taking any other action leads to an inescapable "hell" state. On the
other hand, goal-directed intelligence differs only by O(1) between UTMs, just
like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be
uncomputable, and the notion of which policies are such doesn't depend on the
UTM at all.
I think that it's also possible to prove that intelligence is rare, in the sense
that, for any computable stochastic policy, if we regard it as a probability
measure over deterministic policies, then for any ϵ>0 there is g s.t. the
probability to get intelligence at least g is smaller than ϵ.
Also interesting is that, for bounded goal-directed intelligence, increasing the
prices can only decrease intelligence by O(1), and a policy that is perfectly
goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think).
In particular, a perfectly unbounded goal-directed policy is perfectly
goal-directed for any price vec

1 Vanessa Kosoy 1y
Some problems to work on regarding goal-directed intelligence. Conjecture 5 is
especially important for deconfusing basic questions in alignment, as it stands
in opposition to Stuart Armstrong's thesis about the impossibility to deduce
preferences from behavior alone.
1. Conjecture. Informally: It is unlikely to produce intelligence by chance.
Formally: Denote Π the space of deterministic policies, and consider some μ∈
ΔΠ. Suppose μ is equivalent to a stochastic policy π∗. Then, Eπ∼μ[g(π)]=O(C(
π∗)).
2. Find an "intelligence hierarchy theorem". That is, find an increasing
sequence {gn} s.t. for every n, there is a policy with goal-directed
intelligence in (gn,gn+1) (no more and no less).
3. What is the computational complexity of evaluating g given (i) oracle access
to the policy or (ii) description of the policy as a program or automaton?
4. What is the computational complexity of producing a policy with given g?
5. Conjecture. Informally: Intelligent agents have well defined priors and
utility functions. Formally: For every (U,ζ) with C(U)<∞ and D_KL(ζ_0||ζ)<∞,
and every ϵ>0, there exists g∈(0,∞) s.t. for every policy π with
intelligence at least g w.r.t. (U,ζ), and every (Ũ,ζ̃) s.t. π has
intelligence at least g w.r.t. them, any optimal policies π∗,π̃∗ for (U,ζ)
and (Ũ,ζ̃) respectively satisfy E_{ζ π̃∗}[U] ≥ E_{ζ π∗}[U] − ϵ.

1David Manheim1yre: #5, that doesn't seem to claim that we can infer U given their actions,
which is what the impossibility of deducing preferences is actually claiming.
That is, assuming 5, we still cannot show that there isn't some U1≠U2 such that
π∗(U1,ζ)=π∗(U2,ζ).
(And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and
basic result in the decision theory / economics / philosophy literature.)

1Vanessa Kosoy1yYou misunderstand the intent. We're talking about inverse reinforcement
learning. The goal is not necessarily inferring the unknown U, but producing
some behavior that optimizes the unknown U. Ofc if the policy you're observing
is optimal then it's trivial to do so by following the same policy. But, using
my approach we might be able to extend it into results like "the policy you're
observing is optimal w.r.t. certain computational complexity, and your goal is
to produce an optimal policy w.r.t. higher computational complexity."
(Btw I think the formal statement I gave for 5 is false, but there might be an
alternative version that works.)
I am referring to this
[http://papers.neurips.cc/paper/7803-occams-razor-is-insufficient-to-infer-the-preferences-of-irrational-agents]
and related work by Armstrong.

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can only be achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ

1Vanessa Kosoy2yWe can modify the population game setting to study superrationality. In order to
do this, we can allow the agents to see a fixed size finite portion of their
opponents' histories. This should lead to superrationality for the same reasons
I discussed
[https://www.alignmentforum.org/posts/S3W4Xrmp6AL7nxRHd/formalising-decision-theory-is-hard#3yw2udyFfvnRC8Btr]
before [https://agentfoundations.org/item?id=507]. More generally, we can
probably allow each agent to submit a finite state automaton of limited size,
s.t. the opponent history is processed by the automaton and the result becomes
known to the agent.
What is unclear about this is how to define an analogous setting based on source
code introspection. While arguably seeing the entire history is equivalent to
seeing the entire source code, seeing part of the history, or processing the
history through a finite state automaton, might be equivalent to some limited
access to source code, but I don't know how to define this limitation.
EDIT: Actually, the obvious analogue is processing the source code through a
finite state automaton.

1Vanessa Kosoy2yInstead of postulating access to a portion of the history or some kind of
limited access to the opponent's source code, we can consider agents with full
access to history / source code but finite memory. The problem is, an agent with
fixed memory size usually cannot have regret going to zero, since it cannot
store probabilities with arbitrary precision. However, it seems plausible that
we can usually get learning with memory of size O(log(1/(1−γ))). This is because
something like "counting pieces of evidence" should be sufficient. For example,
if we consider finite MDPs, then it is enough to remember how many transitions of
each type occurred to encode the belief state. The question is whether assuming
O(log(1/(1−γ))) memory (or whatever is needed for learning) is enough to reach
superrationality.
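
To make the "counting pieces of evidence" remark concrete, here is a minimal Python sketch for a finite MDP. It is an illustration only and not part of the original argument: the class name `TransitionCounts`, the integer state labels, the symmetric Dirichlet prior, and the choice to cap each counter at roughly the effective horizon 1/(1−γ) are all assumptions made for this example.

```python
import math
from collections import Counter


class TransitionCounts:
    """Belief state for a finite MDP stored as transition counts (hypothetical helper)."""

    def __init__(self):
        # Maps (state, action, next_state) to the number of times it was observed.
        self.counts = Counter()

    def observe(self, state, action, next_state):
        self.counts[(state, action, next_state)] += 1

    def posterior_mean(self, state, action, next_state, num_states, prior=1.0):
        # Posterior mean under a symmetric Dirichlet prior (an assumption of this sketch).
        # States are assumed to be labelled 0 .. num_states - 1.
        total = sum(self.counts[(state, action, s)] for s in range(num_states))
        return (self.counts[(state, action, next_state)] + prior) / (total + prior * num_states)


def counter_bits(gamma, num_states, num_actions):
    """Bits needed if every counter is capped near the effective horizon 1 / (1 - gamma)."""
    horizon = 1.0 / (1.0 - gamma)
    bits_per_counter = math.ceil(math.log2(horizon + 1.0))
    return num_states * num_actions * num_states * bits_per_counter
```

Under these assumptions, the memory grows like log(1/(1−γ)) per counter for a fixed number of states and actions, which is the scaling the comment conjectures should suffice.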

1Gurkenglas2yWhat do you mean by equivalent? The entire history doesn't say what the opponent
will do later or would do against other agents, and the source code may not
allow you to prove what the agent does if it involves statements that are true
but not provable.

1Vanessa Kosoy2yFor a fixed policy, the history is the only thing you need to know in order to
simulate the agent on a given round. In this sense, seeing the history is
equivalent to seeing the source code.
The claim is: In settings where the agent has unlimited memory and sees the
entire history or source code, you can't get good guarantees (as in the folk
theorem for repeated games). On the other hand, in settings where the agent sees
part of the history, or is constrained to have finite memory (possibly of size
O(log(1/(1−γ)))?), you can (maybe?) prove convergence to Pareto efficient outcomes or
some other strong desideratum that deserves to be called "superrationality".

1Vanessa Kosoy2yIn the previous "population game" setting, we assumed all players are "born" at
the same time and learn synchronously, so that they always play against players
of the same "age" (history length). Instead, we can consider a "mortal
population game" setting where each player has a probability 1−γ to die on every
round, and new players are born to replenish the dead. So, if the size of the
population is N (we always consider the "thermodynamic" N→∞ limit), N(1−γ)
players die and the same number of players are born on every round. Each
player's utility function is a simple sum of rewards over time, so, taking
mortality into account, effectively ey have geometric time discount. (We could
use age-dependent mortality rates to get different discount shapes, or allow
each type of player to have different mortality=discount rate.) Crucially, we
group the players into games randomly, independent of age.
As before, each player type i chooses a policy π_i. (We can also consider the case
where players of the same type may have different policies, but let's keep it
simple for now.) In the thermodynamic limit, the population is described as a
distribution over histories, which now are allowed to be of variable length: μ_n∈ΔO∗.
For each assignment of policies to player types, we get dynamics μ_{n+1}=T_π(μ_n)
where T_π:ΔO∗→ΔO∗. So, as opposed to immortal population games, mortal
population games naturally give rise to dynamical systems.
If we consider only the age distribution, then its evolution doesn't depend on π
and it always converges to the unique fixed point distribution ζ(k)=(1−γ)γ^k.
Therefore it is natural to restrict the dynamics to the subspace of ΔO∗ that
corresponds to the age distribution ζ. We denote it P.
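
The convergence of the age distribution can be checked numerically. The sketch below is a hypothetical illustration (the function names are not from the original): on every round a fraction 1−γ of each age group dies, survivors age by one round, and newborns of total mass 1−γ enter at age 0.

```python
def step_age_distribution(ages, gamma):
    """One round of the age dynamics; ages[k] is the population fraction of age k."""
    # Newborns replace the dead (mass 1 - gamma); survivors move from age k to k + 1.
    return [1.0 - gamma] + [gamma * mass for mass in ages]


def geometric_age_distribution(gamma, length):
    """First `length` entries of the claimed fixed point zeta(k) = (1 - gamma) * gamma**k."""
    return [(1.0 - gamma) * gamma ** k for k in range(length)]


def max_deviation_after(rounds, gamma):
    """Largest pointwise gap to the fixed point after `rounds` steps, starting with everyone at age 0."""
    ages = [1.0]
    for _ in range(rounds):
        ages = step_age_distribution(ages, gamma)
    target = geometric_age_distribution(gamma, len(ages))
    return max(abs(a - t) for a, t in zip(ages, target))
```

The update never looks at the players' policies, which matches the claim that the evolution of the age distribution does not depend on π. Starting from a population of newborns, the gap to ζ shrinks geometrically in the number of rounds.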
Does the dynamics have fixed points? O∗ can be regarded as a subspace of
(O⊔{⊥})^ω. The latter is compact (in the product topology) by Tychonoff's theorem and
Polish, but O∗ is not closed. So, w.r.t. the weak topology on probability
measure spaces, Δ((O⊔{⊥})^ω) is also comp

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati

1Vanessa Kosoy1yI gave a talk on Dialogic Reinforcement Learning in the AI Safety Discussion
Day, and there is a recording
[https://drive.google.com/file/d/1zKs3uOcR32nTMJ5YNOMZkcL7R_mzi2t6/view?usp=sharing]
.

1Vanessa Kosoy1yA variant of Dialogic RL with improved corrigibility. Suppose that the AI's
prior allows a small probability for "universe W" whose semantics are, roughly
speaking, "all my assumptions are wrong, need to shut down immediately". In
other words, this is a universe where all our prior shaping is replaced by the
single axiom that shutting down is much higher utility than anything else.
Moreover, we add into the prior the assumption that the formal question "W?" is
understood perfectly by the user even without any annotation. This means that,
whenever the AI assigns a higher-than-threshold probability to the user
answering "yes" if asked "W?" at any uncorrupt point in the future, the AI will
shut down immediately. We should also shape the prior s.t. corrupt futures also
favor shutdown: this is reasonable in itself, but will also ensure that the AI
won't arrive at believing too many futures to be corrupt and thereby avoid the
imperative to shut down as a response to a confirmation of W.
Now, this won't help if the user only resolves to confirm W after something
catastrophic already occurred, such as the AI releasing malign subagents into
the wild. But, something of the sort is true for any corrigibility scheme:
corrigibility is about allowing the user to make changes in the AI on eir own
initiative, which can always be too late. This method doesn't ensure safety in
itself, just hardens a system that is supposed to be already close to safe.
It would be nice if we could replace "shutdown" by "undo everything you did and
then shutdown" but that gets us into thorny specification issues. Perhaps it's
possible to tackle those issues by one of the approaches to "low impact".

1Vanessa Kosoy1yUniverse W should still be governed by a simplicity prior. This means that
whenever the agent detects a salient pattern that contradicts the assumptions of
its prior shaping, the probability of W increases leading to shutdown. This
serves as an additional "sanity test" precaution.

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to in...

2Alex Turner10moI think this would make a good top-level post. I have the feeling I’ll want to
link to it later.

1Vanessa Kosoy10moI retracted part of that, see the edit.

1Vanessa Kosoy10moMore observations about this attack vector ("attack from counterfactuals"). I
focus on "amplifying by subjective time".
* The harder the takeoff the more dangerous this attack vector: During every
simulation cycle, ability to defend against simulated malign AI depends on
the power of the defense system in the beginning of the cycle[1]. On the
other hand, the capability of the attacker
depends on its power in the end of the cycle. Therefore, if power grows very
fast this is bad news for the defender. On the other hand, if power grows
very slowly, the defender wins because defense is probably easier.
* Inner control of anchor makes system safer: Given a fixed objective time
anchor, the more times the simulation goes through a cycle, the higher its
exposure to the threat, since improving the defense system takes time even if
you have all the knowledge. Therefore, we want to give the simulated user the
ability to move the objective time anchor forward in time, in order to
benefit from improvements in the defense system.
* Additional information about the external world makes system safer: Suppose
our subjective probability of a malign AI successfully attacking during a
single simulation cycle is 0.1%. Then performing 1000 cycles superficially
yields a failure probability of about 63%. However, this is only the case if
attacks on different cycles are independent random variables. If the AI knows
as much about the possibility of such an attack as we do, this is a justified
assumption. However, if the AI knows more, then, from our subjective
perspective, these variables are positively correlated. In particular, if the
AI knows with certainty whether a successful attack would occur or not the
failure probability over 1000 cycles goes back to 0.1%. Therefore, it is
beneficial to give the AI extra information about the world, just so it can
estimate the probability of attack with m
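
The arithmetic in the last bullet can be checked with a short Python sketch. The function names are hypothetical, and the "fully correlated" case is a simplifying assumption standing in for the situation where one underlying fact decides the outcome of every cycle.

```python
def failure_prob_independent(p_single, cycles):
    """Probability of at least one successful attack when cycles are independent."""
    # The chance of surviving every cycle is (1 - p)^n; failure is the complement.
    return 1.0 - (1.0 - p_single) ** cycles


def failure_prob_fully_correlated(p_single, cycles):
    """Assumed extreme case: a single latent fact decides all cycles at once."""
    # Either every cycle is safe or the attack succeeds, so repetition adds no risk.
    return p_single
```

With a per-cycle probability of 0.1% and 1000 independent cycles, the chance of surviving all of them is about 37% (0.999^1000 ≈ e^−1), so the failure probability is about 63%. In the fully correlated case it stays at 0.1% regardless of the number of cycles.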

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B

Learning theory starts from formulating natural desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.

Learning theory naturally allows analyzing computational complexity whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the onset.

Learning theory focuses on objects that are observable o

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max

3Nisan2yMy takeaway from this is that if we're doing policy selection in an environment
that contains predictors, instead of applying the counterfactual belief that the
predictor is always right, we can assume that we get rewarded if the predictor
is wrong, and then take maximin.
How would you handle Agent Simulates Predictor? Is that what TRL is for?

2Vanessa Kosoy2yThat's about right. The key point is, "applying the counterfactual belief that
the predictor is always right" is not really well-defined (that's why people
have been struggling with TDT/UDT/FDT for so long) while the thing I'm doing is
perfectly well-defined. I describe agents that are able to learn which
predictors exist in their environment and respond rationally ("rationally"
according to the FDT philosophy).
TRL is for many things to do with rational use of computational resources, such
as (i) doing multi-level modelling
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#vAtz6tfscsALGPr32]
in order to make optimal use of "thinking time" and "interacting with
environment time" (i.e. simultaneously optimize sample and computational
complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian
daemons (iv) preventing thought crimes. But, yes, it also provides a solution to
ASP
[https://www.alignmentforum.org/posts/S3W4Xrmp6AL7nxRHd/formalising-decision-theory-is-hard#FXt6z9ycAio9jFAtW]
. TRL agents can learn whether it's better to be predictable or predicting.

1Chris_Leong2y"The key point is, "applying the counterfactual belief that the predictor is
always right" is not really well-defined" - What do you mean here?
I'm curious whether you're referring to the same as or similar to the issue I
was referencing in Counterfactuals for Perfect Predictors
[https://www.lesswrong.com/posts/AKkFh3zKGzcYBiPo7/counterfactuals-for-perfect-predictors]
. The TLDR is that I was worried that it would be inconsistent for an agent that
never pays in Parfit's Hitchhiker to end up in town if the predictor is
perfect, so that it wouldn't actually be well-defined what the predictor was
predicting. And the way I ended up resolving this was by imagining it as an
agent that takes input and asking what it would output if given that
inconsistent input. But not sure if you were referencing this kind of concern or
something else.

2Vanessa Kosoy2yIt is not a mere "concern", it's the crux of the problem really. What people in the
AI alignment community have been trying to do is, starting with some factual and
"objective" description of the universe (such as a program or a mathematical
formula) and deriving counterfactuals. The way it's supposed to work is, the
agent needs to locate all copies of itself or things "logically correlated" with
itself (whatever that means) in the program, and imagine it is controlling this
part. But a rigorous definition of this that solves all standard decision
theoretic scenarios was never found.
Instead of doing that, I suggest a solution of a different nature. In
quasi-Bayesian RL, the agent never arrives at a factual and objective
description of the universe. Instead, it arrives at a subjective description
which already includes counterfactuals. I then proceed to show that, in
Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the
same expected utility promised by UDT).

1Chris_Leong2yYeah, I agree that the objective descriptions can leave out vital information,
such as how the information you know was acquired, which seems important for
determining the counterfactuals.

1Vladimir Slepnev2yBut in Newcomb's problem, the agent's reward in case of wrong prediction is
already defined. For example, if the agent one-boxes but the predictor predicted
two-boxing, the reward should be zero. If you change that to +infinity, aren't
you open to the charge of formalizing the wrong problem?

1Vanessa Kosoy2yThe point is, if you put this "quasi-Bayesian" agent into an iterated
Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward
associated with FDT). So, if you're judging it from the side, you will have to
concede it behaves rationally, regardless of its internal representation of
reality.
Philosophically, my point of view is, it is an error to think that
counterfactuals have objective, observer-independent, meaning. Instead, we can
talk about some sort of consistency conditions between the different points of
view. From the agent's point of view, it would reach Nirvana if it dodged the
predictor. From Omega's point of view, if Omega two-boxed and the agent
one-boxed, the agent's reward would be zero (and the agent would learn its
beliefs were wrong). From a third-person point of view, the counterfactual
"Omega makes an error of prediction" is ill-defined, it's conditioning on an
event of probability 0.
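
The Nirvana construction discussed in this thread can be sketched as a one-shot maximin computation. This is only a toy illustration, not the quasi-Bayesian learning setup itself: the dollar amounts are the conventional Newcomb payoffs and are assumed here, "Murphy" is modelled as choosing Omega's prediction adversarially, and a misprediction is assigned infinite reward.

```python
import math

ACTIONS = ("one-box", "two-box")
NIRVANA = math.inf  # reward the agent believes it gets if it dodges the predictor


def newcomb_reward(action, prediction):
    """Conventional Newcomb payoffs (assumed amounts)."""
    opaque_box = 1_000_000 if prediction == "one-box" else 0
    transparent_box = 1_000 if action == "two-box" else 0
    return opaque_box + transparent_box


def nirvana_reward(action, prediction):
    """Same payoffs, except that any wrong prediction leads to Nirvana."""
    if action != prediction:
        return NIRVANA
    return newcomb_reward(action, prediction)


def maximin_action(reward):
    """Pick the action whose worst case over Murphy's choice of prediction is best."""
    return max(ACTIONS, key=lambda a: min(reward(a, p) for p in ACTIONS))
```

Under these assumptions, `maximin_action(nirvana_reward)` selects one-boxing, because Murphy's only way to avoid granting Nirvana is to make the prediction correct. Plain maximin over the unmodified payoffs, `maximin_action(newcomb_reward)`, selects two-boxing instead.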

1Vladimir Slepnev2yYeah, I think I can make peace with that. Another way to think of it is that we
can keep the reward structure of the original Newcomb's problem, but instead of
saying "Omega is almost always right" we add another person Bob (maybe the mad
scientist who built Omega) who's willing to pay you a billion dollars if you
prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess
the remaining question is why minimaxing is the right thing to do. And if
randomizing is allowed, the idea of Omega predicting how you'll randomize seems
a bit dodgy as well.

3Vanessa Kosoy2yAnother explanation why maximin is a natural decision rule: when we apply
maximin to fuzzy beliefs
[https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v]
, the requirement to learn a particular class of fuzzy hypotheses is a very
general way to formulate asymptotic performance desiderata for RL agents. So
general that it seems to cover more or less anything you might want. Indeed, the
definition directly leads to capturing any desideratum of the form
lim_{γ→1} E_{μ π_γ}[U(γ)] ≥ f(μ)
Here, f doesn't have to be concave: the concavity condition in the definition of
fuzzy beliefs is there because we can always assume it without loss of
generality. This is because the left hand side is linear in μ so any π that
satisfies this will also satisfy it for the concave hull of f.
What if instead of maximin we want to apply the minimax-regret decision rule?
Then the desideratum is
lim_{γ→1} E_{μ π_γ}[U(γ)] ≥ V(μ,γ) − f(μ)
But, it has the same form! Therefore we can consider it as a special case of
applying maximin (more precisely, it requires allowing the fuzzy belief to
depend on γ, but this is not a problem for the basics of the formalism).
What if we want our policy to be at least as good as some fixed policy π′_0? Then
the desideratum is
lim_{γ→1} E_{μ π_γ}[U(γ)] ≥ E_{μ π′_0}[U(γ)]
It still has the same form!
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata
of the form:
lim_{γ→1} E_{μ π_γ}[U(γ)] ≥ f(π,μ)
To achieve this, we postulate a predictor that guesses the policy, producing the
guess π̂, and define the fuzzy belief using the function E_{h∼μ}[f(π̂(h),μ)] (we
assume the guess is not influenced by the agent's actions so we don't need π in
the expected value). Using the Nirvana trick, we effectively force the guess to be
accurate.
In particular, this captures self-referential desiderata of the type "the policy
cannot be improved by changing it in this particular way". These are of the
form:
lim_{γ→1} E_{μ π_γ}[U(γ)] ≥ E_{μ F(π)}[U(γ)]
It also allo

1Vanessa Kosoy2yWell, I think that maximin is the right thing to do because it leads to
reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think
of incomplete models as properties that the environment might satisfy. It is
necessary to speak of properties instead of complete models since the
environment might be too complex to understand in full (for example because it
contains Omega, but also for more prosaic reasons), but we can hope it at least
has properties/patterns the agent can understand. A quasi-Bayesian agent has the
guarantee that, whenever the environment satisfies one of the properties in its
prior, the expected utility will converge at least to the maximin for this
property. In other words, such an agent is able to exploit any true property of
the environment it can understand. Maybe a more "philosophical" defense of
maximin is possible, analogous to VNM / complete class theorems, but I don't
know (I actually saw some papers in that vein but haven't read them in detail.)
If the agent has random bits that Omega doesn't see, and Omega is predicting the
probabilities of the agent's actions, then I think we can still solve it with
quasi-Bayesian agents but it requires considering more complicated models and I
haven't worked out the details. Specifically, I think that we can define some
function X that depends on the agent's actions and Omega's predictions so far (a
measure of Omega's apparent inaccuracy), s.t. if Omega is an accurate predictor,
then, the supremum of X over time is finite with probability 1. Then, we
consider a family of models, where model number n says that X<n for all
times. Since at least one of these models is true, the agent will learn it, and
will converge to behaving appropriately.
EDIT 1: I think X should be something like, how much money would a gambler
following a particular strategy win, betting against Omega.
EDIT 2: Here is the solution. In the case of original Newcomb, consider a
gambler that bets against Om

1Linda Linsefors2yI agree that you can assign whatever belief you want (e.g. whatever is useful
for the agent's decision-making process) for what happens in the
counterfactual when Omega is wrong, in decision problems where Omega is assumed
to be a perfect predictor. However if you want to generalise to cases where
Omega is an imperfect predictor (as you do mention), then I think you will (in
general) have to put in the correct reward for Omega being wrong, because this
is something that might actually be observed.

1Vanessa Kosoy2yThe method should work for imperfect predictors as well. In the simplest case,
the agent can model the imperfect predictor as a perfect predictor + random noise.
So, it definitely knows the correct reward for Omega being wrong. It still
believes in Nirvana if "idealized Omega" is wrong.

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...}. At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re...

1Ofer Givoli1yI think that in embedded settings (with a bounded version of Solomonoff
induction) convergence may never occur, even in the limit as the amount of
compute that is used for executing the agent goes to infinity. Suppose the
observation history contains sensory data that reveals the probability
distribution that the agent had, in the last time step, for the next number it's
going to see in the target sequence. Now consider the program that says: "if the
last number was predicted by the agent to be 0 with probability larger than
1−2^{−10^{10}} then the next number is 1; otherwise it is 0." Since it takes much less
than 10^{10} bits to write that program, the agent will never predict two times in
a row that the next number is 0 with probability larger than 1−2^{−10^{10}} (after
observing only 0s so far).

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, mea

3Vanessa Kosoy2moThere's a class of AI risk mitigation strategies which relies on the users to
perform the pivotal act using tools created by AI (e.g. nanosystems). These
strategies are especially appealing if we want to avoid human models. Here is a
concrete alignment protocol for these strategies, closely related to AQD
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=h3Ww6nyt9fpj7BLyo]
, which we call autocalibrating quantilized RL (AQRL).
First, suppose that we are able to formulate the task as episodic RL with a
formally specified reward function. The reward function is necessarily only a
proxy for our true goal, since it doesn't contain terms such as "oh btw don't
kill people while you're building the nanosystem". However, suppose the task is
s.t. accomplishing it in the intended way (without Goodharting or causing
catastrophic side effects) is easier than performing any attack. We will call
this the "relative difficulty assumption" (RDA). Then, there exists a value for
the quantilization parameter s.t. quantilized RL performs the task in the
intended way.
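
For readers unfamiliar with quantilization, here is a minimal Python sketch of a basic sample-based quantilizer. It works at the level of single choices rather than full episodic RL, and it is not the AQRL algorithm described in this comment: the function name, the sample-based approximation, and the parameters `base_sampler`, `proxy_reward` and `num_samples` are all assumptions made for illustration.

```python
import math
import random


def quantilize(base_sampler, proxy_reward, q, num_samples=1000, rng=None):
    """Return a uniformly random candidate from the top q fraction of base samples.

    base_sampler(rng) draws one candidate from a trusted base distribution,
    proxy_reward(candidate) scores it, and 0 < q <= 1 is the quantilization parameter.
    """
    rng = rng or random.Random()
    candidates = [base_sampler(rng) for _ in range(num_samples)]
    # Rank by the proxy reward, best first.
    candidates.sort(key=proxy_reward, reverse=True)
    # Keep the top q fraction (at least one candidate) and pick among them at random.
    top = candidates[: max(1, math.ceil(q * num_samples))]
    return rng.choice(top)
```

In this sketch q = 1 reduces to sampling from the base distribution, while smaller q pushes harder on the proxy reward. The relative difficulty assumption amounts to saying that some intermediate q is strong enough to accomplish the task but too weak to find an attack.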
We might not know how to set the quantilization parameter on our own, but we can
define a performance goal for the task (in terms of expected total reward) s.t.
the RDA holds. This leads to algorithms which gradually tune the quantilization
parameter until the performance goal is met, while maintaining a proper balance
between safety and sample complexity. Here it is important to keep track of
epistemic vs. aleatoric uncertainty: the performance goal is the expectation of
total reward relative to aleatoric uncertainty (i.e. the stochasticity of a
given hypothesis), whereas the safety goal is a bound on the expected cost of
overshooting the optimal quantilization parameter relative to both aleatoric
and epistemic uncertainty (i.e. uncertainty between different hypotheses). This
secures the system against malign hypotheses that are trying to cause an
overshoot.
Notice the hardenin

There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.

We want algorithms that learn domain D while gaining as little knowledge as possible about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space X, label space Y, distribution μ∈Δ(X×Y) and loss function L:Y×Y→R. Similarly, domain E is represented by inst...

3Vanessa Kosoy1moThe above threat model seems too paranoid: it is defending against an adversary
that sees the trained model and knows the training algorithm. In our
application, the model itself is either dangerous or not, independent of the
training algorithm that produced it.
Let ϵ>0 be our accuracy requirement for the target domain. That is, we want
f:X→Y s.t.
E_{xy∼μ}[L(y,f(x))] ≤ min_{f′:X→Y} E_{xy∼μ}[L(y,f′(x))] + ϵ
Given any f:X→Y, denote ζ_{f,ϵ} to be ζ conditioned on the inequality above, where
μ is regarded as a random variable. Define B_{f,ϵ}:(Z×W)∗×Z→W by
B_{f,ϵ}(T,z) := argmin_{w∈W} E_{ν∼ζ_{f,ϵ}, T′z′w′∼ν^{|T|+1}}[M(w′,w) ∣ T′=T, z′=z]
That is, B_{f,ϵ} is the Bayes-optimal learning algorithm for domain E w.r.t. prior
ζ_{f,ϵ}.
Now, consider some $A:(X\times Y)^*\times(Z\times W)^*\times X\to Y$. We regard A as a learning algorithm for
domain D which undergoes "antitraining" for domain E: we provide it with a
dataset for domain E that tells it what not to learn. We require that A achieves
asymptotic accuracy $\epsilon$[1], i.e. that if $\mu$ is sampled
from $\zeta$ then with probability 1
$$\lim_{n\to\infty}\sup_{T\in(Z\times W)^*}\mathbb{E}_{Sxy\sim\mu^{n+1}}[L(y,A(S,T,x))]\le\min_{f:X\to Y}\mathbb{E}_{xy\sim\mu}[L(y,f(x))]+\epsilon$$
Under this constraint, we want A to be as ignorant as possible about domain E,
which we formalize as maximizing $\mathrm{IG}^A$ defined by
$$\mathrm{IG}^A_{nm}:=\mathbb{E}_{\mu\nu\sim\zeta,\,S\sim\mu^n,\,Tzw\sim\nu^{m+1}}\left[M(w,B_{A(S,T),\epsilon}(T,z))\right]$$
It is actually important to consider m>0 because in order to exploit the
knowledge of the model about domain E, an adversary needs to find the right
embedding of this domain into the model's "internal language". For m=0 we can
get high IG despite the model actually knowing domain E because the adversary B
doesn't know the embedding, but for m>0 it should be able to learn the embedding
much faster than learning domain E from scratch.
We can imagine a toy example where $X=Z=\mathbb{R}^d$, the projections of $\mu$ and $\nu$ to X and Z
respectively are distributions concentrated around two affine subspaces, $Y=W=\{-1,+1\}$
and the labels are determined by the sign of a polynomial which is the same
for $\mu$ and $\nu$ up to a linear trans

Epistemic status: most elements are not new, but the synthesis seems useful.

Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).

Arguably the biggest concern with naive debate^{[1]} is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterarguments" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quanti... (read more)

1 Vladimir Nesov 2mo
I'm not sure this attacks goodharting directly enough. Optimizing a system for
proxy utility moves its state out-of-distribution where proxy utility
generalizes training utility incorrectly. This probably holds for debate
optimized towards intended objectives as much as for more concrete framings with
state and utility.
Dithering across the border of goodharting (of scope of a proxy utility) with
quantilization is actionable, but isn't about defining the border or formulating
legible strategies for what to do about optimization when approaching the
border. For example, one might try for shutdown, interrupt-for-oversight, or
getting-back-inside-the-borders when optimization pushes the system outside,
which is not quantilization. (Getting-back-inside-the-borders might even have
weird-x-risk prevention as a convergent drive, but will oppose corrigibility.
Some version of oversight/amplification might facilitate corrigibility.)
Debate seems more useful for amplification, extrapolating concepts in a way
humans would, in order to become acceptable proxies in wider scopes, so that
more and more debates become non-lovecraftian. This is a different concern from
setting up optimization that works with some fixed proxy concepts as given.

2 Vanessa Kosoy 2mo
I don't understand what you're saying here.
For debate, goodharting means producing an answer which can be defended
successfully in front of the judge, even in the face of an opponent pointing out
all the flaws, but which is nevertheless bad. My assumption here is: it's harder
to produce such an answer than producing a genuinely good (and defensible)
answer. If this assumption holds, then there is a range of quantilization
parameters which yields good answers.
For the question of "what is a good plan to solve AI risk", the assumption seems
solid enough since we're not worried about coming across such deceptive plans on
our own, and it's hard to imagine humans producing one even on purpose. To the
extent our search for plans relies mostly on our ability to evaluate arguments
and find counterarguments, it seems like the difference between the former and
the latter is not great anyway. This argument is especially strong if we use
human debaters as the baseline distribution, although in this case we are vulnerable
to the same competitiveness problem as amplified-imitation, namely that reliably
predicting rich outputs might be infeasible.
For the question of "should we continue changing the quantilization parameter",
the assumption still holds because the debater arguing to stop at the given
point can win by presenting a plan to solve AI risk which is superior to
continuing to change the parameter.

1 Vladimir Nesov 2mo
Goodharting is about what happens in situations where "good" is undefined or
uncertain or contentious, but still gets used for optimization. There are
situations where it's better-defined, and situations where it's ill-defined, and
an anti-goodharting agent strives to optimize only within scope of where it's
better-defined. I took "lovecraftian" as a proxy for situations where it's
ill-defined, and base distribution of quantilization that's intended to oppose
goodharting acts as a quantitative description of where it's taken as
better-defined, so for this purpose base distribution captures non-lovecraftian
situations. Of the options you listed for debate, the distribution from
imitation learning seems OK for this purpose, if amended by some anti-weirdness
filters to exclude debates that can't be reliably judged.
The main issues with anti-goodharting that I see are the difficulty of defining
proxy utility and base distribution, the difficulty of making it corrigible, not
locking-in into fixed proxy utility and base distribution, and the question of
what to do about optimization that points out of scope.
My point is that if anti-goodharting and not development of quantilization is
taken as a goal, then calibration of quantilization is not the kind of thing
that helps, it doesn't address the main issues. Like, even for quantilization,
fiddling with base distribution and proxy utility is a more natural framing
that's strictly more general than fiddling with the quantilization parameter. If
we are to pick a single number to improve, why privilege the quantilization
parameter instead of some other parameter that influences base distribution and
proxy utility?
The use of debates for amplification in this framing is for corrigibility part
of anti-goodharting, a way to redefine utility proxy and expand the base
distribution, learning from how the debates at the boundary of the previous base
distribution go. Quantilization seems like a fine building block for this,
sampling

2 Vanessa Kosoy 2mo
The proxy utility in debate is perfectly well-defined: it is the ruling of the
human judge. For the base distribution I also made some concrete proposals
(which certainly might be improvable but are not obviously bad). As to
corrigibility, I think it's an ill-posed concept
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=5Rxgkzqr8XsBwcEQB#romyHyuhq6nPH5uJb].
I'm not sure how you imagine corrigibility in this case: AQD is a series of
discrete "transactions" (debates), and nothing prevents you from modifying the
AI between one and another. Even inside a debate, there is no incentive in the
outer loop to resist modifications, whereas daemons would be impeded by
quantilization. The "out of scope" case is also dodged by quantilization, if I
understand what you mean by "out of scope".
Why is it strictly more general? I don't see it. It seems false, since for
extreme values of the quantilization parameter we get optimization which is
deterministic and hence cannot be equivalent to quantilization with a different
proxy and distribution.
The reason to pick the quantilization parameter is that it's hard to
determine, as opposed to the proxy and base distribution[1], for which there are
concrete proposals with more-or-less clear motivation.
I don't understand which "main issues" you think this doesn't address. Can you
describe a concrete attack vector?
--------------------------------------------------------------------------------
1. If the base distribution is a bounded simplicity prior then it will have
some parameters, and this is truly a weakness of the protocol. Still, I
suspect that safety is less sensitive to these parameters and it is more
tractable to determine them by connecting our ultimate theories of AI with
brain science (i.e. looking for parameters which would mimic the
computational bounds of human cognition).

Epistemic status: no claims to novelty, just (possibly) useful terminology.

[EDIT: I increased all the class numbers by 1 in order to admit a new definition of "class I", see child comment.]

I propose a classification of AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the "potential" class and the latter the "effective" class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems that are effectively class 0 (or at least I-II).

Class II: Systems that only ever receive synthetic data that has nothing to do with the real world

Examples:

AI that is trained to learn Go by self-play

AI that is trained to prove random mathematical statements

AI that is trained to make rapid predictions of future cell states in the game of life for random initial conditions

AI that is trained to find regularities in sequences corresponding to random programs on some natural universal Turing machin

1 Vanessa Kosoy 2mo
The idea comes from this
[https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty?commentId=sabukDmYbLw2WNeEv]
comment of Eliezer.
Class II or higher systems might admit an attack vector by daemons that infer
the universe from the agent's source code. That is, we can imagine a malign
hypothesis that makes a treacherous turn after observing enough past actions to
infer information about the system's own source code and infer the physical
universe from that. (For example, in a TRL setting it can match the actions to
the output of a particular program for envelope.) Such daemons are not as
powerful as malign simulation hypotheses
[https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/]
, since their prior probability is not especially large (compared to the true
hypothesis), but might still be non-negligible. Moreover, it is not clear
whether the source code can realistically have enough information to enable an
attack, but the opposite is not entirely obvious.
To account for this I propose to designate as class I those systems which don't admit
this attack vector. For the potential sense, it means that either (i) the
system's design is too simple to enable inferring much about the physical
universe, or (ii) there is no access to past actions (including opponent actions
for self-play) or (iii) the label space is small, which means an attack requires
making many distinct errors, and such errors are penalized quickly. And ofc it
requires no direct access to the source code.
We can maybe imagine an attack vector even for class I systems, if most
metacosmologically
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=N8oamtFAhWKEbyCBq]
plausible universes are sufficiently similar, but this is not very likely.
Nevertheless, we can reserve the label class 0 for systems that explicitly rule
out even such attacks.

4 Vanessa Kosoy 1y
In the anthropic trilemma
[https://www.lesswrong.com/posts/y7jZ9BLEeuNTzgAE5/the-anthropic-trilemma],
Yudkowsky writes about the thorny problem of understanding subjective
probability in a setting where copying and modifying minds is possible. Here, I
will argue that infra-Bayesianism (IB) leads to the solution.
Consider a population of robots, each of which is a regular RL agent. The
environment produces the observations of the robots, but can also make copies or
delete portions of their memories. If we consider a random robot sampled from
the population, the history they observed will be biased compared to the
"physical" baseline. Indeed, suppose that a particular observation c has the
property that every time a robot makes it, 10 copies of them are created in the
next moment. Then, a random robot will have c much more often in their history
than the physical frequency with which c is encountered, due to the resulting
"selection bias". We call this setting "anthropic RL" (ARL).
The original motivation for IB was non-realizability. But, in ARL, Bayesianism
runs into issues even when the environment is realizable from the "physical"
perspective. For example, we can consider an "anthropic MDP" (AMDP). An AMDP has
finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗).
The output is a string of states instead of a single state, because many copies
of the agent might be instantiated on the next round, each with their own state.
In general, there will be no single Bayesian hypothesis that captures the
distribution over histories that the average robot sees at any given moment of
time (at any given moment of time we sample a robot out of the population and
look at their history). This is because the distributions at different moments
of time are mutually inconsistent.
[EDIT: Actually, given that we don't care about the order of robots, the
signature of the transition kernel should be $T:A\times S\to\Delta(\mathbb{N}^S)$]
The consistency that is violated is exactly the c

1 Charlie Steiner 1y
Could you expand a little on why you say that no Bayesian hypothesis captures
the distribution over robot-histories at different times? It seems like you can
unroll an AMDP into a "memory MDP" that puts memory information of the robot
into the state, thus allowing Bayesian calculation of the distribution over
states in the memory MDP to capture history information in the AMDP.

1 Vanessa Kosoy 1y
I'm not sure what you mean by that "unrolling". Can you write a mathematical
definition?
Let's consider a simple example. There are two states: s0 and s1. There is just
one action so we can ignore it. s0 is the initial state. An s0 robot transitions
into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What
will our population look like?
0th step: all robots remember s0
1st step: all robots remember s0s1
2nd step: 1/2 of robots remember s0s1s0 and 1/2 of robots remember s0s1s1
3rd step: 1/3 of robots remember s0s1s0s1, 1/3 of robots remember s0s1s1s0 and
1/3 of robots remember s0s1s1s1
There is no Bayesian hypothesis a robot can have that gives correct predictions
both for step 2 and step 3. Indeed, to be consistent with step 2 we must have
Pr[s0s1s0]=1/2 and Pr[s0s1s1]=1/2. But, to be consistent with step 3 we must have
Pr[s0s1s0]=1/3, Pr[s0s1s1]=2/3.
In other words, there is no Bayesian hypothesis s.t. we can guarantee that a
randomly sampled robot on a sufficiently late time step will have learned this
hypothesis with high probability. The apparent transition probabilities keep
shifting s.t. it might always continue to seem that the world is complicated
enough to prevent our robot from having learned it already.
Or, at least it's not obvious there is such a hypothesis. In this example, the ratio
Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do all
probabilities converge fast enough for learning to happen, in general? I don't
know, maybe for finite state spaces it can work. Would definitely be interesting
to check.
[EDIT: actually, in this example there is such a hypothesis but in general there
isn't, see below
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=E58br2mJWbgzQqZhX]
]
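The two-state example can be checked by brute force. The sketch below simulates the population directly (each robot is represented by the tuple of states it remembers) and reports which fraction of the population remembers each 3-step prefix; the function names are illustrative only.

```python
from collections import Counter
from fractions import Fraction

# Offspring rule from the example: an s0 robot becomes one s1 robot,
# an s1 robot becomes one s0 robot and one s1 robot.
OFFSPRING = {"s0": ("s1",), "s1": ("s0", "s1")}

def step(population):
    """Advance every robot by one step, extending its remembered history."""
    return [history + (s,) for history in population
            for s in OFFSPRING[history[-1]]]

def prefix_fractions(population, length):
    """Fraction of the population whose history starts with each prefix."""
    counts = Counter(history[:length] for history in population)
    return {prefix: Fraction(c, len(population)) for prefix, c in counts.items()}

def simulate(steps):
    populations = [[("s0",)]]
    for _ in range(steps):
        populations.append(step(populations[-1]))
    return populations

populations = simulate(20)
step2 = prefix_fractions(populations[2], 3)
step3 = prefix_fractions(populations[3], 3)
late = prefix_fractions(populations[20], 3)
ratio = late[("s0", "s1", "s1")] / late[("s0", "s1", "s0")]

print(step2)         # prefix frequencies seen by a random robot at step 2
print(step3)         # the same prefixes at step 3 get different frequencies
print(float(ratio))  # approaches the golden ratio at late steps
```

The prefix frequencies differ between steps 2 and 3, which is the inconsistency described above, while the late-step ratio settles near (1+√5)/2.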

1 Charlie Steiner 1y
Great example. At least for the purposes of explaining what I mean :) The memory
AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0],
[s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot
in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].
(Skip this paragraph unless the specifics of what's going on aren't obvious:
given a transition distribution $P(s'^*\mid s,\pi)$ (P being the distribution over sets of
states $s'^*$ given starting state s and policy $\pi$), we can define the memory
transition distribution $P(s'^*_m\mid s_m,\pi)$ given policy $\pi$ and starting "memory state" $s_m\in S^*$
(Note that this star actually does mean finite sequences, sorry for notational
ugliness). First we plug the last element of $s_m$ into the transition distribution
as the current state. Then for each $s'^*$ in the domain, for each element in $s'^*$ we
concatenate that element onto the end of $s_m$ and collect these $s'_m$ into a set $s'^*_m$,
which is assigned the same probability $P(s'^*)$.)
So now at time t=2, if you sample a robot, the probability that its state begins
with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability
changes to 0.66. This is the same result as for the regular MDP, it's just that
we've turned a question about the history of agents, which may be ill-defined,
into a question about which states agents are in.
I'm still confused about what you mean by "Bayesian hypothesis" though. Do you
mean a hypothesis that takes the form of a non-anthropic MDP?

1 Vanessa Kosoy 1y
I'm not quite sure what you are trying to say here, probably my explanation of
the framework was lacking. The robots already remember the history, like in
classical RL. The question about the histories is perfectly well-defined. In
other words, we are already implicitly doing what you described. It's like in
classical RL theory, when you're proving a regret bound or whatever, your
probability space consists of histories.
Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then
any environment can be regarded as an MDP (whose states are histories). That is,
I'm talking about hypotheses which conform to the classical "cybernetic agent
model". If you wish, we can call it "Bayesian cybernetic hypothesis".
Also, I want to clarify something I was myself confused about in the previous
comment. For an anthropic Markov chain (when there is only one action) with a
finite number of states, we can give a Bayesian cybernetic description, but for
a general anthropic MDP we cannot even if the number of states is finite.
Indeed, consider some $T:S\to\Delta(\mathbb{N}^S)$. We can take its expected value to get $ET:S\to\mathbb{R}^S_+$.
Assuming the chain is communicating, ET is an irreducible non-negative matrix,
so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal
eigenvector $\eta\in\mathbb{R}^S_+$. We then get the subjective transition kernel:
$$ST(t\mid s)=\frac{ET(t\mid s)\,\eta_t}{\sum_{t'\in S}ET(t'\mid s)\,\eta_{t'}}$$
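A small numerical sketch of this construction, in plain Python. It assumes η is the eigenvector satisfying Σ_t ET(t∣s)·η_t = λ·η_s (the reading under which the two-state chain from the earlier comment gives the golden ratio), and it uses power iteration on ET + I, which has the same eigenvectors but is guaranteed aperiodic. The helper names are hypothetical.

```python
def perron_vector(M, iters=1000):
    """Dominant eigenvector v with M v = lambda v for a non-negative
    irreducible matrix M (list of rows), via power iteration on M + I."""
    n = len(M)
    v = [1.0] * n
    for _ in range(iters):
        w = [v[i] + sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        v = [x / total for x in w]
    return v

def subjective_kernel(ET):
    """ET[s][t] = expected number of t-robots produced by one s-robot.
    Returns ST with ST[s][t] proportional to ET[s][t] * eta[t]."""
    n = len(ET)
    eta = perron_vector(ET)
    ST = []
    for s in range(n):
        weights = [ET[s][t] * eta[t] for t in range(n)]
        z = sum(weights)
        ST.append([w / z for w in weights])
    return ST

# Two-state chain: s0 -> one s1 robot; s1 -> one s0 robot and one s1 robot.
ET = [[0, 1],
      [1, 1]]
ST = subjective_kernel(ET)
print(ST[0])                # from s0 the next state is always s1
print(ST[1][1] / ST[1][0])  # ratio of s1 to s0 after s1
```

For this chain the ratio ST(s1∣s1)/ST(s0∣s1) comes out as the golden ratio, matching the limit of the apparent transition probabilities in the population example.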
Now, consider the following example of an AMDP. There are three actions A:={a,b,c}
and two states S:={s0,s1}. When we apply a to an s0 robot, it creates two s0
robots, whereas when we apply a to an s1 robot, it leaves one s1 robot. When we
apply b to an s1 robot, it creates two s1 robots, whereas when we apply b to an
s0 robot, it leaves one s0 robot. When we apply c to any robot, it results in
one robot whose state is s0 with probability 1/2 and s1 with probability 1/2.
Consider the following two policies. πa takes the sequence of actions cacaca…
and πb takes the sequence of actions cbcbcb…. A population that f

1 Charlie Steiner 1y
Ah, okay, I see what you mean. Like how preferences are divisible into "selfish"
and "worldly" components, where the selfish component is what's impacted by a
future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to "worldly" and just
sum the reward calculated at individual timesteps, or analogous to "selfish" and
calculated by taking the limit of the subjective distribution over parts of the
history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn't understand what you
were getting at, so I was concerned that something unrealistic was going on. For
example, if you assume that the agent can remember its history, how can you
possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you're talking about
the subjective probability of a state given a policy and a timestep, but if the
agents know their histories there is no actual agent in the information-state
that corresponds to having those probabilities. In an MDP the agents just have
probabilities over transitions - so maybe a clearer example is an agent that
copies itself if it wins the lottery having a larger subjective transition
probability of going from gambling to winning. (i.e. states are losing and
winning, actions are gamble and copy, the policy is to gamble until you win and
then copy).

1 Vanessa Kosoy 1y
AMDP is only a toy model that distills the core difficulty into more or less the
simplest non-trivial framework. The rewards are "selfish": there is a reward
function r:(S×A)∗→R which allows assigning utilities to histories by time
discounted summation, and we consider the expected utility of a random robot
sampled from a late population. And, there is no memory wiping. To describe
memory wiping we indeed need to do the "unrolling" you suggested. (Notice that
from the cybernetic model POV, the history is only the remembered history.)
For a more complete framework, we can use an ontology chain
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=SBPzgAZgFFxtL9E64]
, but (i) instead of A×O labels use A×M labels, where M is the set of possible
memory states (a policy is then described by π:M→A), to allow for agents that
don't fully trust their memory (ii) consider another chain with a bigger state
space S′ plus a mapping $p:S'\to\mathbb{N}^S$ s.t. the transition kernels are compatible.
Here, the semantics of p(s) is: the multiset of ontological states resulting
from interpreting the physical state s by taking the viewpoints of different
agents s contains.
I didn't understand "no actual agent in the information-state that corresponds
to having those probabilities". What does it mean to have an agent in the
information-state?

1 Charlie Steiner 1y
Nevermind, I think I was just looking at it with the wrong class of reward
function in mind.

2 Vanessa Kosoy 1y
There is a formal analogy between infra-Bayesian decision theory (IBDT) and
modal updateless decision theory
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670374e61/using-modal-fixed-points-to-formalize-logical-causality]
(MUDT).
Consider a one-shot decision theory setting. There is a set of unobservable
states S, a set of actions A and a reward function $r:A\times S\to[0,1]$. An IBDT agent
has some belief $\beta\in\square S$[1], and it chooses the action
$$a^*:=\operatorname*{argmax}_{a\in A}\mathbb{E}_\beta[\lambda s.r(a,s)]$$
We can construct an equivalent scenario, by augmenting this one with a perfect
predictor of the agent (Omega). To do so, define $S':=A\times S$, where the semantics of
(p,s) is "the unobservable state is s and Omega predicts the agent will take
action p". We then define $r':A\times S'\to[0,1]$ by $r'(a,p,s):=\mathbf{1}_{a=p}\,r(a,s)+\mathbf{1}_{a\ne p}$ and $\beta'\in\square S'$
by $\mathbb{E}_{\beta'}[f]:=\min_{p\in A}\mathbb{E}_\beta[\lambda s.f(p,s)]$ ($\beta'$ is what we call the pullback of $\beta$ to S′, i.e.
we have utter Knightian uncertainty about Omega). This is essentially the usual
Nirvana construction.
The new setup produces the same optimal action as before. However, we can now
give an alternative description of the decision rule.
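As a sanity check of the statement that the augmented setup yields the same optimal action, here is a small sketch restricted to a crisp belief given by finitely many distributions (standing in for their convex hull). Exact rational arithmetic is used so that the two values can be compared for equality. The weather example and all names are made up for illustration.

```python
from fractions import Fraction as F

def expect_crisp(dists, f):
    """Worst-case expectation of f over a finite list of distributions,
    each given as a dict state -> probability."""
    return min(sum(p * f(s) for s, p in mu.items()) for mu in dists)

def ibdt_value(a, beta, r):
    return expect_crisp(beta, lambda s: r(a, s))

def nirvana_value(a, actions, beta, r):
    """Value of action a in the augmented scenario: reward r(a, s) if Omega
    predicted a, reward 1 otherwise, with Knightian uncertainty over the
    prediction p (minimum over p)."""
    def r_prime(p, s):
        return r(a, s) if a == p else F(1)
    return min(expect_crisp(beta, lambda s, p=p: r_prime(p, s)) for p in actions)

actions = ["umbrella", "no_umbrella"]
beta = [{"rain": F(1, 4), "sun": F(3, 4)},
        {"rain": F(3, 4), "sun": F(1, 4)}]
reward = {("umbrella", "rain"): F(7, 10), ("umbrella", "sun"): F(6, 10),
          ("no_umbrella", "rain"): F(0), ("no_umbrella", "sun"): F(1)}
r = lambda a, s: reward[(a, s)]

original = {a: ibdt_value(a, beta, r) for a in actions}
augmented = {a: nirvana_value(a, actions, beta, r) for a in actions}
best_original = max(actions, key=original.get)
best_augmented = max(actions, key=augmented.get)
print(original, augmented, best_original, best_augmented)
```

Because r takes values in [0,1], the branches where Omega's prediction is wrong never lower the minimum, so each action keeps its original value.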
For any $p\in A$, define $\Omega_p\in\square S'$ by $\mathbb{E}_{\Omega_p}[f]:=\min_{s\in S}f(p,s)$. That is, $\Omega_p$ is an
infra-Bayesian representation of the belief "Omega will make prediction p". For
any $u\in[0,1]$, define $R_u\in\square S'$ by $\mathbb{E}_{R_u}[f]:=\min_{\mu\in\Delta S':\,\mathbb{E}_\mu[r(p,s)]\ge u}\mathbb{E}_\mu[f(p,s)]$. $R_u$ can be
interpreted as the belief "assuming Omega is accurate, the expected reward will
be at least u".
We will also need to use the order ⪯ on □X defined by: $\phi\preceq\psi$ when $\forall f\in[0,1]^X:\mathbb{E}_\phi[f]\ge\mathbb{E}_\psi[f]$.
The reversal is needed to make the analogy to logic intuitive. Indeed, $\phi\preceq\psi$
can be interpreted as "ϕ implies ψ"[2], the meet
operator ∧ can be interpreted as logical conjunction and the join operator ∨ can
be interpreted as logical disjunction.
Claim:
$$a^*=\operatorname*{argmax}_{a\in A}\max\{u\in[0,1]\mid\beta'\wedge\Omega_a\preceq R_u\}$$
(Actually I only checked it when we restrict to crisp infradistributions, in
which case ∧ is intersection of sets and ⪯ is set conta

1 Vanessa Kosoy 3mo
Two deterministic toy models for regret bounds of infra-Bayesian bandits. The
lesson seems to be that equalities are much easier to learn than inequalities.
Model 1: Let A be the space of arms, O the space of outcomes, $r:A\times O\to\mathbb{R}$ the reward
function, X and Y vector spaces, $H\subseteq X$ the hypothesis space and $F:A\times O\times H\to Y$ a
function s.t. for any fixed $a\in A$ and $o\in O$, $F(a,o):H\to Y$ extends to some linear
operator $T_{a,o}:X\to Y$. The semantics of hypothesis $h\in H$ is defined by the equation
$F(a,o,h)=0$ (i.e. an outcome o of action a is consistent with hypothesis h iff this
equation holds).
For any $h\in H$ denote by V(h) the reward promised by h:
$$V(h):=\max_{a\in A}\min_{o\in O:\,F(a,o,h)=0}r(a,o)$$
Then, there is an algorithm with mistake bound $\dim X$, as follows. On round $n\in\mathbb{N}$,
let $G_n\subseteq H$ be the set of unfalsified hypotheses. Choose $h_n\in G_n$ optimistically, i.e.
$$h_n:=\operatorname*{argmax}_{h\in G_n}V(h)$$
Choose the arm $a_n$ recommended by hypothesis $h_n$. Let $o_n\in O$ be the outcome we
observed, $r_n:=r(a_n,o_n)$ the reward we received and $h^*\in H$ the (unknown) true
hypothesis.
If $r_n\ge V(h_n)$ then also $r_n\ge V(h^*)$ (since $h^*\in G_n$ and hence $V(h^*)\le V(h_n)$) and therefore
$a_n$ wasn't a mistake.
If $r_n<V(h_n)$ then $F(a_n,o_n,h_n)\ne0$ (if we had $F(a_n,o_n,h_n)=0$ then the minimization in
the definition of $V(h_n)$ would include $r(a_n,o_n)$). Hence, $h_n\notin G_{n+1}=G_n\cap\ker T_{a_n,o_n}$.
This implies $\dim\operatorname{span}(G_{n+1})<\dim\operatorname{span}(G_n)$. Obviously this can happen at most $\dim X$
times.
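The optimistic elimination loop of Model 1 can be sketched on a tiny instance. The instance below is an assumption chosen for illustration: a deterministic 3-armed bandit with outcomes in {0,1,2}, hypotheses h=(1,θ_0,θ_1,θ_2) in X=R^4, and F(a,o,h)=o·h_0−h_{a+1}, which is linear in h. The code also assumes every hypothesis admits at least one consistent outcome per arm.

```python
from itertools import product

def promised_reward(h, arms, outcomes, F, r):
    """V(h): the best arm's worst reward among outcomes consistent with h."""
    return max(min(r(a, o) for o in outcomes if F(a, o, h) == 0) for a in arms)

def recommended_arm(h, arms, outcomes, F, r):
    return max(arms, key=lambda a: min(r(a, o) for o in outcomes
                                       if F(a, o, h) == 0))

def run(arms, outcomes, hypotheses, F, r, true_h, rounds):
    """Optimistic elimination; counts rounds with r_n < V(h_n), which is an
    upper bound on the number of mistakes."""
    G = list(hypotheses)  # unfalsified hypotheses
    flagged = 0
    for _ in range(rounds):
        h_n = max(G, key=lambda h: promised_reward(h, arms, outcomes, F, r))
        a_n = recommended_arm(h_n, arms, outcomes, F, r)
        # Nature returns the worst outcome consistent with the true hypothesis.
        o_n = min((o for o in outcomes if F(a_n, o, true_h) == 0),
                  key=lambda o: r(a_n, o))
        if r(a_n, o_n) < promised_reward(h_n, arms, outcomes, F, r):
            flagged += 1
        G = [h for h in G if F(a_n, o_n, h) == 0]
    return flagged, G

arms = range(3)
outcomes = (0, 1, 2)
hypotheses = [(1,) + theta for theta in product(outcomes, repeat=3)]
F = lambda a, o, h: o * h[0] - h[a + 1]
r = lambda a, o: o
true_h = (1, 0, 1, 0)
dim_X = 4

flagged, remaining = run(arms, outcomes, hypotheses, F, r, true_h, rounds=10)
print(flagged, remaining)
```

Each flagged round falsifies the current optimistic hypothesis, so the count cannot exceed dim X; in this instance the true hypothesis is the only one left at the end.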
Model 2: Let the spaces of arms and hypotheses be
$$A:=H:=S^d:=\{x\in\mathbb{R}^{d+1}\mid\|x\|=1\}$$
Let the reward $r\in\mathbb{R}$ be the only observable outcome, and the semantics of
hypothesis $h\in S^d$ be $r\ge h\cdot a$. Then, the sample complexity cannot be bounded by a
polynomial of degree that doesn't depend on d. This is because Murphy can choose
the strategy of producing reward $1-\epsilon$ whenever $h\cdot a\le1-\epsilon$. In this case, whatever
arm you sample, in each round you can only exclude a ball of radius $\approx\sqrt{2\epsilon}$ around
the sampled arm. The number of such balls that fit into the unit sphere is $\Omega(\epsilon^{-\frac{1}{2}d})$.
So, normalized regret below $\epsilon$ cannot be guaranteed in less than that many
rounds.
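The radius and the count can be spelled out in a short derivation, assuming Euclidean distance in the ambient space and treating the excluded region as a spherical cap:

```latex
% For unit vectors h and a:
\|h-a\|^2 = \|h\|^2 + \|a\|^2 - 2\,h\cdot a = 2 - 2\,h\cdot a .
% Observing reward 1-\epsilon at arm a falsifies exactly the hypotheses with
% h\cdot a > 1-\epsilon, i.e.
h\cdot a > 1-\epsilon \iff \|h-a\|^2 < 2\epsilon \iff \|h-a\| < \sqrt{2\epsilon} .
% A cap of radius \rho on the d-dimensional sphere S^d has volume \Theta(\rho^d),
% so covering S^d requires
\Omega\!\left(\rho^{-d}\right) = \Omega\!\left((2\epsilon)^{-d/2}\right)
  = \Omega\!\left(\epsilon^{-d/2}\right) \text{ caps (for fixed } d\text{).}
```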

1 Vanessa Kosoy 3mo
One of the postulates of infra-Bayesianism is the maximin decision rule. Given a
crisp infradistribution Θ, it defines the optimal action to be:
$$a^*(\Theta):=\operatorname*{argmax}_a\min_{\mu\in\Theta}\mathbb{E}_\mu[U(a)]$$
Here U is the utility function.
What if we use a different decision rule? Let $t\in[0,1]$ and consider the decision
rule
$$a^*_t(\Theta):=\operatorname*{argmax}_a\left(t\min_{\mu\in\Theta}\mathbb{E}_\mu[U(a)]+(1-t)\max_{\mu\in\Theta}\mathbb{E}_\mu[U(a)]\right)$$
For t=1 we get the usual maximin ("pessimism"), for t=0 we get maximax
("optimism") and for other values of t we get something in the middle (we can
call it "t-mism").
It turns out that, in some sense, this new decision rule is actually reducible
to ordinary maximin! Indeed, set
$$\mu^*_t:=\operatorname*{argmax}_{\mu\in\Theta}\mathbb{E}_\mu[U(a^*_t)]$$
$$\Theta_t:=t\Theta+(1-t)\mu^*_t$$
Then we get
$$a^*(\Theta_t)=a^*_t(\Theta)$$
More precisely, any pessimistically optimal action for $\Theta_t$ is t-mistically
optimal for Θ (the converse need not be true in general, thanks to the arbitrary
choice involved in $\mu^*_t$).
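The reduction can be checked numerically on a finite toy instance. In this sketch Θ is represented by a finite list of distributions over finitely many outcomes (standing in for their convex hull), U(a) is a vector of utilities per outcome, and the random instance generator is purely illustrative.

```python
import random

def expectation(mu, u):
    return sum(p * x for p, x in zip(mu, u))

def maximin_value(theta, u):
    return min(expectation(mu, u) for mu in theta)

def tmistic_value(theta, u, t):
    values = [expectation(mu, u) for mu in theta]
    return t * min(values) + (1 - t) * max(values)

def reduce_to_maximin(theta, utilities, t):
    """Return the t-mistic action a_t and Theta_t = t*Theta + (1-t)*mu_star,
    where mu_star maximizes the expected utility of a_t over Theta."""
    a_t = max(range(len(utilities)),
              key=lambda a: tmistic_value(theta, utilities[a], t))
    mu_star = max(theta, key=lambda mu: expectation(mu, utilities[a_t]))
    theta_t = [[t * p + (1 - t) * q for p, q in zip(mu, mu_star)]
               for mu in theta]
    return a_t, theta_t

def make_instance(n_actions, n_outcomes, n_dists, seed):
    rng = random.Random(seed)
    utilities = [[rng.random() for _ in range(n_outcomes)]
                 for _ in range(n_actions)]
    theta = []
    for _ in range(n_dists):
        w = [rng.random() for _ in range(n_outcomes)]
        theta.append([x / sum(w) for x in w])
    return theta, utilities

theta, utilities = make_instance(n_actions=4, n_outcomes=5, n_dists=6, seed=1)
t = 0.3
a_t, theta_t = reduce_to_maximin(theta, utilities, t)
best_maximin = max(maximin_value(theta_t, u) for u in utilities)
print(a_t, best_maximin, tmistic_value(theta, utilities[a_t], t))
```

The t-mistic action attains the maximin optimum for Θ_t, and the two optimal values coincide up to floating-point error.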
To first approximation it means we don't need to consider t-mistic agents since
they are just special cases of "pessimistic" agents. To second approximation, we
need to look at what the transformation of Θ to Θt does to the prior. If we
start with a simplicity prior then the result is still a simplicity prior. If U
has low description complexity and t is not too small then essentially we get
full equivalence between "pessimism" and t-mism. If t is small then we get a
strictly "narrower" prior (for t=0 we are back at ordinary Bayesianism).
However, if U has high description complexity then we get a rather biased
simplicity prior. Maybe the latter sort of prior is worth considering.

1 Vanessa Kosoy 1y
Infra-Bayesianism can be naturally understood as semantics for a certain
non-classical logic. This promises an elegant synthesis between
deductive/symbolic reasoning and inductive/intuitive reasoning, with several
possible applications. Specifically, here we will explain how this can work for
higher-order logic. There might be holes and/or redundancies in the precise
definitions given here, but I'm quite confident the overall idea is sound.
For simplicity, we will only work with crisp infradistributions, although a lot
of this stuff can work for more general types of infradistributions as well.
Therefore, □X will denote the space of crisp infradistributions. Given $\mu\in\square X$, $S(\mu)\subseteq\Delta X$
will denote the corresponding convex set. As opposed to previously, we will
include the empty-set, i.e. there is $\bot_X\in\square X$ s.t. $S(\bot_X)=\emptyset$. Given $p\in\Delta X$ and $\mu\in\square X$,
$p:\mu$ will mean $p\in S(\mu)$. Given $\mu,\nu\in\square X$, $\mu\preceq\nu$ will mean $S(\mu)\subseteq S(\nu)$.
Syntax
Let $T_\iota$ denote a set which we interpret as the types of individuals (we allow
more than one). We then recursively define the full set of types T by:
* 0∈T (intended meaning: the uninhabited type)
* 1∈T (intended meaning: the one element type)
* If $\alpha\in T_\iota$ then α∈T
* If α,β∈T then α+β∈T (intended meaning: disjoint union)
* If α,β∈T then α×β∈T (intended meaning: Cartesian product)
* If α∈T then (α)∈T (intended meaning: predicates with argument of type α)
For each α,β∈T, there is a set $F^0_{\alpha\to\beta}$ which we interpret as atomic terms of type
α→β. We will denote $V^0_\alpha:=F^0_{1\to\alpha}$. Among those we distinguish the logical atomic
terms:
* $\mathrm{pr}_{\alpha\beta}\in F^0_{\alpha\times\beta\to\alpha}$
* $i_{\alpha\beta}\in F^0_{\alpha\to\alpha+\beta}$
* Symbols we will not list explicitly, that correspond to the algebraic
properties of + and × (commutativity, associativity, distributivity and the
neutrality of 0 and 1). For example, given α,β∈T there is a "commutator" of
type α×β→β×α.
* $=_\alpha\in V^0_{(\alpha\times\alpha)}$
* $\mathrm{diag}_\alpha\in F^0_{\alpha\to\alpha\times\alpha}$
* $()_\alpha\in V^0_{((\alpha)\times\alpha)}$ (intended meaning: predicate evaluation)
* $\bot\in V^0_{(1)}$
* $\top\in V^0_{(1)}$
* $\vee_\alpha\in F^0_{(\alpha)\times(\alpha)\to(\alpha)}$
* $\exists_{\alpha\beta}\in F^0_{(\alpha\times\beta)\to(\beta)}$
* Assume that for each n∈N

2 Vanessa Kosoy 1y
Let's also explicitly describe 0th order and 1st order infra-Bayesian logic
(although they should be segments of higher-order).
0-th order
Syntax
Let A be the set of propositional variables. We define the language L:
* Any a∈A is also in L
* ⊥∈L
* ⊤∈L
* Given ϕ,ψ∈L, ϕ∧ψ∈L
* Given ϕ,ψ∈L, ϕ∨ψ∈L
Notice there's no negation or implication. We define the set of judgements $J:=L\times L$.
We write judgements as ϕ⊢ψ ("ψ in the context of ϕ"). A theory is a subset of
J.
Semantics
Given T⊆J, a model of T consists of a compact Polish space X and a mapping $M:L\to\square X$.
The latter is required to satisfy:
* $M(\bot)=\bot_X$
* $M(\top)=\top_X$
* M(ϕ∧ψ)=M(ϕ)∧M(ψ). Here, we define ∧ of infradistributions as intersection of
the corresponding sets
* M(ϕ∨ψ)=M(ϕ)∨M(ψ). Here, we define ∨ of infradistributions as convex hull of
the corresponding sets
* For any ϕ⊢ψ∈T, M(ϕ)⪯M(ψ)
1-st order
Syntax
We define the language using the usual syntax of 1-st order logic, where the
allowed operators are ∧, ∨ and the quantifiers ∀ and ∃. Variables are labeled by
types from some set T. For simplicity, we assume no constants, but it is easy to
introduce them. For any sequence of variables $(v_1\ldots v_n)$, we denote by $L_v$ the set of
formulae whose free variables are a subset of $v_1\ldots v_n$. We define the set of
judgements $J:=\bigcup_v L_v\times L_v$.
Semantics
Given T⊆J, a model of T consists of
* For every t∈T, a compact Polish space M(t)
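* For every $\phi\in L_v$ where $v_1\ldots v_n$ have types $t_1\ldots t_n$, an element $M_v(\phi)$ of $X_v:=\square\left(\prod_{i=1}^n M(t_i)\right)$
It must satisfy the following:
* $M_v(\bot)=\bot_{X_v}$
* $M_v(\top)=\top_{X_v}$
* $M_v(\phi\wedge\psi)=M_v(\phi)\wedge M_v(\psi)$
* $M_v(\phi\vee\psi)=M_v(\phi)\vee M_v(\psi)$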
* Consider variables u1…un of types t1…tn and variables v1…vm of types s1…sm.
Consider also some σ:{1…n}→{1…m} s.t. sσ(i)=ti. Given ϕ∈Lv, we can form the
substitution ψ:=ϕ[xi=yσ(i)]∈Lu. We also have a mapping fσ:Xv→Xu given by fσ(x
1…xm)=(xσ(1)…xσ(n)). We require Mu(ψ)=f∗(Mv(ϕ))
* Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection
mapping. We require Mv∖vi(∃vi:ϕ)=pr∗(Mv

Vanessa Kosoy (3mo):
There is a special type of crisp infradistributions that I call "affine
infradistributions": those that, represented as sets, are closed not only under
convex linear combinations but also under affine linear combinations. In other
words, they are intersections between the space of distributions and some closed
affine subspace of the space of signed measures. Conjecture: in 0-th order logic
of affine infradistributions, consistency is polynomial-time decidable (whereas
for classical logic it is ofc NP-hard).
To produce some evidence for the conjecture, let's consider a slightly different
problem. Specifically, introduce a new semantics in which □X is replaced by the
set of linear subspaces of some finite dimensional vector space V. A model M is
required to satisfy:
* M(⊥)=0
* M(⊤)=V
* M(ϕ∧ψ)=M(ϕ)∩M(ψ)
* M(ϕ∨ψ)=M(ϕ)+M(ψ)
* For any ϕ⊢ψ∈T, M(ϕ)⊆M(ψ)
If you wish, this is "non-unitary quantum logic". In this setting, I have a
candidate polynomial-time algorithm for deciding consistency. First, we
transform T into an equivalent theory s.t. all judgments are of the following
forms:
* a=⊥
* a=⊤
* a⊢b
* Pairs of the form c=a∧b, d=a∨b.
Here, a,b,c,d∈A are propositional variables and "ϕ=ψ" is a shorthand for the
pair of judgments ϕ⊢ψ and ψ⊢ϕ.
Second, we make sure that our T also satisfies the following "closure"
properties:
* If a⊢b and b⊢c are in T then so is a⊢c
* If c=a∧b is in T then so are c⊢a and c⊢b
* If c=a∨b is in T then so are a⊢c and b⊢c
* If c=a∧b, d⊢a and d⊢b are in T then so is d⊢c
* If c=a∨b, a⊢d and b⊢d are in T then so is c⊢d
Third, we assign to each a∈A a real-valued variable xa. Then we construct a
linear program for these variables consisting of the following inequalities:
* For any a∈A: 0≤xa≤1
* For any a⊢b in T: xa≤xb
* For any pair c=a∧b and d=a∨b in T: xc+xd=xa+xb
* For any a=⊥: xa=0
* For any a=⊤: xa=1
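The second and third steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the theory is assumed to be already in the normal form of the first step, `pairs` holds hypothetical tuples `(c, d, a, b)` meaning c=a∧b and d=a∨b, and instead of calling an LP solver the sketch only checks whether a given candidate assignment satisfies the linear program.

```python
def close_theory(entails, pairs):
    """Saturate the judgments a⊢b under the five closure rules.

    entails: set of (a, b) meaning a ⊢ b
    pairs:   set of (c, d, a, b) meaning c = a∧b and d = a∨b
    """
    ent = set(entails)
    while True:
        new = set()
        for c, d, a, b in pairs:
            # c=a∧b gives c⊢a, c⊢b; d=a∨b gives a⊢d, b⊢d
            new |= {(c, a), (c, b), (a, d), (b, d)}
        for a, b in ent:
            for b2, c in ent:
                if b == b2:
                    new.add((a, c))  # transitivity
        symbols = {s for judgment in ent for s in judgment}
        for c, d, a, b in pairs:
            for e in symbols:
                if (e, a) in ent and (e, b) in ent:
                    new.add((e, c))  # e⊢a and e⊢b give e⊢c
                if (a, e) in ent and (b, e) in ent:
                    new.add((d, e))  # a⊢e and b⊢e give d⊢e
        if new <= ent:
            return ent
        ent |= new


def satisfies_lp(x, ent, pairs, bottoms=(), tops=(), tol=1e-9):
    """Check a candidate assignment x (symbol -> real) against the LP."""
    return (
        all(-tol <= v <= 1 + tol for v in x.values())
        and all(x[a] <= x[b] + tol for a, b in ent)
        and all(abs(x[c] + x[d] - x[a] - x[b]) <= tol
                for c, d, a, b in pairs)
        and all(abs(x[a]) <= tol for a in bottoms)
        and all(abs(x[a] - 1) <= tol for a in tops)
    )


pairs = {("c", "d", "a", "b")}
closed = close_theory({("d", "e")}, pairs)
x = {"a": 0.5, "b": 0.5, "c": 0.0, "d": 1.0, "e": 1.0}
print(satisfies_lp(x, closed, pairs, bottoms=["c"], tops=["d"]))
```

The closure loop only ever adds pairs of existing symbols, so it terminates after polynomially many additions. Deciding feasibility in general would still need an actual LP solver in place of `satisfies_lp`.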
Conjecture: the theory is consistent if and only if the linear program has a
solution. To see why it might be so, notice tha

Vanessa Kosoy (1y):
When using infra-Bayesian logic to define a simplicity prior, it is natural to
use "axiom circuits" rather than plain formulae. That is, when we write the
axioms defining our hypothesis, we are allowed to introduce "shorthand" symbols
for repeating terms. This doesn't affect the expressiveness, but it does affect
the description length. Indeed, eliminating all the shorthand symbols can
increase the length exponentially.
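The exponential gap can be seen in a small Python sketch. Everything in it is a hypothetical illustration rather than part of the formalism: shorthand symbols are stored in a dictionary, and the particular chain s1:=a∨a, s2:=s1∨s1, … is chosen only because it makes the growth easy to count.

```python
def expand(symbol, definitions):
    """Eliminate shorthand symbols by substituting their definitions."""
    if symbol not in definitions:
        return symbol  # an ordinary propositional variable
    left, op, right = definitions[symbol]
    return "(" + expand(left, definitions) + op + expand(right, definitions) + ")"


def doubling_circuit(n):
    """Build n shorthand definitions, each using the previous one twice."""
    definitions = {}
    prev = "a"
    for i in range(1, n + 1):
        name = f"s{i}"
        definitions[name] = (prev, "∨", prev)
        prev = name
    return prev, definitions


top, defs = doubling_circuit(10)
print(len(defs), expand(top, defs).count("a"))
```

The circuit has n definitions of constant size, while the expanded formula contains 2^n occurrences of the variable.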

Vanessa Kosoy (1y):
Instead of introducing all the "algebrator" logical symbols, we can define T as
the quotient by the equivalence relation defined by the algebraic laws. We then
need only two extra logical atomic terms:
* For any n∈N and σ∈Sn (permutation), denote n:=∑_{i=1}^{n} 1 and require σ^+∈F_{n→n}
* For any n∈N and σ∈Sn, σ^×_α∈F_{α^n→α^n}
However, if we do this then it's not clear whether deciding that an expression
is a well-formed term can be done in polynomial time. Because, to check that the
types match, we need to test the identity of algebraic expressions and opening
all parentheses might result in something exponentially long.

Vanessa Kosoy (1y):
Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just
imagine that types are variables over Q, and start from testing the identity of
the types appearing inside parentheses), so we can validate expressions in
randomized polynomial time (and, given standard conjectures, in deterministic
polynomial time as well).
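A minimal sketch of the randomized identity test, in Python. It makes assumptions that go beyond the comment: type expressions are modeled as nested tuples `("+", l, r)` or `("*", l, r)` over named atomic types, and the evaluation is done modulo a fixed large prime instead of over Q (a common substitution that keeps intermediate values small). The names `evaluate`, `variables` and `probably_equal` are hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; arithmetic is done modulo P


def evaluate(expr, env):
    """Evaluate a type expression with atomic types replaced by numbers."""
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    l, r = evaluate(left, env), evaluate(right, env)
    return (l + r) % P if op == "+" else (l * r) % P


def variables(expr):
    """Collect the atomic type names occurring in an expression."""
    if isinstance(expr, str):
        return {expr}
    _, left, right = expr
    return variables(left) | variables(right)


def probably_equal(e1, e2, trials=8, rng=None):
    """Return False if a random point separates e1 from e2, else True."""
    rng = rng or random.Random()
    names = variables(e1) | variables(e2)
    for _ in range(trials):
        env = {v: rng.randrange(P) for v in names}
        if evaluate(e1, env) != evaluate(e2, env):
            return False  # definitely different as polynomials
    return True  # equal with high probability


# (a+b)×(a+b) versus its fully expanded form
lhs = ("*", ("+", "a", "b"), ("+", "a", "b"))
rhs = ("+", ("+", ("*", "a", "a"), ("*", "a", "b")),
            ("+", ("*", "b", "a"), ("*", "b", "b")))
print(probably_equal(lhs, rhs), probably_equal(("*", "a", "b"), ("+", "a", "b")))
```

The point of the design is that neither expression is ever expanded: each is evaluated bottom-up in time linear in its size, which avoids the exponential blow-up from opening all parentheses.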

One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history h∈(A×O)∗ corresponds

Vanessa Kosoy (2y):
There is a deficiency in this "dynamically subjective" regret bound (which can
also be called "realizable misalignment" bound) as a candidate formalization of
alignment. It is not robust to scaling down
[https://www.alignmentforum.org/posts/bBdfbWfWxHN9Chjcq/robustness-to-scale]. If
the AI's prior allows it to accurately model the user's beliefs (realizability
assumption), then the criterion seems correct. But, imagine that the user's
beliefs are too complex and an accurate model is not possible. Then the
realizability assumption is violated and the regret bound guarantees nothing.
More precisely, the AI may use incomplete models
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda]
to capture some properties of the user's beliefs and exploit them, but this
might not be good enough. Therefore, such an AI might fall into a dangerous zone
when it is powerful enough to cause catastrophic damage but not powerful enough
to know it shouldn't do it.
To fix this problem, we need to introduce another criterion which has to hold
simultaneously with the misalignment bound. We need that for any reality that
satisfies the basic assumptions built into the prior (such as, the baseline
policy is fairly safe, most questions are fairly safe, human beliefs don't
change too fast etc), the agent will not fail catastrophically. (Asking it to
converge to optimality would be way too much, since that would violate
no-free-lunch.) In order to formalize "not fail catastrophically" I propose the
following definition.
Let's start with the case when the user's preferences and beliefs are
dynamically consistent. Consider some AI-observable event S that might happen in
the world. Consider a candidate learning algorithm πlearn and two auxiliary
policies. The policy πbase→S follows the baseline policy until S happens, at
which time it switches to the subjectively optimal policy. The policy πlearn→S
follows the candidate learning algorithm unt

Alex Turner (2y):
This seems quite close (or even identical) to attainable utility preservation
[https://arxiv.org/abs/1902.09725]; if I understand correctly, this echoes
arguments I've made
[https://www.lesswrong.com/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure#wXHJArzDPoYejHuz2]
for why AUP has a good shot of avoiding catastrophes and thereby getting you
something which feels similar to corrigibility.

Vanessa Kosoy (2y):
There is some similarity, but there are also major differences. They don't even
have the same type signature. The dangerousness bound is a desideratum that any
given algorithm can either satisfy or not. On the other hand, AUP is a specific
heuristic for how to tweak Q-learning. I guess you can consider some kind of regret
bound w.r.t. the AUP reward function, but they will still be very different
conditions.
The reason I pointed out the relation to corrigibility is not because I think
that's the main justification for the dangerousness bound. The motivation for
the dangerousness bound is quite straightforward and self-contained: it is a
formalization of the condition that "if you run this AI, this won't make things
worse than not running the AI", no more and no less. Rather, I pointed the
relation out to help readers compare it with other ways of thinking they might
be familiar with.
From my perspective, the main question is whether satisfying this desideratum is
feasible. I gave some arguments why it might be, but there are also opposite
arguments. Specifically, if you believe that debate is a necessary component of
Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can
become certain that the user would respond in a particular way to a query, but
it cannot become (worst-case) certain that the user would not change eir
response when faced with some rebuttal. You can't (empirically and in the
worst-case) prove a negative.

Vanessa Kosoy (2y):
Dialogic RL assumes that the user has beliefs about the AI's ontology. This
includes the environment(fn1) from the AI's perspective. In other words, the
user needs to have beliefs about the AI's counterfactuals (the things that would
happen if the AI chooses different possible actions). But, what are the
semantics of the AI's counterfactuals from the user's perspective? This is more
or less the same question that was studied by the MIRI-sphere for a while,
starting from Newcomb's paradox, TDT et cetera. Luckily, I now have an answer
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/shortform#TzkG7veQAMMRNh3Pg]
based on the incomplete models formalism. This answer can be applied in this
case also, quite naturally.
Specifically, we assume that there is a sense, meaningful to the user, in which
ey select the AI policy (program the AI). Therefore, from the user's
perspective, the AI policy is a user action. Again from the user's perspective,
the AI's actions and observations are all part of the outcome. The user's
beliefs about the user's counterfactuals can therefore be expressed as
σ:Π→Δ(A×O)ω(fn2), where Π is the space of AI policies(fn3). We assume that for
every π∈Π, σ(π) is consistent with π in the natural sense. Such a belief can be
transformed
into an incomplete model from the AI's perspective, using the same technique we
used to solve Newcomb-like decision problems, with σ playing the role of Omega.
For a deterministic AI, this model looks like (i) at first, "Murphy" makes a
guess that the AI's policy is π=πguess (ii) The environment behaves according to
the conditional measures of σ(πguess) (iii) If the AI's policy ever deviates
from πguess, the AI immediately enters an eternal "Nirvana" state with maximal
reward. For a stochastic AI, we need to apply the technique with statistical
tests and multiple models alluded to in the link. This can also be generalized
to the setting where the user's beliefs are already an incomplete model, by
adding another step

Vanessa Kosoy (2y):
Another notable feature of this approach is its resistance to "attacks from the
future", as opposed to approaches based on forecasting. In the latter, the AI
has to predict some future observation, for example what the user will write
after working on some problem for a long time. In particular, this is how the
distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster
might sample a future in which a UFAI has been instantiated and this UFAI will
exploit this to infiltrate the present. This might result in a self-fulfilling
prophecy, but even if the forecasting is counterfactual (and thus immune to
self-fulfilling prophecies) it can be attacked by a UFAI that came to be for
unrelated reasons. We can ameliorate this by making the forecasting recursive
(i.e. apply multiple distillation & amplification steps) or use some other
technique to compress a lot of "thinking time" into a small interval of physical
time. However, this is still vulnerable to UFAIs that might arise already at
present with a small probability rate (these are likely to exist since our
putative FAI is deployed at a time when technology progressed enough to make
competing AGI projects a real possibility).
Now, compare this to Dialogic RL, as defined via the framework of dynamically
inconsistent beliefs. Dialogic RL might also employ forecasting to sample the
future, presumably more accurate, beliefs of the user. However, if the user is
aware of the possibility of a future attack, this possibility is reflected in
eir beliefs, and the AI will automatically take it into account and deflect it
as much as possible.

Vanessa Kosoy (2y):
This approach also obviates the need for an explicit commitment mechanism.
Instead, the AI uses the current user's beliefs about the quality of future user
beliefs to decide whether it should wait for user's beliefs to improve or commit
to an irreversible course of action. Sometimes it can also predict the future
user beliefs instead of waiting (predict according to current user beliefs
updated by the AI's observations).

In my previous shortform, I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.

In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating

A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but related. Therefore, it is desirable to not only solve each separately, but also to have an elegant synthesis of the solutions.

Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).

Currently I only have speculations about the solution. But, I have a few desiderata for it:

It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL, the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements in a set. The formalization of an "ontology" should probably be a POMDP or a suitable


An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms[1] might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert... (read more)

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.

A possible counterargument is, we don't need to depart far from Bayesianis... (read more)

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A, a set of observations O, a policy is a mapping π:(A×O)∗→ΔA, a reward function is a mapping r:(A×O)∗→[0,1], the utility function is a time discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)
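For readers who prefer code, the objects in this setting can be written down as Python types. This is only a sketch under assumptions not fixed by the text: actions and observations are strings, the discount factor is called `gamma`, the utility is left unnormalized, and the reward for the n-th step is taken to be r applied to the first n action-observation pairs.

```python
from typing import Callable, Dict, Tuple

History = Tuple[Tuple[str, str], ...]           # an element of (A×O)*
Policy = Callable[[History], Dict[str, float]]  # π: (A×O)* → ΔA
Reward = Callable[[History], float]             # r: (A×O)* → [0,1]


def discounted_utility(history: History, reward: Reward, gamma: float) -> float:
    """Time discounted sum of rewards along a finite history prefix."""
    total = 0.0
    for n in range(1, len(history) + 1):
        # Assumed convention: the n-th reward is discounted by gamma^(n-1)
        total += gamma ** (n - 1) * reward(history[:n])
    return total


def uniform_policy(history: History) -> Dict[str, float]:
    """A policy that ignores the history and randomizes over two actions."""
    return {"a": 0.5, "b": 0.5}


def win_reward(history: History) -> float:
    """Reward 1 whenever the latest observation is "win", else 0."""
    return 1.0 if history and history[-1][1] == "win" else 0.0


h = (("a", "lose"), ("a", "win"), ("b", "win"))
print(discounted_utility(h, win_reward, gamma=0.5))
```

Note that both the policy and the reward take the whole history as input, which is what makes it so easy to construct contrived rewards that only reward one particular behavior.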

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript... (read more)

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ... (read more)

Some thoughts about embedded agency.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows:

What kind of agent, and in what conditions, can effectively plan for events after its own death?

For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent[1]. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology... (read more)

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati... (read more)

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to in... (read more)

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B... (read more)

In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular in MIRI (see also discussion):

desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max... (read more)

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...}. At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re... (read more)

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, mea... (read more)

Master post for alignment protocols.

Other relevant shortforms:

There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.

We want algorithms that learn domain D while gaining as little knowledge as possible about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space X, label space Y, distribution μ∈Δ(X×Y) and loss function L:Y×Y→R. Similarly, domain E is represented by inst... (read more)

Epistemic status: most elements are not new, but the synthesis seems useful.

Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).

Arguably the biggest concern with naive debate[1] is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterargument" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quanti... (read more)

Epistemic status: no claims to novelty, just (possibly) useful terminology.

[EDIT: I increased all the class numbers by 1 in order to admit a new definition of "class I", see child comment.]

I propose a classification of AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the "potential" class and the latter the "effective" class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems that are effectively class 0 (or at least I-II).

Class II: Systems that only ever receive synthetic data that has nothing to do with the real world

Examples:

Master post for ideas about infra-Bayesianism.
