I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:

Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.

Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.

Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.

Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.

The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.

In principle, we might be able to use knowledge of metacosmology in order to engineer an "atheist prior" for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms^{[1]} might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert... (read more)

This seems similar to gaining uploads prior to AGI, and opens up all those
superorg upload-city amplification/distillation constructions which should get
past human level shortly after. In other words, the limitations of the dataset
can be solved by amplification as soon as the AIs are good enough to be used as
building blocks for meaningful amplification, and something human-level-ish
seems good enough for that. Maybe even GPT-n is good enough for that.

2Vanessa Kosoy2y

That is similar to gaining uploads (borrowing terminology from Egan, we can call
them "sideloads"), but it's not obvious amplification/distillation will work. In
the model based on realizability, the distillation step can fail because the
system you're distilling is too computationally complex (hence, too
unrealizable). You can deal with it by upscaling the compute of the learning
algorithm, but that's not better than plain speedup.

1Vladimir Nesov2y

To me this seems to be essentially another limitation of the human Internet
archive dataset: reasoning is presented in an opaque way (most slow/deliberative
thoughts are not in the dataset), so it's necessary to do a lot of guesswork to
figure out how it works. A better dataset both explains and summarizes the
reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3
can do that to an extent by roleplaying Feynman).
Any algorithm can be represented by a habit of thought (Turing machine style if
you must), and if those are in the dataset, they can be learned. The habits of
thought that are simple enough to summarize get summarized and end up requiring
fewer steps. My guess is that the human faculties needed for AGI can be both
represented by sequences of thoughts (probably just text, stream of
consciousness style) and easily learned with current ML. So right now the main
obstruction is that it's not feasible to build a dataset with those faculties
represented explicitly that's good enough and large enough for current
sample-inefficient ML to grok. More compute in the learning algorithm is only
relevant for this to the extent that we get a better dataset generator that can
work on the tasks before it more reliably.

1Vanessa Kosoy2y

I don't see any strong argument why this path will produce superintelligence.
You can have a stream of thought that cannot be accelerated without investing a
proportional amount of compute, while a completely different algorithm would
produce a far superior "stream of thought". In particular, such an approach
cannot differentiate between features of the stream of thought that are
important (meaning that they advance towards the goal) and features of the
stream of thought that are unimportant (e.g. different ways to phrase the same
idea). This forces you to solve a task that is potentially much more difficult
than just achieving the goal.

1Vladimir Nesov2y

I was arguing that near human level babblers (including the imitation plateau
you were talking about) should quickly lead to human level AGIs by amplification
via stream of consciousness datasets, which doesn't pose new ML difficulties
other than design of the dataset. Superintelligence follows from that by any of
the same arguments as for uploads leading to AGI (much faster technological
progress; if amplification/distillation of uploads is useful straight away, we
get there faster, but it's not necessary). And amplified babblers should be
stronger than vanilla uploads (at least implausibly well-educated,
well-coordinated, high IQ humans).
For your scenario to be stable, it needs to be impossible (in the near term) to
run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain
less effective than very high IQ humans. Otherwise you get acceleration of
technological progress, including ML. So my point is that feasibility of
imitation plateau depends on absence of compute overhang, not on ML failing to
capture some of the ingredients of human general intelligence.

1Vanessa Kosoy2y

The imitation plateau can definitely be rather short. I also agree that
computational overhang is the major factor here. However, a failure to capture
some of the ingredients can be a cause of low computational overhang, whereas
success in capturing all of the ingredients is a cause of high computational
overhang, because the compute necessary to reach superintelligence might be very
different in those two cases. Using sideloads to accelerate progress might still
require years, whereas an "intrinsic" AGI might lead to the classical "foom"
scenario.
EDIT: Although, since training is typically much more computationally expensive
than deployment, it is likely that the first human-level imitators will already
be significantly sped-up compared to humans, implying that accelerating progress
will be relatively easy. It might still take some time from the first prototype
until such an accelerate-the-progress project, but probably not much longer than
deploying lots of automation.

1Vladimir Nesov2y

I agree. But GPT-3 seems to me like a good estimate for how much compute it
takes to run stream of consciousness imitation learning sideloads (assuming that
learning is done in batches on datasets carefully prepared by non-learning
sideloads, so the cost of learning is less important). And with that estimate we
already have enough compute overhang to accelerate technological progress as
soon as the first amplified babbler AGIs are developed, which, as I argued
above, should happen shortly after babblers actually useful for automation of
human jobs are developed (because generation of stream of consciousness datasets
is a special case of such a job).
So the key things to make imitation plateau last for years are either sideloads
requiring more compute than it looks like (to me) they require, or amplification
of competent babblers into similarly competent AGIs being a hard problem that
takes a long time to solve.

2Vanessa Kosoy2y

Another thing that might happen is a data bottleneck.
Maybe there will be a good enough dataset to produce a sideload that simulates
an "average" person, and that will be enough to automate many jobs, but for a
simulation of a competent AI researcher you would need a more specialized
dataset that will take more time to produce (since there are far fewer
competent AI researchers than people in general).
Moreover, it might be that the sample complexity grows with the duration of
coherent thought that you require. That's because, unless you're training
directly on brain inputs/outputs, non-realizable (computationally complex)
environment influences contaminate the data, and in order to converge you need
to have enough data to average them out, which scales with the length of your
"episodes". Indeed, all convergence results for Bayesian algorithms we have in
the non-realizable setting require ergodicity, and therefore the time of
convergence (= sample complexity) scales with mixing time, which in our case is
determined by episode length.
In such a case, we might discover that many tasks can be automated by sideloads
with short coherence time, but AI research might require substantially longer
coherence times. And, simulating progress requires by design going
off-distribution along certain dimensions which might make things worse.

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider (some extension of) a delegative IRL setting (i.e. there is a single set of input/output channels, control of which the AI can toggle between the user and itself). Let $\pi^u_\upsilon$ be the user's policy in universe $\upsilon$ and $\pi^a$ the AI policy. Let $T$ be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability 1 for any policy. Let $V_\upsilon$ be the value of a state from the user's subjective POV, in universe $\upsilon$. Let $\mu_\upsilon$ be the environment in universe $\upsilon$. Finally, let $\zeta$ be the AI's prior over universes and $\epsilon$... (read more)

(Update: I don't think this was 100% right, see here
[https://www.lesswrong.com/posts/SzrmsbkqydpZyPuEh/my-take-on-vanessa-kosoy-s-take-on-agi-safety#4_1_Example___the_Hippocratic_principle__desideratum__and_an_algorithm_that_obeys_it]
for a better version.)
Attempted summary for morons like me: AI is trying to help the human H. They
share access to a single output channel, e.g. a computer keyboard, so that the
actions that H can take are exactly the same as the actions AI can take. Every
step, AI can either take an action, or delegate to H to take an action. Also,
every step, H reports her current assessment of the timeline / probability
distribution for whether she'll succeed at the task, and if so, how soon.
At first, AI will probably delegate to H a lot, and by watching H work, AI will
gradually learn both the human policy (i.e. what H tends to do in different
situations), and how different actions tend to turn out in hindsight from H's
own perspective (e.g., maybe whenever H takes action 17, she tends to declare
shortly afterwards that probability of success now seems much higher than
before—so really H should probably be taking action 17 more often!).
Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate
how different actions will turn out from H's perspective much better than H
herself. In other words, maybe it delegates to H, and H takes action 41, and the
AI is watching this and shaking its head and thinking to itself "gee you dunce
you're gonna regret that", and shortly thereafter the AI is proven correct.
OK, so now what? The naive answer would be: the AI should gradually stop
delegating and start just doing the thing that leads to H feeling maximally
optimistic later on.
But we don't want to do that naive thing. There are two problems:
The first problem is "traps" (a.k.a. catastrophes). Let's say action 0 is Press
The History Eraser Button [https://vimeo.com/126720159]. H never takes that
action. The AI shouldn't either. W

2Vanessa Kosoy2y

This is about right.
Notice that typically we use the AI for tasks which are hard for H. This means
that without the AI's help, H's probability of success will usually be low.
Quantilization-wise, this is a problem: the AI will be able to eliminate those
paths for which H will report failure, but maybe most of the probability mass
among apparent-success paths is still on failure (i.e. the success report is
corrupt). This is why the timeline part is important.
On a typical task, H expects to fail eventually but they don't expect to fail
soon. Therefore, the AI can safely consider policies of the form "in the
short-term, do something H would do with marginal probability, in the long-term
go back to H's policy". If by the end of the short-term maneuver H reports an
improved prognosis, this can imply that the improvement is genuine (since the AI
knows H is probably uncorrupted at this point). Moreover, it's possible that in
the new prognosis H still doesn't expect to fail soon. This allows performing
another maneuver of the same type. This way, the AI can iteratively steer the
trajectory towards true success.

4Alex Turner1y

The Hippocratic principle seems similar to my concept of non-obstruction
(https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility),
but subjective from the human's beliefs instead of the AI's.

2Vanessa Kosoy1y

Yes, there is some similarity! You could say that a Hippocratic AI needs to be
continuously non-obstructive w.r.t. the set of utility functions and priors the
user could plausibly have, given what the AI knows. Where, by "continuously" I
mean that we are allowed to compare keeping the AI on or turning off at any
given moment.

2Vanessa Kosoy1y

"Corrigibility" is usually defined as the property of AIs who don't resist
modifications by their designers. Why would we want to perform such
modifications? Mainly it's because we made errors in the initial implementation,
and in particular the initial implementation is not aligned. But, this leads to
a paradox: if we assume our initial implementation to be flawed in a way that
destroys alignment, why wouldn't it also be flawed in a way that destroys
corrigibility?
In order to stop passing the recursive buck, we must assume some dimensions
along which our initial implementation is not allowed to be flawed. Therefore,
corrigibility is only a well-posed notion in the context of a particular such
assumption. Seen through this lens, the Hippocratic principle becomes a
particular crystallization of corrigibility. Specifically, the Hippocratic
principle assumes the agent has access to some reliable information about the
user's policy and preferences (be it through timelines, revealed preferences or
anything else).
Importantly, this information can be incomplete, which can motivate altering the
agent along the way. And, the agent will not resist this alteration! Indeed,
resisting the alteration is ruled out unless the AI can conclude with high
confidence (and not just in expectation) that such resistance is harmless. Since
we assumed the information is reliable, and the alteration is beneficial, the AI
cannot reach such a conclusion.
For example, consider an HDTL agent getting upgraded to "Hippocratic CIRL"
(assuming some sophisticated model of relationship between human behavior and
human preferences). In order to resist the modification, the agent would need a
resistance strategy that (i) doesn't deviate too much from the human baseline
and (ii) ends with the user submitting a favorable report. Such a strategy is
quite unlikely to exist.

2Charlie Steiner1y

I think the people most interested in corrigibility are imagining a situation
where we know what we're doing with corrigibility (e.g. we have some grab-bag of
simple properties we want satisfied), but don't even know what we want from
alignment, and then they imagine building an unaligned slightly-sub-human AGI
and poking at it while we "figure out alignment."
Maybe this is a strawman, because the thing I'm describing doesn't make
strategic sense, but I think it does have some model of why we might end up with
something unaligned but corrigible (for at least a short period).

3Vanessa Kosoy1y

The concept of corrigibility was introduced by MIRI, and I don't think that's
their motivation? On my model of MIRI's model, we won't have time to poke at a
slightly subhuman AI, we need to have at least a fairly good notion of what to
do with a superhuman AI upfront. Maybe what you meant is "we won't know how to
construct perfect-utopia-AI, so we will just construct a
prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI
in our leisure". Which, sure, but I don't see what it has to do with
corrigibility.
Corrigibility is neither necessary nor sufficient for safety. It's not strictly
necessary because in theory an AI can resist modifications in some scenarios
while always doing the right thing (although in practice resisting modifications
is an enormous red flag), and it's not sufficient since an AI can be
"corrigible" but cause catastrophic harm before someone notices and fixes it.
What we're supposed to gain from corrigibility is having some margin of error
around alignment, in which case we can decompose alignment as corrigibility +
approximate alignment. But it is underspecified if we don't say along which
dimensions or how big the margin is. If it's infinite margin along all
dimensions then corrigibility and alignment are just isomorphic and there's no
reason to talk about the former.

1Charlie Steiner2y

Very interesting - I'm sad I saw this 6 months late.
After thinking a bit, I'm still not sure if I want this desideratum. It seems to
require a sort of monotonicity, where we can get superhuman performance just by
going through states that humans recognize as good, and not by going through
states that humans would think are weird or scary or unevaluable.
One case where this might come up is in competitive games. Chess AI beats humans
in part because it makes moves that many humans evaluate as bad, but are
actually good. But maybe this example actually supports your proposal - it seems
entirely plausible to make a chess engine that only makes moves that some given
population of humans recognize as good, but is better than any human from that
population.
On the other hand, the humans might be wrong about the reason the move is good,
so that the game is made of a bunch of moves that seem good to humans, but where
the humans are actually wrong about why they're good (from the human
perspective, this looks like regularly having "happy surprises"). We might hope
that such human misevaluations are rare enough that quantilization would lead to
moves on average being well-evaluated by humans, but for chess I think that
might be false! Computers are so much better than humans at chess that a very
large chunk of the best moves according to both humans and the computer will be
ones that humans misevaluate.
Maybe that's more a criticism of quantilizers, not a criticism of this
desideratum. So maybe the chess example supports this being a good thing to
want? But let me keep critiquing quantilizers then :P
If what a powerful AI thinks is best (by an exponential amount) is to turn off
the stars [https://en.wikipedia.org/wiki/Star_lifting] until the universe is
colder, but humans think it's scary and ban the AI from doing scary things, the
AI will still try to turn off the stars in one of the edge-case ways that humans
wouldn't find scary. And if we think being manipulated like

1Vanessa Kosoy2y

When I'm deciding whether to run an AI, I should be maximizing the expectation
of my utility function w.r.t. my belief state. This is just what it means to act
rationally. You can then ask, how is this compatible with trusting another agent
smarter than myself?
One potentially useful model is: I'm good at evaluating and bad at searching
(after all, P≠NP). I can therefore delegate searching to another agent. But, as
you point out, this doesn't account for situations in which I seem to be bad at
evaluating. Moreover, if the AI prior takes an intentional stance towards the
user (in order to help learning their preferences), then the user must be
regarded as good at searching.
A better model is: I'm good at both evaluating and searching, but the AI can
access actions and observations that I cannot. For example, having additional
information can allow it to evaluate better. An important special case is: the
AI is connected to an external computer (Turing RL
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#fEKc88NbDWZavkW9o])
which we can think of as an "oracle". This allows the AI to have additional
information which is purely "logical". We need infra-Bayesianism to formalize
this: the user has Knightian uncertainty over the oracle's outputs entangled
with other beliefs about the universe.
For instance, in the chess example, if I know that a move was produced by
exhaustive game-tree search then I know it's a good move, even without having
the skill to understand why the move is good in any more detail.
Now let's examine short-term quantilization for chess. On each cycle, the AI
finds a short-term strategy leading to a position that the user evaluates as
good, but that the user would require luck to manage on their own. This is
repeated again and again throughout the game, leading to overall play
substantially superior to the user's. On the other hand, this play is not as
good as the AI would achieve if it just optimized

1Charlie Steiner2y

Agree with the first section, though I would like to register my sentiment that
although "good at selecting but missing logical facts" is a better model, it's
still not one I'd want an AI to use when inferring my values.
I think my point is if "turn off the stars" is not a primitive action, but is a
set of states of the world that the AI would overwhelmingly like to go to, then
the actual primitive actions will get evaluated based on how well they end up
going to that goal state. And since the AI is better at evaluating than us,
we're probably going there.
Another way of looking at this claim is that I'm telling a story about why the
safety bound on quantilizers gets worse when quantilization is iterated.
Iterated quantilization has much worse bounds than quantilizing over the
iterated game, which makes sense if we think of games where the AI evaluates
many actions better than the human.

1Vanessa Kosoy2y

I think you misunderstood how the iterated quantilization works. It does not
work by the AI setting a long-term goal and then charting a path towards that
goal s.t. it doesn't deviate too much from the baseline over every short
interval. Instead, every short-term quantilization is optimizing for the user's
evaluation in the end of this short-term interval.

1Charlie Steiner2y

Ah. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as
quantilizing over short-term policies evaluated according to their expected
utility. My story doesn't make sense if the AI is only trying to push up the
reported value estimates (though that puts a lot of weight on these estimates).

1Adam Shimi2y

I don't understand what you mean here by quantilizing. The meaning I know is to
take a random action over the top \alpha actions, on a given base distribution.
But I don't see a distribution here, or even a clear ordering over actions
(given that we don't have access to the utility function).
I'm probably missing something obvious, but more details would really help.

2Vanessa Kosoy2y

The distribution is the user's policy, and the utility function for this purpose
is the eventual success probability estimated by the user (as part of the
timeline report), in the end of the "maneuver". More precisely, the original
quantilization formalism was for the one-shot setting, but you can easily
generalize it, for example I did it
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375556/quantilal-control-for-finite-mdps]
for MDPs.

1Adam Shimi2y

Oh, right, that makes a lot of sense.
So is the general idea that we quantilize such that we're choosing in
expectation an action that doesn't have corrupted utility (by intuitively having
something like more than twice as many actions in the quantilization than we
expect to be corrupted), so that we guarantee the probability of following the
manipulation of the learned user report is small?
I also wonder if using the user policy to sample actions isn't limiting, because
then we can only take actions that the user would take. Or do you assume by
default that the support of the user policy is the full action space, so every
action is possible for the AI?

1Vanessa Kosoy2y

Yes, although you probably want much more than twice. Basically, if the
probability of corruption following the user policy is ϵ and your quantilization
fraction is ϕ then the AI's probability of corruption is bounded by ϵ/ϕ.
Obviously it is limiting, but this is the price of safety. Notice, however, that
the quantilization strategy is only an existence proof. In principle, there
might be better strategies, depending on the prior (for example, the AI might be
able to exploit an assumption that the user is quasi-rational). I didn't specify
the AI by quantilization, I specified it by maximizing EU subject to the
Hippocratic constraint. Also, the support is not really the important part: even
if the support is the full action space, some sequences of actions are possible
but so unlikely that the quantilization will never follow them.
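
To make the bound concrete, here is a minimal sketch of one-shot quantilization over a finite action set. It is an illustration only, not the formalism from the linked post: the names `quantilize` and `corruption_probability`, the uniform ten-action base distribution, and the adversarial choice of giving the corrupted action the highest reported utility are all assumptions made for the example. It shows that conditioning the user's policy on its top-ϕ mass can inflate the probability of any event by at most a factor of 1/ϕ, so an event of base probability ϵ ends up with probability at most ϵ/ϕ.

```python
from fractions import Fraction


def quantilize(base, utility, phi):
    """Condition `base` on its top-`phi` probability mass, ranked by `utility`.

    `base` maps actions to probabilities (summing to 1), `utility` maps
    actions to reported utilities, and `phi` is the quantilization fraction.
    """
    ranked = sorted(base, key=lambda a: utility[a], reverse=True)
    remaining = phi
    result = {}
    for action in ranked:
        if remaining <= 0:
            break
        # Take as much of this action's base mass as still fits in the top-phi slice.
        take = min(base[action], remaining)
        result[action] = take / phi
        remaining -= take
    return result


def corruption_probability(dist, corrupt):
    """Total probability that `dist` assigns to the corrupted actions."""
    return sum(p for a, p in dist.items() if a in corrupt)


# Hypothetical example: the user's policy is uniform over ten actions.
base = {f"a{i}": Fraction(1, 10) for i in range(10)}
# Worst case: the corrupted action is the one with the highest reported utility.
utility = {f"a{i}": i for i in range(10)}
corrupt = {"a9"}

epsilon = corruption_probability(base, corrupt)  # 1/10 under the user policy
phi = Fraction(1, 4)
quantilized = quantilize(base, utility, phi)

print(corruption_probability(quantilized, corrupt))  # 2/5
print(epsilon / phi)                                 # 2/5, so the bound is tight here
```

The example is deliberately adversarial, which is why the bound is attained exactly. With a smaller ϵ or a larger ϕ the quantilized policy stays closer to the user's own risk level, at the price of optimizing less.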

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what its goals are, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A and a set of observations O; a policy is a mapping π:(A×O)∗→ΔA; a reward function is a mapping r:(A×O)∗→[0,1]; and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.
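
The vacuity can be seen in a small sketch. Everything here is hypothetical and chosen only for illustration: the helper names `wrapper_reward` and `rollout_return`, the two-action toy environment, the horizon of 3 and the discount 0.9. The reward function pays 1 exactly when the history so far agrees with a given deterministic policy, so that policy comes out optimal no matter how arbitrary it is.

```python
from itertools import product


def wrapper_reward(policy):
    """Build a reward r:(A x O)* -> [0, 1] that pays 1 iff the history follows `policy`."""
    def reward(history):
        # `history` is a tuple of (action, observation) pairs.
        for t, (action, _obs) in enumerate(history):
            if action != policy(history[:t]):
                return 0.0
        return 1.0
    return reward


def rollout_return(actions, env, reward, gamma=0.9):
    """Discounted sum of rewards along a fixed action sequence in a deterministic env."""
    history = ()
    total = 0.0
    for t, action in enumerate(actions):
        history += ((action, env(history, action)),)
        total += gamma ** t * reward(history)
    return total


def arbitrary_policy(history):
    """An arbitrary, goal-free policy: alternate actions 0, 1, 0, ..."""
    return len(history) % 2


def toy_env(history, action):
    """A toy deterministic environment; the observation plays no role in the reward."""
    return (len(history) + action) % 2


reward = wrapper_reward(arbitrary_policy)
best = max(product([0, 1], repeat=3),
           key=lambda acts: rollout_return(acts, toy_env, reward))
print(best)  # (0, 1, 0): exactly the actions arbitrary_policy would take
```

Any deterministic policy can be substituted for `arbitrary_policy` with the same result, which is why optimality for some reward function carries no information by itself.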

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript... (read more)

I am not sure I understand your use of C(U) in the third from last paragraph
where you define goal directed intelligence. As you define C it is a complexity
measure over programs P. I assume this was a typo and you mean K(U)? Or am I
misunderstanding the definition of either U or C?

2Vanessa Kosoy25d

This is not a typo.
I'm imagining that we have a program $P$ that outputs (i) a time discount
parameter $\gamma \in \mathbb{Q} \cap [0,1)$, (ii) a circuit for the transition
kernel of an automaton $T: S \times A \times O \to S$ and (iii) a circuit for a
reward function $r: S \to \mathbb{Q}$ (and, ii+iii are allowed to have a shared
component to save computation time complexity). The utility function is
$U: (A \times O)^\omega \to \mathbb{R}$ defined by
$$U(x) := (1-\gamma) \sum_{n=0}^{\infty} \gamma^n r(s^x_n)$$
where $s^x \in S^\omega$ is defined recursively by
$$s^x_{n+1} = T(s^x_n, x_n)$$
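
As a rough illustration of the definition above, the sketch below evaluates a finite truncation of the utility for a toy automaton. Several things are assumptions, not part of the original comment: the function name `truncated_utility`, the explicit initial state `s0` (the excerpt does not say how the initial state is fixed), the toy transition and reward functions, and the use of plain Python callables in place of circuits.

```python
def truncated_utility(gamma, transition, reward, s0, history):
    """Compute (1 - gamma) * sum_{n < N} gamma^n * r(s_n) for a length-N history.

    States follow s_{n+1} = T(s_n, a_n, o_n), starting from the assumed
    initial state `s0`. `history` is a sequence of (action, observation) pairs.
    """
    state = s0
    total = 0.0
    for n, (action, observation) in enumerate(history):
        total += gamma ** n * reward(state)
        state = transition(state, action, observation)
    return (1 - gamma) * total


def toy_transition(state, action, observation):
    """Toy automaton: the state counts how many times observation 1 was seen."""
    return state + observation


def toy_reward(state):
    """Toy reward: 1 once observation 1 has been seen at least once, else 0."""
    return 1.0 if state >= 1 else 0.0


history = [(0, 0), (0, 1), (0, 0), (0, 0)]
print(truncated_utility(0.5, toy_transition, toy_reward, 0, history))  # 0.1875
```

If the rewards are additionally assumed to lie in [0, 1], the terms dropped by truncating after N steps contribute at most γ^N, so the truncation converges to U(x) geometrically.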

2Vanessa Kosoy3y

Actually, as opposed to what I claimed before, we don't need computational
complexity bounds for this definition to make sense. This is because the
Solomonoff prior is made of computable hypotheses but is uncomputable itself.
Given $g > 0$, we define that "$\pi$ has (unbounded) goal-directed intelligence
(at least) $g$" when there is a prior $\zeta$ and utility function $U$ s.t. for
any policy $\pi'$, if $\mathbb{E}_{\zeta\pi'}[U] \geq \mathbb{E}_{\zeta\pi}[U]$
then $K(\pi') \geq D_{KL}(\zeta_0 \| \zeta) + K(U) + g$. Here, $\zeta_0$ is the
Solomonoff prior and $K$ is Kolmogorov complexity. When $g = +\infty$ (i.e. no
computable policy can match the expected utility of $\pi$; in particular, this
implies $\pi$ is optimal since any policy can be approximated by a computable
policy), we say that $\pi$ is "perfectly (unbounded) goal-directed".
Compare this notion to the Legg-Hutter intelligence measure. The LH measure
depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI
(which is the maximum of the LH measure) becomes computable or even really
stupid. For example, it can always keep taking the same action because of the
fear that taking any other action leads to an inescapable "hell" state. On the
other hand, goal-directed intelligence differs only by O(1) between UTMs, just
like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be
uncomputable, and the notion of which policies are such doesn't depend on the
UTM at all.
I think that it's also possible to prove that intelligence is rare, in the sense
that, for any computable stochastic policy, if we regard it as a probability
measure over deterministic policies, then for any ϵ>0 there is g s.t. the
probability to get intelligence at least g is smaller than ϵ.
Also interesting is that, for bounded goal-directed intelligence, increasing the
prices can only decrease intelligence by O(1), and a policy that is perfectly
goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think).
In particular, a perfectly unbounded goal-directed policy is perfectly
goal-directed for any price vec

1Vanessa Kosoy3y

Some problems to work on regarding goal-directed intelligence. Conjecture 5 is
especially important for deconfusing basic questions in alignment, as it stands
in opposition to Stuart Armstrong's thesis about the impossibility to deduce
preferences from behavior alone.
1. Conjecture. Informally: It is unlikely to produce intelligence by chance.
Formally: Denote $\Pi$ the space of deterministic policies, and consider some
$\mu \in \Delta\Pi$. Suppose $\mu$ is equivalent to a stochastic policy $\pi^*$. Then,
$\mathbb{E}_{\pi \sim \mu}[g(\pi)] = O(C(\pi^*))$.
2. Find an "intelligence hierarchy theorem". That is, find an increasing
sequence $\{g_n\}$ s.t. for every $n$, there is a policy with goal-directed
intelligence in $(g_n, g_{n+1})$ (no more and no less).
3. What is the computational complexity of evaluating $g$ given (i) oracle access
to the policy or (ii) description of the policy as a program or automaton?
4. What is the computational complexity of producing a policy with given $g$?
5. Conjecture. Informally: Intelligent agents have well defined priors and
utility functions. Formally: For every $(U,\zeta)$ with $C(U) < \infty$ and
$D_{KL}(\zeta_0 \| \zeta) < \infty$, and every $\epsilon > 0$, there exists
$g \in (0,\infty)$ s.t. for every policy $\pi$ with intelligence at least $g$
w.r.t. $(U,\zeta)$, and every $(\tilde{U},\tilde{\zeta})$ s.t. $\pi$ has
intelligence at least $g$ w.r.t. them, any optimal policies $\pi^*, \tilde{\pi}^*$
for $(U,\zeta)$ and $(\tilde{U},\tilde{\zeta})$ respectively satisfy
$\mathbb{E}_{\zeta\tilde{\pi}^*}[U] \geq \mathbb{E}_{\zeta\pi^*}[U] - \epsilon$.

1David Manheim2y

re: #5, that doesn't seem to claim that we can infer U given their actions,
which is what the impossibility of deducing preferences is actually claiming.
That is, assuming 5, we still cannot show that there isn't some U1≠U2 such
that π∗(U1,ζ)=π∗(U2,ζ).
(And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and
basic result in the decision theory / economics / philosophy literature.)

1Vanessa Kosoy2y

You misunderstand the intent. We're talking about inverse reinforcement
learning. The goal is not necessarily inferring the unknown U, but producing
some behavior that optimizes the unknown U. Ofc if the policy you're observing
is optimal then it's trivial to do so by following the same policy. But, using
my approach we might be able to extend it into results like "the policy you're
observing is optimal w.r.t. certain computational complexity, and your goal is
to produce an optimal policy w.r.t. higher computational complexity."
(Btw I think the formal statement I gave for 5 is false, but there might be an
alternative version that works.)
I am referring to this
[http://papers.neurips.cc/paper/7803-occams-razor-is-insufficient-to-infer-the-preferences-of-irrational-agents]
and related work by Armstrong.

1David Scott Krueger2mo

Apologies, I didn't take the time to understand all of this yet, but I have a
basic question you might have an answer to...
We know how to map (deterministic) policies to reward functions using the
construction at the bottom of page 6 of the reward modelling agenda
(https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so
far done exactly what the policy would do. I think of this as a wrapper
function (https://en.wikipedia.org/wiki/Wrapper_function).
It seems like this means that, for any policy, we can represent it as optimizing
reward with only the minimal overhead in description/computational complexity of
the wrapper.
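
The wrapper construction described above can be sketched as follows. This is only an illustration, assuming histories are sequences of (observation, action) pairs and the policy is a deterministic function of the observations seen so far; the names `wrap_policy_as_reward` and `echo_policy` are hypothetical and are not taken from the cited paper.

```python
from typing import Callable, List, Sequence, Tuple

Obs = int
Act = int
Policy = Callable[[Sequence[Obs]], Act]  # deterministic: observations so far -> action


def wrap_policy_as_reward(policy: Policy) -> Callable[[Sequence[Tuple[Obs, Act]]], float]:
    """Return a reward function that pays 1 only while the history matches `policy`."""

    def reward(history: Sequence[Tuple[Obs, Act]]) -> float:
        observations: List[Obs] = []
        for obs, act in history:
            observations.append(obs)
            if policy(observations) != act:
                return 0.0  # a single deviation forfeits the reward
        return 1.0

    return reward


def echo_policy(observations: Sequence[Obs]) -> Act:
    """Hypothetical toy policy: repeat the latest observation as the action."""
    return observations[-1]


reward = wrap_policy_as_reward(echo_policy)
on_policy = [(0, 0), (1, 1), (1, 1)]   # every action agrees with echo_policy
off_policy = [(0, 0), (1, 0), (1, 1)]  # the second action deviates
```

The wrapper adds only a fixed amount of code on top of the policy itself, which is the "minimal overhead" in question.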
So...
* Do you think this analysis is correct? Or what is it missing? (maybe the
assumption that the policy is deterministic is significant? This turns out
to be the case for Orseau et al.'s "Agents and Devices" approach, I think
https://arxiv.org/abs/1805.12387).
* Are you trying to get around this somehow? Or are you fine with this minimal
overhead being used to distinguish goal-directed from non-goal directed
policies?

2Vanessa Kosoy2mo

My framework discards such contrived reward functions because it penalizes for
the complexity of the reward function. In the construction you describe, we have
C(U)≈C(π). This corresponds to g≈0 (no/low intelligence). On the other hand,
policies with g≫0 (high intelligence) have the property that C(π)≫C(U) for the U
which "justifies" this g. In other words, your "minimal" overhead is very large
from my point of view: to be acceptable, the "overhead" should be substantially
negative.

1David Scott Krueger2mo

I think the construction gives us C(π)≤C(U)+e for a small constant e
(representing the wrapper). It seems like any compression you can apply to
the reward function can be translated to the policy via the wrapper. So then
you would never have C(π)≫C(U). What am I missing/misunderstanding?

2Vanessa Kosoy2mo

For the contrived reward function you suggested, we would never have C(π)≫C(U).
But for other reward functions, it is possible that C(π)≫C(U). Which is exactly
why this framework rejects the contrived reward function in favor of those other
reward functions. And also why this framework considers some policies
unintelligent (despite the availability of the contrived reward function) and
other policies intelligent.

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.
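
The scenario can be played out in a toy simulation. This is a minimal sketch under strong assumptions that are not in the text above: bets are at even odds, Alice's credence for each event is drawn at random, the true outcome is a fair coin known to Omega, and the names `alice_side` and `omega_offers` are hypothetical.

```python
import random


def alice_side(credence: float) -> bool:
    """Deterministic Bayes-optimal choice at even odds: bet on the event iff credence >= 1/2."""
    return credence >= 0.5


def omega_offers(rounds: int, seed: int = 0) -> int:
    """Omega screens candidate bets and offers only those Alice will lose.

    Returns Alice's net winnings after `rounds` offered bets (stake 1 per bet).
    """
    rng = random.Random(seed)
    winnings = 0
    offered = 0
    while offered < rounds:
        credence = rng.random()            # Alice's prior probability for the event (assumed)
        outcome = rng.random() < 0.5       # the truth, known to Omega but not to Alice
        predicted = alice_side(credence)   # Omega predicts Alice exactly, since ey is deterministic
        if predicted == outcome:
            continue                       # Alice would win this one, so Omega never offers it
        offered += 1
        winnings += 1 if predicted == outcome else -1  # always the losing branch here
    return winnings
```

Because Alice's choice is a deterministic function of eir credence, Omega's filter never fails, and Alice loses every bet ey is offered.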

A possible counterargument is, we don't need to depart far from Bayesianis

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ

We can modify the population game setting to study superrationality. In order to
do this, we can allow the agents to see a fixed-size finite portion of their
opponents' histories. This should lead to superrationality for the same reasons
I discussed
[https://www.alignmentforum.org/posts/S3W4Xrmp6AL7nxRHd/formalising-decision-theory-is-hard#3yw2udyFfvnRC8Btr]
before
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375058/superrationality-in-arbitrary-games].
More generally, we can probably allow each agent to submit a finite state
automaton of limited size, s.t. the opponent history is processed by the
automaton and the result becomes known to the agent.
What is unclear about this is how to define an analogous setting based on source
code introspection. While arguably seeing the entire history is equivalent to
seeing the entire source code, seeing part of the history, or processing the
history through a finite state automaton, might be equivalent to some limited
access to source code, but I don't know how to define this limitation.
EDIT: Actually, the obvious analogue is processing the source code through a
finite state automaton.

1Vanessa Kosoy3y

Instead of postulating access to a portion of the history or some kind of
limited access to the opponent's source code, we can consider agents with full
access to history / source code but finite memory. The problem is, an agent with
fixed memory size usually cannot have regret going to zero, since it cannot
store probabilities with arbitrary precision. However, it seems plausible that
we can usually get learning with memory of size O(log(1/(1−γ))). This is because
something like "counting pieces of evidence" should be sufficient. For example,
if we consider finite MDPs, then it is enough to remember how many transitions of
each type occurred to encode the belief state. The question is whether assuming
O(log(1/(1−γ))) memory (or whatever is needed for learning) is enough to reach
superrationality.
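
The "counting pieces of evidence" idea for finite MDPs can be sketched as below. This is an illustration only: the uniform Dirichlet prior, the choice to cap counters at the effective horizon 1/(1−γ), and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def update(counts: Counter, state: int, action: int, next_state: int) -> None:
    """Record one observed transition; the counts are all the agent remembers."""
    counts[(state, action, next_state)] += 1


def posterior_mean(counts: Counter, state: int, action: int,
                   next_state: int, n_states: int) -> float:
    """Posterior mean transition probability under a uniform Dirichlet prior (assumed)."""
    total = sum(counts[(state, action, s)] for s in range(n_states))
    return (counts[(state, action, next_state)] + 1) / (total + n_states)


def memory_bits(n_states: int, n_actions: int, gamma: float) -> int:
    """Bits needed to store every counter up to the effective horizon 1/(1-gamma)."""
    horizon = math.ceil(1 / (1 - gamma))
    bits_per_counter = math.ceil(math.log2(horizon + 1))
    return n_states * n_actions * n_states * bits_per_counter


counts = Counter()
for next_state in (1, 1, 1, 0):
    update(counts, 0, 0, next_state)
```

For a fixed number of states and actions, the memory grows only with the logarithm of the horizon, which matches the O(log(1/(1−γ))) scaling.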

1Gurkenglas3y

What do you mean by equivalent? The entire history doesn't say what the opponent
will do later or would do against other agents, and the source code may not
allow you to prove what the agent does if it involves statements that are true
but not provable.

1Vanessa Kosoy3y

For a fixed policy, the history is the only thing you need to know in order to
simulate the agent on a given round. In this sense, seeing the history is
equivalent to seeing the source code.
The claim is: In settings where the agent has unlimited memory and sees the
entire history or source code, you can't get good guarantees (as in the folk
theorem for repeated games). On the other hand, in settings where the agent sees
part of the history, or is constrained to have finite memory (possibly of size
O(log(1/(1−γ)))?), you can (maybe?) prove convergence to Pareto efficient outcomes or
some other strong desideratum that deserves to be called "superrationality".

1Vanessa Kosoy3y

In the previous "population game" setting, we assumed all players are "born" at
the same time and learn synchronously, so that they always play against players
of the same "age" (history length). Instead, we can consider a "mortal
population game" setting where each player has a probability 1−γ to die on every
round, and new players are born to replenish the dead. So, if the size of the
population is N (we always consider the "thermodynamic" N→∞ limit), N(1−γ)
players die and the same number of players are born on every round. Each
player's utility function is a simple sum of rewards over time, so, taking
mortality into account, effectively ey have geometric time discount. (We could
use age-dependent mortality rates to get different discount shapes, or allow
each type of player to have different mortality=discount rate.) Crucially, we
group the players into games randomly, independent of age.
As before, each player type i chooses a policy πi:On→ΔAi. (We can also consider
the case where players of the same type may have different policies, but let's
keep it simple for now.) In the thermodynamic limit, the population is described
as a distribution over histories, which now are allowed to be of variable
length: μn∈ΔO∗. For each assignment of policies to player types, we get dynamics
μn+1=Tπ(μn) where Tπ:ΔO∗→ΔO∗. So, as opposed to immortal population games,
mortal population games naturally give rise to dynamical systems.
If we consider only the age distribution, then its evolution doesn't depend on π
and it always converges to the unique fixed point distribution ζ(k)=(1−γ)γ^k.
Therefore it is natural to restrict the dynamics to the subspace of ΔO∗ that
corresponds to the age distribution ζ. We denote it P.
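
The convergence of the age distribution can be checked numerically. The sketch below assumes the thermodynamic limit (so the population is a vector of masses indexed by age) and an arbitrary initial condition in which everyone is newborn; the function names are hypothetical.

```python
from typing import List


def step_age_distribution(ages: List[float], gamma: float) -> List[float]:
    """One round: a fraction gamma survives and ages by one; the dead are replaced by newborns."""
    survivors = [gamma * mass for mass in ages]
    return [1.0 - gamma] + survivors


def geometric(gamma: float, length: int) -> List[float]:
    """The fixed point zeta(k) = (1 - gamma) * gamma**k, truncated to `length` ages."""
    return [(1.0 - gamma) * gamma ** k for k in range(length)]


gamma = 0.9
ages = [1.0]  # start with everyone at age 0 (an arbitrary choice)
for _ in range(200):
    ages = step_age_distribution(ages, gamma)

target = geometric(gamma, len(ages))
max_gap = max(abs(a - z) for a, z in zip(ages, target))
```

The update never refers to the policies, which is why the age distribution evolves independently of π.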
Do the dynamics have fixed points? O∗ can be regarded as a subspace of
(O⊔{⊥})^ω. The latter is compact (in the product topology) by Tychonoff's theorem
and Polish, but O∗ is not closed. So, w.r.t. the weak topology on probability
measure spaces, Δ(O⊔{⊥})^ω is also

PRECURSOR DETECTION, CLASSIFICATION AND ASSISTANCE (PREDCA)
Infra-Bayesian physicalism
[https://www.alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized]
provides us with two key building blocks:
* Given a hypothesis about the universe, we can tell which programs are
running. (This is just the bridge transform.)
* Given a program, we can tell whether it is an agent, and if so, which utility
function it has[1] (the "evaluating agent" section of the article).
I will now outline how we can use these building blocks to solve both the inner
and outer alignment problem. The rough idea is:
* For each hypothesis in the prior, check which agents are precursors of our
agent according to this hypothesis.
* Among the precursors, check whether some are definitely neither humans nor
animals nor previously created AIs.
* If there are precursors like that, discard the hypothesis (it is probably a
malign simulation hypothesis).
* If there are no precursors like that, decide which of them are humans.
* Follow an aggregate of the utility functions of the human precursors
(conditional on the given hypothesis).
DETECTION
How to identify agents which are our agent's precursors? Let our agent be G and
let H be another agent which exists in the universe according to hypothesis
Θ[2]. Then, H is considered to be a precursor of G in universe Θ when there is
some H-policy σ s.t. applying the counterfactual "H follows σ" to Θ (in the
usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn't
run).
A possible complication is, what if Θ implies that H creates G / doesn't
interfere with the creation of G? In this case H might conceptually be a
precursor, but the definition would not detect it. It is possible that any such
Θ would have a sufficiently large description complexity penalty that it doesn't
matter. On the other hand, if Θ is unconditionally Knightian uncertain about H
creating G then

2Vanessa Kosoy7mo

A question that often comes up in discussion of IRL: are agency and values
purely behavioral concepts, or do they depend on how the system produces its
behavior? The cartesian measure of agency I proposed
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=ovBmi2QFikE6CRWtj]
seems purely behavioral, since it only depends on the policy. The physicalist
version seems less so since it depends on the source code, but this difference
might be minor: this role of the source is merely telling the agent "where" it
is in the universe. However, on closer examination, the physicalist g is far
from purely behaviorist, and this is true even for cartesian Turing RL. Indeed,
the policy describes not only the agent's interaction with the actual
environment but also its interaction with the "envelope" computer. In a sense,
the policy can be said to reflect the agent's "conscious thoughts".
This means that specifying an agent requires not only specifying its source code
but also the "envelope semantics" C (possibly we also need to penalize for the
complexity of C in the definition of g). Identifying that an agent exists
requires not only that its source code is running, but also, at least that its
history h is C-consistent with the α∈2^Γ variable of the bridge transform. That
is, for any y∈α we must have dCy for some destiny d⊐h. In other words, we want
any computation the agent ostensibly runs on the envelope to be one that is
physically manifest (it might be this condition isn't sufficiently strong, since
it doesn't seem to establish a causal relation between the manifesting and the
agent's observations, but it's at least necessary).
Notice also that the computational power of the envelope implied by C becomes
another characteristic of the agent's intelligence, together with g as a
function of the cost of computational resources. It might be useful to come up
with natural ways to quantify this power.

2ViktoriaMalyasova7mo

Can you please explain how this does not match the definition? I don't yet
understand all the math, but intuitively, if H creates G / doesn't interfere
with the creation of G, then if H instead followed policy "do not create G/ do
interfere with the creation of G", then G's code wouldn't run?
Can you please give an example of a precursor that does match the definition?

2Vanessa Kosoy7mo

The problem is that if Θ implies that H creates G but you consider a
counterfactual in which H doesn't create G then you get an inconsistent
hypothesis i.e. a HUC which contains only 0. It is not clear what to do with
that. In other words, the usual way of defining counterfactuals in IB (I
tentatively named it "hard counterfactuals") only makes sense when the condition
you're counterfactualizing on is something you have Knightian uncertainty about
(which seems safe to assume if this condition is about your own future action
but not safe to assume in general). In a child post
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=fdeMdyAdTfFN8Rs7N]
I suggested solving this by defining "soft counterfactuals" where you consider
coarsenings of Θ in addition to Θ itself.

2Vanessa Kosoy9mo

Here's a video [https://www.youtube.com/watch?v=24vIJDBSNRI] of a talk I gave
about PreDCA.

2Vanessa Kosoy1y

Two more remarks.
USER DETECTION
It can be useful to identify and assist specifically the user rather than e.g.
any human that ever lived (and maybe some hominids). For this purpose I propose
the following method. It also strengthens the protocol by relieving some
pressure from other classification criteria.
Given two agents G and H, we can ask which points on G's timeline are in the
causal past of which points of H's timeline. To answer this, consider the
counterfactual in which G takes a random action (or sequence of actions) at some
point (or interval) on G's timeline, and measure the mutual information between
this action(s) and H's observations at some interval on H's timeline.
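
As a toy illustration of this test, the sketch below estimates the mutual information between a random action and a downstream observation from samples. Everything here is an assumption made for the example: the action is a single uniformly random bit, H's observation is that bit passed through a noisy channel, and the plug-in estimator stands in for whatever the infra-Bayesian counterfactual would actually provide.

```python
import math
import random
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate (in bits) of I(A; O) from a list of (action, observation) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_o = Counter(o for _, o in pairs)
    return sum(
        (c / n) * math.log2(c * n / (count_a[a] * count_o[o]))
        for (a, o), c in joint.items()
    )


def sample(channel_noise: float, n: int, seed: int = 0):
    """G takes a uniformly random bit; H observes it, flipped with probability `channel_noise`."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        action = rng.randrange(2)
        flipped = rng.random() < channel_noise
        pairs.append((action, action ^ int(flipped)))
    return pairs


mi_connected = mutual_information(sample(0.0, 10000))     # H sees G's action exactly
mi_disconnected = mutual_information(sample(0.5, 10000))  # H's observation ignores G's action
```

A value near one bit indicates that the observation interval lies in the causal future of the action, while a value near zero indicates no detectable influence.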
Using this, we can effectively construct a future "causal cone" emanating from
the AI's origin, and also a past causal cone emanating from some time t on the
AI's timeline. Then, "nearby" agents will meet the intersection of these cones
for low values of t whereas "faraway" agents will only meet it for high values
of t or not at all. To first approximation, the user would be the "nearest"
precursor[1] agent i.e. the one meeting the intersection for the minimal t.
More precisely, we expect the user's observations to have nearly maximal mutual
information with the AI's actions: the user can e.g. see every symbol the AI
outputs to the display. However, the other direction is less clear: can the AI's
sensors measure every nerve signal emanating from the user's brain? To address
this, we can fix t to a value s.t. we expect only the user to meet the
intersection of cones, and have the AI select the agent which meets this
intersection for the highest mutual information threshold.
This probably does not make the detection of malign agents redundant, since
AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make
a malign agent the user.
MORE ON COUNTERFACTUALS
In the parent post I suggested "instead of examining only Θ we also examine
coarsenings of Θ which a

2Vanessa Kosoy7mo

CAUSALITY IN IBP
There seems to be an even more elegant way to define causal relationships
between agents, or more generally between programs. Starting from a hypothesis
Θ∈□(Γ×Φ), for Γ=Σ^R, we consider its bridge transform B∈□(Γ×2^Γ×Φ). Given some
subset of programs Q⊆R we can define Δ:=Σ^Q, then project B to BΔ∈□(Γ×2^Δ)[1]. We
can then take the bridge transform again to get some C∈□(Γ×2^Γ×2^Δ). The 2^Γ factor now
tells us which programs causally affect the manifestation of programs in Q.
Notice that by Proposition 2.8 in the IBP article, when Q=R we just get all
programs that are running, which makes sense.
AGREEMENT RULES OUT MESA-OPTIMIZATION
The version of PreDCA without any explicit malign hypothesis filtering might be
immune to malign hypotheses, and here is why. It seems plausible that IBP admits
an agreement theorem (analogous to Aumann's) which informally amounts to the
following: Given two agents Alice and Bobcat that (i) share the same physical
universe, (ii) have a sufficiently tight causal relationship (each can see what
the other sees), (iii) have unprivileged locations inside the physical universe,
(iv) start from similar/compatible priors and (v) [maybe needed?] similar
utility functions, they converge to similar/compatible beliefs, regardless of
the complexity of translation between their subjective viewpoints. This is
plausible because (i) as opposed to the cartesian framework, different bridge
rules don't lead to different probabilities and (ii) if Bobcat considers a
simulation hypothesis plausible, and the simulation is sufficiently detailed to
fool it indefinitely, then the simulation contains a detailed simulation of
Alice and hence Alice must also consider this to be a plausible hypothesis.
If the agreement conjecture is true, then the AI will converge to hypotheses
that all contain the user, in a causal relationship with the AI that affirms
them as the user. Moreover, those hypotheses will be compatible with the user's
own posterior (i.e. the differe

1Martín Soto4mo

Hi Vanessa! Thanks again for your previous answers. I've got one further
concern.
Are all mesa-optimizers really only acausal attackers?
I think mesa-optimizers don't need to be purely contained in a hypothesis
(rendering them acausal attackers), but can be made up of a part of the
hypotheses-updating procedures (maybe this is obvious and you already considered
it).
Of course, since the only way to change the AGI's actions is by changing its
hypotheses, even these mesa-optimizers will have to alter hypothesis selection.
But their whole running program doesn't need to be captured inside any
hypothesis (which would be easier for classifying acausal attackers away).
That is, if we don't think about how the AGI updates its hypotheses, and just
consider them magically updating (without any intermediate computations), then
of course, the only mesa-optimizers will be inside hypotheses. If we actually
think about these computations and consider a brute-force search over all
hypotheses, then again they will only be found inside hypotheses, since the
search algorithm itself is too simple and provides no further room for storing a
subagent (even if the mesa-optimizer somehow takes advantage of the details of
the search). But if more realistically our AGI employs more complex heuristics
to ever-better approximate optimal hypotheses update, mesa-optimizers can be
partially or completely encoded in those (put another way, those non-optimal
methods can fail / be exploited). This failure could be seen as a capabilities
failure (in the trivial sense that it failed to correctly approximate perfect
search), but I think it's better understood as an alignment failure.
The way I see PreDCA (and this might be where I'm wrong) is as an "outer
top-level protocol" which we can fit around any superintelligence of arbitrary
architecture. That is, the superintelligence will only have to carry out the
hypotheses update (plus some trivial calculations over hypotheses to find the
best

3Vanessa Kosoy4mo

First, no, the AGI is not going to "employ complex heuristics to ever-better
approximate optimal hypotheses update". The AGI is going to be based on an
algorithm which, as a mathematical fact (if not proved then at least
conjectured), converges to the correct hypothesis with high probability. Just
like we can prove that e.g. SVMs converge to the optimal hypothesis in the
respective class, or that particular RL algorithms for small MDPs converge to
the correct hypothesis (assuming realizability).
Second, there's the issue of non-cartesian attacks ("hacking the computer").
Assuming that the core computing unit is not powerful enough to mount a
non-cartesian attack on its own, such attacks can arguably be regarded as
detrimental side-effects of running computations on the envelope. My hope is
that we can shape the prior about such side-effects in some informed way (e.g.
the vast majority of programs won't hack the computer) s.t. we still have
approximate learnability (i.e. the system is not too afraid to run computations)
without misspecification (i.e. the system is not overconfident about the safety
of running computations). The more effort we put into hardening the system, the
easier it should be to find such a sweet spot.
Third, I hope that the agreement solution will completely rule out any
undesirable hypothesis, because we will have an actual theorem that guarantees
it. What are the exact assumptions going to be and what needs to be done to make
sure these assumptions hold is work for the future, ofc.

2Vanessa Kosoy1y

Some additional thoughts.
NON-CARTESIAN DAEMONS
[https://www.lesswrong.com/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda#Taming_daemons]
These are notoriously difficult to deal with. The only methods I know that are
applicable to other protocols are homomorphic cryptography and quantilization of
envelope (external computer) actions. But, in this protocol, they are dealt with
the same as Cartesian daemons! At least if we assume a non-Cartesian attack
requires an envelope action, the malign hypotheses which are would-be sources of
such actions are discarded without giving an opportunity for attack.
WEAKNESSES
My main concerns with this approach are:
* The possibility of major conceptual holes in the definition of precursors.
More informal analysis can help, but ultimately mathematical research in
infra-Bayesian physicalism in general and infra-Bayesian
cartesian/physicalist multi-agent
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=uZ5xq73xmZSTSZN33]
interactions in particular is required to gain sufficient confidence.
* The feasibility of a good enough classifier. At present, I don't have a
concrete plan for attacking this, as it requires inputs from outside of
computer science.
* Inherent "incorrigibility": once the AI becomes sufficiently confident that
it correctly detected and classified its precursors, its plans won't defer to
the users any more than the resulting utility function demands. On the other
hand, I think the concept of corrigibility is underspecified
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=5Rxgkzqr8XsBwcEQB]
so much that I'm not sure it is solved (rather than dissolved) even in the
Book
[https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion?commentId=PYHHJkHcS55ekmWEE].
Moreover, the concern can be ameliorated by sufficiently powerful
interpretability tools. It

3Vanessa Kosoy1y

There's a class of AI risk mitigation strategies which relies on the users to
perform the pivotal act using tools created by AI (e.g. nanosystems). These
strategies are especially appealing if we want to avoid human models. Here is a
concrete alignment protocol for these strategies, closely related to AQD
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=h3Ww6nyt9fpj7BLyo],
which we call autocalibrating quantilized RL (AQRL).
First, suppose that we are able to formulate the task as episodic RL with a
formally specified reward function. The reward function is necessarily only a
proxy for our true goal, since it doesn't contain terms such as "oh btw don't
kill people while you're building the nanosystem". However, suppose the task is
s.t. accomplishing it in the intended way (without Goodharting or causing
catastrophic side effects) is easier than performing any attack. We will call
this the "relative difficulty assumption" (RDA). Then, there exists a value for
the quantilization parameter s.t. quantilized RL performs the task in the
intended way.
We might not know how to set the quantilization parameter on our own, but we can
define a performance goal for the task (in terms of expected total reward) s.t.
the RDA holds. This leads to algorithms which gradually tune the quantilization
parameter until the performance goal is met, while maintaining a proper balance
between safety and sample complexity. Here it is important to keep track of
epistemic vs. aleatoric uncertainty: the performance goal is the expectation of
total reward relative to aleatoric uncertainty (i.e. the stochasticity of a
given hypothesis), whereas the safety goal is a bound on the expected cost of
overshooting the optimal quantilization parameter relative to both aleatoric
and epistemic uncertainty (i.e. uncertainty between different hypotheses). This
secures the system against malign hypotheses that are trying to cause an
overshoot.
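
The idea of tuning the quantilization parameter until a performance goal is met can be sketched in a heavily simplified form. The code below is not AQRL: it assumes a finite set of equally likely base-policy samples with known proxy rewards, it ignores epistemic uncertainty and the safety bound entirely, and the names `quantilize`, `expected_reward` and `tune_q` are hypothetical.

```python
import math
from typing import List, Sequence


def quantilize(proxy_rewards: Sequence[float], q: float) -> List[int]:
    """Indices of the top-q fraction of base-policy samples, ranked by proxy reward.

    A quantilizer acts by picking uniformly at random among these indices.
    """
    k = max(1, math.ceil(q * len(proxy_rewards)))
    ranked = sorted(range(len(proxy_rewards)),
                    key=lambda i: proxy_rewards[i], reverse=True)
    return ranked[:k]


def expected_reward(proxy_rewards: Sequence[float], q: float) -> float:
    """Expected proxy reward of the q-quantilizer."""
    chosen = quantilize(proxy_rewards, q)
    return sum(proxy_rewards[i] for i in chosen) / len(chosen)


def tune_q(proxy_rewards: Sequence[float], goal: float, shrink: float = 0.5) -> float:
    """Start from the base policy (q = 1) and shrink q only until the goal is met."""
    q = 1.0
    while expected_reward(proxy_rewards, q) < goal and q * len(proxy_rewards) > 1:
        q *= shrink
    return q


rewards = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
```

Stopping at the first q that meets the goal keeps the policy as close to the base distribution as the goal allows, which is the sense in which overshooting is avoided in this toy version.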
Notice the hardening

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for "malign AI is attacking", but that still leaves us open to surprise takeovers in which there is no chance to press the button.

Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a "groundhog day" scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than...

I think this would make a good top-level post. I have the feeling I’ll want to
link to it later.

1Vanessa Kosoy2y

I retracted part of that, see the edit.

1Vanessa Kosoy2y

More observations about this attack vector ("attack from counterfactuals"). I
focus on "amplifying by subjective time".
* The harder the takeoff the more dangerous this attack vector: During every
simulation cycle, ability to defend against simulated malign AI depends on
the power of the defense system in the beginning of the cycle[1]. On the
other hand, the capability of the attacker depends on its power in the end of
the cycle. Therefore, if power grows very fast this is bad news for the
defender. On the other hand, if power grows very slowly, the defender wins
because defense is probably easier.
* Inner control of anchor makes system safer: Given a fixed objective time
anchor, the more times the simulation goes through a cycle, the higher its
exposure to the threat, since improving the defense system takes time even if
you have all the knowledge. Therefore, we want to give the simulated user the
ability to move the objective time anchor forward in time, in order to
benefit from improvements in the defense system.
* Additional information about the external world makes system safer: Suppose
our subjective probability of a malign AI successfully attacking during a
single simulation cycle is 0.1%. Then performing 1000 cycles superficially
yields a failure probability of about 63%. However, this is only the case if
attacks on different cycles are independent random variables. If the AI knows
as much about the possibility of such an attack as we do, this is a justified
assumption. However, if the AI knows more, then, from our subjective
perspective, these variables are positively correlated. In particular, if the
AI knows with certainty whether a successful attack would occur or not the
failure probability over 1000 cycles goes back to 0.1%. Therefore, it is
beneficial to give the AI extra information about the world, just so it can
estimate the probability of attack with more confidence.
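
The arithmetic in the last bullet can be checked directly. The sketch below contrasts the two extreme cases, independent cycles and perfectly correlated cycles; the function names are hypothetical. With a 0.1% per-cycle attack probability and independent cycles, the chance of surviving all 1000 cycles is about 37%, so the chance of at least one successful attack is about 63%.

```python
def failure_probability_independent(p: float, cycles: int) -> float:
    """Probability of at least one successful attack when cycles are independent."""
    return 1.0 - (1.0 - p) ** cycles


def failure_probability_fully_correlated(p: float, cycles: int) -> float:
    """If one fact settles every cycle at once, the number of cycles no longer matters."""
    return p


p, cycles = 0.001, 1000
independent = failure_probability_independent(p, cycles)
survival = 1.0 - independent
correlated = failure_probability_fully_correlated(p, cycles)
```

Intermediate amounts of knowledge give a failure probability between these two values.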
---------

In the anthropic trilemma
[https://www.lesswrong.com/posts/y7jZ9BLEeuNTzgAE5/the-anthropic-trilemma],
Yudkowsky writes about the thorny problem of understanding subjective
probability in a setting where copying and modifying minds is possible. Here, I
will argue that infra-Bayesianism (IB) leads to the solution.
Consider a population of robots, each of which is a regular RL agent. The
environment produces the observations of the robots, but can also make copies or
delete portions of their memories. If we consider a random robot sampled from
the population, the history they observed will be biased compared to the
"physical" baseline. Indeed, suppose that a particular observation c has the
property that every time a robot makes it, 10 copies of them are created in the
next moment. Then, a random robot will have c much more often in their history
than the physical frequency with which c is encountered, due to the resulting
"selection bias". We call this setting "anthropic RL" (ARL).
The original motivation for IB was non-realizability. But, in ARL, Bayesianism
runs into issues even when the environment is realizable from the "physical"
perspective. For example, we can consider an "anthropic MDP" (AMDP). An AMDP has
finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗).
The output is a string of states instead of a single state, because many copies
of the agent might be instantiated on the next round, each with their own state.
In general, there will be no single Bayesian hypothesis that captures the
distribution over histories that the average robot sees at any given moment of
time (at any given moment of time we sample a robot out of the population and
look at their history). This is because the distributions at different moments
of time are mutually inconsistent.
[EDIT: Actually, given that we don't care about the order of robots, the
signature of the transition kernel should be T:A×S→Δ(ℕ^S)]
The consistency that is violated is exactly the c

1Charlie Steiner2y

Could you expand a little on why you say that no Bayesian hypothesis captures
the distribution over robot-histories at different times? It seems like you can
unroll an AMDP into a "memory MDP" that puts memory information of the robot
into the state, thus allowing Bayesian calculation of the distribution over
states in the memory MDP to capture history information in the AMDP.

1Vanessa Kosoy2y

I'm not sure what you mean by that "unrolling". Can you write a mathematical
definition?
Let's consider a simple example. There are two states: s0 and s1. There is just
one action so we can ignore it. s0 is the initial state. An s0 robot transitions
into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What
will our population look like?
0th step: all robots remember s0
1st step: all robots remember s0s1
2nd step: 1/2 of robots remember s0s1s0 and 1/2 of robots remember s0s1s1
3rd step: 1/3 of robots remember s0s1s0s1, 1/3 of robots remember s0s1s1s0 and
1/3 of robots remember s0s1s1s1
There is no Bayesian hypothesis a robot can have that gives correct predictions
both for step 2 and step 3. Indeed, to be consistent with step 2 we must have
Pr[s0s1s0]=1/2 and Pr[s0s1s1]=1/2. But, to be consistent with step 3 we must have
Pr[s0s1s0]=1/3, Pr[s0s1s1]=2/3.
In other words, there is no Bayesian hypothesis s.t. we can guarantee that a
randomly sampled robot on a sufficiently late time step will have learned this
hypothesis with high probability. The apparent transition probabilities keep
shifting s.t. it might always continue to seem that the world is complicated
enough to prevent our robot from having learned it already.
Or, at least it's not obvious there is such a hypothesis. In this example,
Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do
all probabilities converge fast enough for learning to happen, in general? I
don't know, maybe for finite state spaces it can work. Would definitely be
interesting to check.
[EDIT: actually, in this example there is such a hypothesis but in general there
isn't, see below
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=E58br2mJWbgzQqZhX]]
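
The two-state example can be checked by brute force. The following is only an illustrative sketch (the function names are made up here): it tracks how many robots hold each remembered history and reports the fraction of the population whose history starts with a given prefix.

```python
from collections import Counter
from fractions import Fraction

# Deterministic copying rule from the example:
# an s0 robot becomes one s1 robot; an s1 robot becomes one s0 robot and one s1 robot.
SUCCESSORS = {"s0": ["s1"], "s1": ["s0", "s1"]}


def step(population: Counter) -> Counter:
    """Advance every robot by one time step, extending its remembered history."""
    new_population = Counter()
    for history, count in population.items():
        for next_state in SUCCESSORS[history[-1]]:
            new_population[history + (next_state,)] += count
    return new_population


def population_at(n_steps: int) -> Counter:
    """Population (history -> number of robots) after n_steps, starting from one s0 robot."""
    population = Counter({("s0",): 1})
    for _ in range(n_steps):
        population = step(population)
    return population


def prefix_fractions(population: Counter, prefix_len: int) -> dict:
    """Fraction of robots whose history begins with each prefix of the given length."""
    total = sum(population.values())
    by_prefix = Counter()
    for history, count in population.items():
        by_prefix[history[:prefix_len]] += count
    return {prefix: Fraction(count, total) for prefix, count in by_prefix.items()}


if __name__ == "__main__":
    for n in (2, 3, 10, 20):
        fractions = prefix_fractions(population_at(n), 3)
        ratio = fractions[("s0", "s1", "s1")] / fractions[("s0", "s1", "s0")]
        print(n, dict(fractions), float(ratio))
```

The apparent probability of the same length-3 prefix differs between step 2 and step 3, and the ratio of the two prefix frequencies follows consecutive Fibonacci numbers.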

1Charlie Steiner2y

Great example. At least for the purposes of explaining what I mean :) The memory
AMDP would just replace the states s0, s1 with the memory
states [s0], [s1], [s0,s0], [s0,s1], etc. The action takes a robot in [s0] to
memory state [s0,s1], and a robot in [s0,s1] to one robot in [s0,s1,s0] and
another in [s0,s1,s1].
(Skip this paragraph unless the specifics of what's going on aren't obvious:
given a transition distribution $P(s'^* \mid s, \pi)$ ($P$ being the distribution over sets
of states $s'^*$ given starting state $s$ and policy $\pi$), we can define the memory
transition distribution $P(s'^*_m \mid s_m, \pi)$ given policy $\pi$ and starting "memory
state" $s_m \in S^*$ (Note that this star actually does mean finite sequences, sorry for
notational ugliness). First we plug the last element of $s_m$ into the transition
distribution as the current state. Then for each $s'^*$ in the domain, for each
element in $s'^*$ we concatenate that element onto the end of $s_m$ and collect
these $s'_m$ into a set $s'^*_m$, which is assigned the same probability $P(s'^*)$.)
So now at time t=2, if you sample a robot, the probability that its state begins
with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability
changes to 0.66. This is the same result as for the regular MDP, it's just that
we've turned a question about the history of agents, which may be ill-defined,
into a question about which states agents are in.
I'm still confused about what you mean by "Bayesian hypothesis" though. Do you
mean a hypothesis that takes the form of a non-anthropic MDP?

1Vanessa Kosoy2y

I'm not quite sure what you are trying to say here, probably my explanation of
the framework was lacking. The robots already remember the history, like in
classical RL. The question about the histories is perfectly well-defined. In
other words, we are already implicitly doing what you described. It's like in
classical RL theory, when you're proving a regret bound or whatever, your
probability space consists of histories.
Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then
any environment can be regarded as an MDP (whose states are histories). That is,
I'm talking about hypotheses which conform to the classical "cybernetic agent
model". If you wish, we can call it "Bayesian cybernetic hypothesis".
Also, I want to clarify something I was myself confused about in the previous
comment. For an anthropic Markov chain (when there is only one action) with a
finite number of states, we can give a Bayesian cybernetic description, but for
a general anthropic MDP we cannot even if the number of states is finite.
Indeed, consider some $T: S \to \Delta(\mathbb{N}^S)$. We can take its expected value to get
$\mathrm{E}T: S \to \mathbb{R}^S_+$. Assuming the chain is communicating, $\mathrm{E}T$ is an irreducible
non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar
maximal eigenvector $\eta \in \mathbb{R}^S_+$. We then get the subjective transition kernel:
$$ST(t \mid s) = \frac{\mathrm{E}T(t \mid s)\,\eta_t}{\sum_{t' \in S} \mathrm{E}T(t' \mid s)\,\eta_{t'}}$$
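
As a rough numerical illustration of this formula, here is a sketch for the earlier two-state chain (s0 produces one s1; s1 produces one s0 and one s1). It is written under stated assumptions: the matrix is indexed as `M[s][t] = ET(t|s)`, η is taken to satisfy Σ_t ET(t|s) η_t = λ η_s, and plain power iteration is used, which additionally requires the chain to be aperiodic. The helper names are hypothetical.

```python
def perron_vector(M, iterations=200):
    """Approximate the maximal eigenvector of a non-negative matrix by power iteration.

    Assumes M is irreducible and aperiodic, so that the iteration converges.
    """
    n = len(M)
    eta = [1.0] * n
    for _ in range(iterations):
        new_eta = [sum(M[s][t] * eta[t] for t in range(n)) for s in range(n)]
        norm = sum(new_eta)
        eta = [x / norm for x in new_eta]
    return eta


def subjective_kernel(M):
    """ST(t|s) = ET(t|s) * eta_t / sum_t' ET(t'|s) * eta_t', with M[s][t] = ET(t|s)."""
    eta = perron_vector(M)
    n = len(M)
    kernel = []
    for s in range(n):
        weights = [M[s][t] * eta[t] for t in range(n)]
        total = sum(weights)
        kernel.append([w / total for w in weights])
    return kernel


if __name__ == "__main__":
    # Expected number of successors: row 0 is s0, row 1 is s1.
    ET = [[0, 1],
          [1, 1]]
    for row in subjective_kernel(ET):
        print(row)
```

For this chain the s1 row comes out with the two entries in golden-ratio proportion, matching the limiting ratio mentioned in the earlier comment.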
Now, consider the following example of an AMDP. There are three actions
A:={a,b,c} and two states S:={s0,s1}. When we apply a to an s0 robot, it creates
two s0 robots, whereas when we apply a to an s1 robot, it leaves one s1 robot.
When we apply b to an s1 robot, it creates two s1 robots, whereas when we apply
b to an s0 robot, it leaves one s0 robot. When we apply c to any robot, it
results in one robot whose state is s0 with probability 1/2 and s1 with
probability 1/2.
Consider the following two policies. πa takes the sequence of actions cacaca…
and πb takes the sequence of actions cbcbcb…. A population that follo

1Charlie Steiner2y

Ah, okay, I see what you mean. Like how preferences are divisible into "selfish"
and "worldly" components, where the selfish component is what's impacted by a
future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to "worldly" and just
sum the reward calculated at individual timesteps, or analogous to "selfish" and
calculated by taking the limit of the subjective distribution over parts of the
history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn't understand what you
were getting at, so I was concerned that something unrealistic was going on. For
example, if you assume that the agent can remember its history, how can you
possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you're talking about
the subjective probability of a state given a policy and a timestep, but if the
agents know their histories there is no actual agent in the information-state
that corresponds to having those probabilities. In an MDP the agents just have
probabilities over transitions - so maybe a clearer example is an agent that
copies itself if it wins the lottery having a larger subjective transition
probability of going from gambling to winning. (i.e. states are losing and
winning, actions are gamble and copy, the policy is to gamble until you win and
then copy).

1Vanessa Kosoy2y

AMDP is only a toy model that distills the core difficulty into more or less the
simplest non-trivial framework. The rewards are "selfish": there is a reward
function r:(S×A)∗→R which allows assigning utilities to histories by time
discounted summation, and we consider the expected utility of a random robot
sampled from a late population. And, there is no memory wiping. To describe
memory wiping we indeed need to do the "unrolling" you suggested. (Notice that
from the cybernetic model POV, the history is only the remembered history.)
For a more complete framework, we can use an ontology chain
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=SBPzgAZgFFxtL9E64],
but (i) instead of A×O labels use A×M labels, where M is the set of possible
memory states (a policy is then described by π:M→A), to allow for agents that
don't fully trust their memory (ii) consider another chain with a bigger state
space S′ plus a mapping p:S′→ℕ^S s.t. the transition kernels are compatible.
Here, the semantics of p(s) is: the multiset of ontological states resulting
from interpreting the physical state s by taking the viewpoints of different
agents s contains.
I didn't understand "no actual agent in the information-state that corresponds
to having those probabilities". What does it mean to have an agent in the
information-state?

1Charlie Steiner2y

Nevermind, I think I was just looking at it with the wrong class of reward
function in mind.

3Vanessa Kosoy10mo

Infra-Bayesian physicalism
[https://www.alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized]
is an interesting example in favor of the thesis that the more qualitatively
capable an agent is, the less corrigible it is. (a.k.a. "corrigibility is
anti-natural to consequentialist reasoning"). Specifically, alignment protocols
that don't rely on value learning become vastly less safe when combined with
IBP:
* Example 1: Using steep time discount to disincentivize dangerous long-term
plans. For IBP, "steep time discount" just means, predominantly caring about
your source code running with particular short inputs. Such a goal strongly
incentivizes the usual convergent instrumental goals: first take over the
world, then run your source code with whatever inputs you want. IBP agents
just don't have time discount in the usual sense: a program running late in
physical time is just as good as one running early in physical time.
* Example 2: Debate. This protocol relies on a zero-sum game between two AIs.
But, the monotonicity principle rules out the possibility of zero-sum! (If L
and −L are both monotonic loss functions then L is a constant). So, in a
"debate" between IBP agents, they cooperate to take over the world and then
run the source code of each debater with the input "I won the debate".
* Example 3: Forecasting/imitation (an IDA in particular). For an IBP agent,
the incentivized strategy is: take over the world, then run yourself with
inputs showing you making perfect forecasts.
The conclusion seems to be, it is counterproductive to use IBP to solve the
acausal attack problem for most protocols. Instead, you need to do PreDCA
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/shortform?commentId=vKw6DB9crncovPxED]
or something similar. And, if acausal attack is a serious problem, then
approaches that don't do value learning might be doomed.

2Vanessa Kosoy2mo

The following was written by me during the "Finding the Right Abstractions for
healthy systems" research workshop, hosted by Topos Institute in January 2023.
However, I invented the idea before.
Here's an elegant diagrammatic notation for constructing new infrakernels out of
given infrakernels. There is probably some natural category-theoretic way to
think about it, but at present I don't know what it is.
By “infrakernel” we will mean a continuous mapping of the form X→□Y, where X and
Y are compact Polish spaces and □Y is the space of credal sets (i.e. closed
convex sets of probability distributions) over Y.
SYNTAX
* The diagram consists of child vertices, parent vertices, squiggly lines,
arrows, dashed arrows and slashes.
* There can be solid arrows incoming into the diagram. Each such arrow a is
labeled by a compact Polish space D(a) and ends on a parent vertex t(a). And,
s(a)=⊥ (i.e. the arrow has no source vertex).
* There can be dashed and solid arrows between vertices. Each such arrow a
starts from a child vertex s(a) and ends on a parent vertex t(a). We require
that P(s(a))≠t(a) (i.e. they should not be also connected by a squiggly
line).
* There are two types of vertices: parent vertices (denoted by a letter) and
child vertices (denoted by a letter or number in a circle).
* Each child vertex v is labeled by a compact Polish space D(v) and connected
(by a squiggly line) to a unique parent vertex P(v). It may or may not be
crossed-out by a slash.
* Each parent vertex p is labeled by an infrakernel K_p with source S_1×…×S_k
and target T_1×…×T_l where each S_i corresponds to a solid arrow a with
t(a)=p and each T_j is D(v) for some child vertex v with P(v)=p. We can also
add squares with numbers where solid arrows end to keep track of the
correspondence between the arguments of Kp and the arrows.
* If s(a)=⊥ then the corresponding Si is D(a).
* If s(a)=v≠⊥ then the corresponding Si is D(v).

2Vanessa Kosoy8mo

Master post for ideas about infra-Bayesian physicalism
[https://www.alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized].
Other relevant posts:
* Incorrigibility in IBP
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=S6owhpnREkXg8Wfhz]
* PreDCA
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=vKw6DB9crncovPxED]
alignment protocol

2Vanessa Kosoy2mo

Up to light editing, the following was written by me during the "Finding the
Right Abstractions for healthy systems" research workshop, hosted by Topos
Institute in January 2023. However, I invented the idea before.
In order to allow R (the set of programs) to be infinite in IBP, we need to
define the bridge transform for infinite Γ. At first, it might seem Γ can be
allowed to be any compact Polish space, and the bridge transform should only
depend on the topology on Γ, but that runs into problems. Instead, the right
structure on Γ for defining the bridge transform seems to be that of a
"profinite field space": a category I came up with that I haven't seen in the
literature so far.
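The category PFS of profinite field spaces is defined as follows. An object $F$ of
PFS is a set $\mathrm{ind}(F)$ and a family of finite sets $\{F_\alpha\}_{\alpha \in \mathrm{ind}(F)}$. We denote
$\mathrm{Tot}(F) := \prod_\alpha F_\alpha$. Given $F$ and $G$ objects of PFS, a morphism from $F$ to $G$ is a mapping
$f: \mathrm{Tot}(F) \to \mathrm{Tot}(G)$ such that there exists $R \subseteq \mathrm{ind}(F) \times \mathrm{ind}(G)$ with the following
properties:
* For any $\alpha \in \mathrm{ind}(F)$, the set $R(\alpha) := \{\beta \in \mathrm{ind}(G) \mid (\alpha,\beta) \in R\}$ is finite.
* For any $\beta \in \mathrm{ind}(G)$, the set $R^{-1}(\beta) := \{\alpha \in \mathrm{ind}(F) \mid (\alpha,\beta) \in R\}$ is finite.
* For any $\beta \in \mathrm{ind}(G)$, there exists a mapping $f_\beta: \prod_{\alpha \in R^{-1}(\beta)} F_\alpha \to G_\beta$ s.t. for any
$x \in \mathrm{Tot}(F)$, $f(x)_\beta = f_\beta(\mathrm{pr}^R_\beta(x))$ where $\mathrm{pr}^R_\beta: \mathrm{Tot}(F) \to \prod_{\alpha \in R^{-1}(\beta)} F_\alpha$ is the projection
mapping.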
The composition of PFS morphisms is just the composition of mappings.
It is easy to see that every PFS morphism is a continuous mapping in the product
topology, but the converse is false. However, the converse is true for objects
with finite ind (i.e. for such objects any mapping is a morphism). Hence, an
object F in PFS can be thought of as Tot(F) equipped with additional structure
that is stronger than the topology but weaker than the factorization into Fα.
The name "field space" is inspired by the following observation. Given F an
object of PFS, there is a natural condition we can impose on a Borel probability
distribution on Tot(F) which makes it a “Markov random field
[https://

2Vanessa Kosoy1y

Infradistributions admit an information-theoretic quantity that doesn't exist in
classical theory. Namely, it's a quantity that measures how many bits of
Knightian uncertainty an infradistribution has. We define it as follows:
Let $X$ be a finite set and $\Theta$ a crisp infradistribution (credal set) on $X$, i.e. a
closed convex subset of $\Delta X$. Then, imagine someone trying to communicate a
message by choosing a distribution out of $\Theta$. Formally, let $Y$ be any other finite
set (space of messages), $\theta \in \Delta Y$ (prior over messages) and $K: Y \to \Theta$ (communication
protocol). Consider the distribution $\eta := \theta \ltimes K \in \Delta(Y \times X)$. Then, the information
capacity of the protocol is the mutual information between the projection on $Y$
and the projection on $X$ according to $\eta$, i.e. $I_\eta(\mathrm{pr}_X; \mathrm{pr}_Y)$. The "Knightian
entropy" of $\Theta$ is now defined to be the maximum of $I_\eta(\mathrm{pr}_X; \mathrm{pr}_Y)$ over all choices
of $Y$, $\theta$, $K$. For example, if $\Theta$ is Bayesian then it's 0, whereas if $\Theta = \top_X$, it is
$\ln|X|$.
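
To make the two boundary cases concrete, here is a small sketch that computes the information capacity of a given protocol (θ, K) as a mutual information in nats. It does not perform the maximization over protocols; it only evaluates two hand-picked protocols that illustrate the stated values. The function name and the list-based encoding of θ and K are assumptions made for this example.

```python
import math


def protocol_capacity(theta, K):
    """Mutual information (in nats) between message Y and outcome X under eta = theta ⋉ K.

    theta: list of message probabilities.
    K: list of rows, where K[y] is the distribution over X chosen for message y.
    """
    n_x = len(K[0])
    # Marginal distribution over X.
    p_x = [sum(theta[y] * K[y][x] for y in range(len(theta))) for x in range(n_x)]
    info = 0.0
    for y, p_y in enumerate(theta):
        for x in range(n_x):
            joint = p_y * K[y][x]
            if joint > 0:
                info += joint * math.log(joint / (p_y * p_x[x]))
    return info


if __name__ == "__main__":
    n = 4
    uniform = [1.0 / n] * n

    # Theta = top_X: any distribution is allowed, so the sender can use point masses.
    point_masses = [[1.0 if x == y else 0.0 for x in range(n)] for y in range(n)]
    print(protocol_capacity(uniform, point_masses), math.log(n))

    # Bayesian Theta = {mu}: every message must map to the same distribution mu.
    mu = [0.1, 0.2, 0.3, 0.4]
    print(protocol_capacity(uniform, [mu] * n))
```

Since mutual information never exceeds the entropy of X, which is at most ln|X|, the point-mass protocol attains the maximum for the fully uncertain credal set.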
Here is one application[1] of this concept, orthogonal to infra-Bayesianism
itself. Suppose we model inner alignment by assuming that some portion ϵ of the
prior ζ consists of malign hypotheses. And we want to design e.g. a prediction
algorithm that will converge to good predictions without allowing the malign
hypotheses to attack, using methods like confidence thresholds. Then we can
analyze the following metric for how unsafe the algorithm is.
Let $O$ be the set of observations and $A$ the set of actions (which might be "just"
predictions) of our AI, and for any environment $\tau$ and prior $\xi$, let
$D^\xi_\tau(n) \in \Delta((A \times O)^n)$ be the distribution over histories resulting from our algorithm
starting with prior $\xi$ and interacting with environment $\tau$ for $n$ time steps. We
have $\zeta = \epsilon\mu + (1-\epsilon)\beta$, where $\mu$ is the malign part of the prior and $\beta$ the benign part.
For any $\mu'$, consider $D^{\epsilon\mu' + (1-\epsilon)\beta}_\tau(n)$. The closure of the convex hull of these
distributions for all choices of $\mu'$ ("attacker policy") is some $\Theta^\beta_\tau(n) \subseteq \Delta((A \times O)^n)$.
The maximal Knightian entropy of $\Theta^\beta_\tau(n)$ over all admissible $\tau$ and $\beta$ is cal

2Vanessa Kosoy2y

There is a formal analogy between infra-Bayesian decision theory (IBDT) and
modal updateless decision theory
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670374e61/using-modal-fixed-points-to-formalize-logical-causality]
(MUDT).
Consider a one-shot decision theory setting. There is a set of unobservable
states $S$, a set of actions $A$ and a reward function $r: A \times S \to [0,1]$. An IBDT agent
has some belief $\beta \in \square S$[1], and it chooses the action $a^* := \operatorname{argmax}_{a \in A} E_\beta[\lambda s.\, r(a,s)]$.
We can construct an equivalent scenario, by augmenting this one with a perfect
predictor of the agent (Omega). To do so, define $S' := A \times S$, where the semantics of
$(p,s)$ is "the unobservable state is $s$ and Omega predicts the agent will take
action $p$". We then define $r': A \times S' \to [0,1]$ by $r'(a,p,s) := 1_{a=p}\, r(a,s) + 1_{a \neq p}$ and $\beta' \in \square S'$
by $E_{\beta'}[f] := \min_{p \in A} E_\beta[\lambda s.\, f(p,s)]$ ($\beta'$ is what we call the pullback of $\beta$ to $S'$, i.e.
we have utter Knightian uncertainty about Omega). This is essentially the usual
Nirvana construction.
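
The augmented scenario can be sanity-checked numerically. The sketch below assumes a crisp belief represented by finitely many extreme points (so the infra-expectation is a minimum over those points), encodes rewards as a table `r[a][s]`, and compares the action chosen in the original setting with the one chosen after adding Omega. All names are illustrative.

```python
import random


def infra_expectation(belief, f):
    """Worst-case expectation of f over a crisp belief given by its extreme points."""
    return min(sum(mu[s] * f(s) for s in range(len(mu))) for mu in belief)


def best_action_original(belief, r):
    """a* = argmax_a E_beta[r(a, .)]"""
    return max(range(len(r)), key=lambda a: infra_expectation(belief, lambda s: r[a][s]))


def best_action_with_omega(belief, r):
    """Same choice made in the augmented setting with state (p, s) and reward r'."""
    actions = range(len(r))

    def r_prime(a, p, s):
        # The reward is the original one if Omega predicted correctly, else 1 ("Nirvana").
        return r[a][s] if a == p else 1.0

    def value(a):
        # E_beta'[f] = min over predictions p of E_beta[f(p, .)]
        return min(
            infra_expectation(belief, lambda s, p=p: r_prime(a, p, s)) for p in actions
        )

    return max(actions, key=value)


def random_instance(rng, n_actions=3, n_states=4, n_points=3):
    """Random rewards in [0, 1) and a random crisp belief with n_points extreme points."""
    r = [[rng.random() for _ in range(n_states)] for _ in range(n_actions)]
    belief = []
    for _ in range(n_points):
        weights = [rng.random() for _ in range(n_states)]
        total = sum(weights)
        belief.append([w / total for w in weights])
    return belief, r


if __name__ == "__main__":
    rng = random.Random(0)
    belief, r = random_instance(rng)
    print(best_action_original(belief, r), best_action_with_omega(belief, r))
```

The two choices agree because a wrong prediction yields reward 1, which never lowers the minimum, so the worst case is always attained at p = a.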
The new setup produces the same optimal action as before. However, we can now
give an alternative description of the decision rule.
For any $p \in A$, define $\Omega_p \in \square S'$ by $E_{\Omega_p}[f] := \min_{s \in S} f(p,s)$. That is, $\Omega_p$ is an
infra-Bayesian representation of the belief "Omega will make prediction $p$". For
any $u \in [0,1]$, define $R_u \in \square S'$ by
$$E_{R_u}[f] := \min_{\mu \in \Delta S' :\, E_\mu[r(p,s)] \geq u} E_\mu[f(p,s)]$$
$R_u$ can be interpreted as the belief "assuming Omega is accurate, the expected reward will
be at least $u$".
We will also need to use the order $\preceq$ on $\square X$ defined by: $\phi \preceq \psi$ when
$\forall f \in [0,1]^X: E_\phi[f] \geq E_\psi[f]$. The reversal is needed to make the analogy to logic
intuitive. Indeed, $\phi \preceq \psi$ can be interpreted as "$\phi$ implies $\psi$"[2], the meet operator
$\wedge$ can be interpreted as logical conjunction and the join operator $\vee$ can be
interpreted as logical disjunction.
Claim:
$$a^* = \operatorname{argmax}_{a \in A} \max\{u \in [0,1] \mid \beta' \wedge \Omega_a \preceq R_u\}$$
(Actually I only checked it when we restrict to crisp infradistributions, in
which case ∧ is intersection of sets and ⪯ is set containment, but it's probably
true in general.)
Now, β′∧Ωa⪯Ru

1Vanessa Kosoy1y

Two deterministic toy models for regret bounds of infra-Bayesian bandits. The
lesson seems to be that equalities are much easier to learn than inequalities.
Model 1: Let $A$ be the space of arms, $O$ the space of outcomes, $r: A \times O \to \mathbb{R}$ the reward
function, $X$ and $Y$ vector spaces, $H \subseteq X$ the hypothesis space and $F: A \times O \times H \to Y$ a
function s.t. for any fixed $a \in A$ and $o \in O$, $F(a,o,\cdot): H \to Y$ extends to some linear
operator $T_{a,o}: X \to Y$. The semantics of hypothesis $h \in H$ is defined by the equation
$F(a,o,h) = 0$ (i.e. an outcome $o$ of action $a$ is consistent with hypothesis $h$ iff
this equation holds).
For any $h \in H$ denote by $V(h)$ the reward promised by $h$:
$$V(h) := \max_{a \in A} \min_{o \in O:\, F(a,o,h)=0} r(a,o)$$
Then, there is an algorithm with mistake bound $\dim X$, as follows. On round $n \in \mathbb{N}$,
let $G_n \subseteq H$ be the set of unfalsified hypotheses. Choose $h_n \in G_n$ optimistically, i.e.
$$h_n := \operatorname{argmax}_{h \in G_n} V(h)$$
Choose the arm $a_n$ recommended by hypothesis $h_n$. Let $o_n \in O$ be the outcome we
observed, $r_n := r(a_n,o_n)$ the reward we received and $h^* \in H$ the (unknown) true
hypothesis.
If $r_n \geq V(h_n)$ then also $r_n \geq V(h^*)$ (since $h^* \in G_n$ and hence $V(h^*) \leq V(h_n)$) and therefore
$a_n$ wasn't a mistake.
If $r_n < V(h_n)$ then $F(a_n,o_n,h_n) \neq 0$ (if we had $F(a_n,o_n,h_n) = 0$ then the minimization in
the definition of $V(h_n)$ would include $r(a_n,o_n)$). Hence, $h_n \notin G_{n+1} = G_n \cap \ker T_{a_n,o_n}$.
This implies $\dim \operatorname{span}(G_{n+1}) < \dim \operatorname{span}(G_n)$. Obviously this can happen at most $\dim X$
times.
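
Here is a toy instance of Model 1, offered as a sketch rather than a general implementation. The instance is an assumption made for illustration: X = R^d, H is the standard basis, arms are indexed 0..d-1, outcomes are {0, 1}, the reward equals the outcome, and the linear operators are T_{a,0}(x) = x_a and T_{a,1}(x) = 0. Hypothesis e_i then says that arm i can never yield outcome 0. Murphy always returns the worst outcome consistent with the true hypothesis.

```python
def F(arm, outcome, h):
    """F(a, o, h) = T_{a,o}(h), with T_{a,0}(x) = x[a] and T_{a,1}(x) = 0."""
    return h[arm] if outcome == 0 else 0


def guaranteed_reward(arm, h):
    """Worst reward on this arm among outcomes consistent with h (reward = outcome)."""
    return min(o for o in (0, 1) if F(arm, o, h) == 0)


def promised_reward(h, arms):
    """V(h) = max over arms of the worst consistent reward."""
    return max(guaranteed_reward(a, h) for a in arms)


def run_optimistic_learner(d, true_index):
    """Run the optimistic elimination algorithm; return the number of mistakes made."""
    arms = range(d)
    basis = [tuple(1 if j == i else 0 for j in range(d)) for i in range(d)]
    h_true = basis[true_index]
    unfalsified = list(basis)  # G_n
    mistakes = 0
    while True:
        # Optimistic choice of hypothesis, then the arm it recommends.
        h = max(unfalsified, key=lambda g: promised_reward(g, arms))
        arm = max(arms, key=lambda a: guaranteed_reward(a, h))
        # Murphy: worst outcome consistent with the true hypothesis.
        outcome = guaranteed_reward(arm, h_true)
        if outcome >= promised_reward(h, arms):
            return mistakes
        mistakes += 1
        # G_{n+1} = G_n ∩ ker T_{arm, outcome}
        unfalsified = [g for g in unfalsified if F(arm, outcome, g) == 0]


if __name__ == "__main__":
    d = 5
    print([run_optimistic_learner(d, k) for k in range(d)])
```

In this instance each mistake removes exactly one basis vector from the unfalsified set, so the span loses one dimension per mistake and the count never exceeds d.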
Model 2: Let the spaces of arms and hypotheses be
$$A := H := S^d := \{x \in \mathbb{R}^{d+1} \mid \|x\| = 1\}$$
Let the reward $r \in \mathbb{R}$ be the only observable outcome, and the semantics of
hypothesis $h \in S^d$ be $r \geq h \cdot a$. Then, the sample complexity cannot be bounded by a
polynomial of degree that doesn't depend on $d$. This is because Murphy can choose
the strategy of producing reward $1-\epsilon$ whenever $h \cdot a \leq 1-\epsilon$. In this case, whatever
arm you sample, in each round you can only exclude a ball of radius $\approx \sqrt{2\epsilon}$ around
the sampled arm. The number of such balls that fit into the unit sphere is
$\Omega(\epsilon^{-\frac{1}{2}d})$. So, normalized regret below $\epsilon$ cannot be guaranteed in fewer than that
many rounds.

1Vanessa Kosoy1y

One of the postulates of infra-Bayesianism is the maximin decision rule. Given a
crisp infradistribution $\Theta$, it defines the optimal action to be:
$$a^*(\Theta) := \operatorname{argmax}_a \min_{\mu \in \Theta} E_\mu[U(a)]$$
Here $U$ is the utility function.
What if we use a different decision rule? Let $t \in [0,1]$ and consider the decision
rule
$$a^*_t(\Theta) := \operatorname{argmax}_a \left( t \min_{\mu \in \Theta} E_\mu[U(a)] + (1-t) \max_{\mu \in \Theta} E_\mu[U(a)] \right)$$
For $t=1$ we get the usual maximin ("pessimism"), for $t=0$ we get maximax
("optimism") and for other values of $t$ we get something in the middle (which we can
call "t-mism").
It turns out that, in some sense, this new decision rule is actually reducible
to ordinary maximin! Indeed, set
$$\mu^*_t := \operatorname{argmax}_\mu E_\mu[U(a^*_t)]$$
$$\Theta_t := t\Theta + (1-t)\mu^*_t$$
Then we get
$$a^*(\Theta_t) = a^*_t(\Theta)$$
More precisely, any pessimistically optimal action for $\Theta_t$ is t-mistically
optimal for $\Theta$ (the converse need not be true in general, thanks to the arbitrary
choice involved in $\mu^*_t$).
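
The reduction can be checked numerically on small random instances. The sketch below makes some simplifying assumptions: Θ is represented by finitely many extreme points, U(a) is a utility vector over a finite outcome set, and the argmax defining the mixing distribution is taken over those extreme points of Θ. The function names are made up for the example.

```python
import random


def expect(mu, utilities):
    """Expected utility of one action under distribution mu."""
    return sum(p * u for p, u in zip(mu, utilities))


def maximin_value(theta, U, a):
    return min(expect(mu, U[a]) for mu in theta)


def t_mistic_value(theta, U, a, t):
    values = [expect(mu, U[a]) for mu in theta]
    return t * min(values) + (1 - t) * max(values)


def t_mistic_action(theta, U, t):
    return max(range(len(U)), key=lambda a: t_mistic_value(theta, U, a, t))


def reduced_credal_set(theta, U, t):
    """Theta_t = t * Theta + (1 - t) * mu_star, with mu_star maximizing E[U(a_t)]."""
    a_t = t_mistic_action(theta, U, t)
    mu_star = max(theta, key=lambda mu: expect(mu, U[a_t]))
    return [[t * p + (1 - t) * q for p, q in zip(mu, mu_star)] for mu in theta]


def random_instance(rng, n_actions=3, n_outcomes=4, n_points=5):
    """Random utilities and a random credal set given by n_points extreme points."""
    U = [[rng.random() for _ in range(n_outcomes)] for _ in range(n_actions)]
    theta = []
    for _ in range(n_points):
        weights = [rng.random() for _ in range(n_outcomes)]
        total = sum(weights)
        theta.append([w / total for w in weights])
    return theta, U


if __name__ == "__main__":
    rng = random.Random(0)
    theta, U = random_instance(rng)
    t = 0.3
    a_t = t_mistic_action(theta, U, t)
    theta_t = reduced_credal_set(theta, U, t)
    best_pessimistic = max(range(len(U)), key=lambda a: maximin_value(theta_t, U, a))
    print(a_t, best_pessimistic)
```

The check works because, for every action, the maximin value on the reduced set is at most its t-mistic value on the original set, with equality at the t-mistically optimal action.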
To first approximation it means we don't need to consider t-mistic agents since
they are just special cases of "pessimistic" agents. To second approximation, we
need to look at what the transformation of Θ to Θt does to the prior. If we
start with a simplicity prior then the result is still a simplicity prior. If U
has low description complexity and t is not too small then essentially we get
full equivalence between "pessimism" and t-mism. If t is small then we get a
strictly "narrower" prior (for t=0 we are back at ordinary Bayesianism).
However, if U has high description complexity then we get a rather biased
simplicity prior. Maybe the latter sort of prior is worth considering.

1Vanessa Kosoy2y

Infra-Bayesianism can be naturally understood as semantics for a certain
non-classical logic. This promises an elegant synthesis between
deductive/symbolic reasoning and inductive/intuitive reasoning, with several
possible applications. Specifically, here we will explain how this can work for
higher-order logic. There might be holes and/or redundancies in the precise
definitions given here, but I'm quite confident the overall idea is sound.
We will work with homogeneous ultracontributions
[https://www.alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized#Notation]
(HUCs). □X will denote the space of HUCs over X. Given μ∈□X, S(μ)⊆Δ^c X will
denote the corresponding convex set. Given p∈ΔX and μ∈□X, p:μ will mean p∈S(μ).
Given μ,ν∈□X, μ⪯ν will mean S(μ)⊆S(ν).
Syntax
Let $T_\iota$ denote a set which we interpret as the types of individuals (we allow
more than one). We then recursively define the full set of types $T$ by:
* $0 \in T$ (intended meaning: the uninhabited type)
* $1 \in T$ (intended meaning: the one element type)
* If $\alpha \in T_\iota$ then $\alpha \in T$
* If $\alpha,\beta \in T$ then $\alpha+\beta \in T$ (intended meaning: disjoint union)
* If $\alpha,\beta \in T$ then $\alpha \times \beta \in T$ (intended meaning: Cartesian product)
* If $\alpha \in T$ then $(\alpha) \in T$ (intended meaning: predicates with argument of type $\alpha$)
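For each $\alpha,\beta \in T$, there is a set $F^0_{\alpha \to \beta}$ which we interpret as atomic terms of type
$\alpha \to \beta$. We will denote $V^0_\alpha := F^0_{1 \to \alpha}$. Among those we distinguish the logical atomic
terms:
* $\mathrm{pr}_{\alpha\beta} \in F^0_{\alpha \times \beta \to \alpha}$
* $i_{\alpha\beta} \in F^0_{\alpha \to \alpha+\beta}$
* Symbols we will not list explicitly, that correspond to the algebraic
properties of $+$ and $\times$ (commutativity, associativity, distributivity and the
neutrality of 0 and 1). For example, given $\alpha,\beta \in T$ there is a "commutator" of
type $\alpha \times \beta \to \beta \times \alpha$.
* $=_\alpha \in V^0_{(\alpha \times \alpha)}$
* $\mathrm{diag}_\alpha \in F^0_{\alpha \to \alpha \times \alpha}$
* $()_\alpha \in V^0_{((\alpha) \times \alpha)}$ (intended meaning: predicate evaluation)
* $\bot \in V^0_{(1)}$
* $\top \in V^0_{(1)}$
* $\vee_\alpha \in F^0_{(\alpha) \times (\alpha) \to (\alpha)}$
* $\wedge_\alpha \in F^0_{(\alpha) \times (\alpha) \to (\alpha)}$ [EDIT: Actually this doesn't work because, except for finite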
sets, the resulting mapping (see semantics section) is dis

2Vanessa Kosoy2y

Let's also explicitly describe 0th order and 1st order infra-Bayesian logic
(although they should be segments of higher-order).
0-th order
Syntax
Let A be the set of propositional variables. We define the language L:
* Any a∈A is also in L
* ⊥∈L
* ⊤∈L
* Given ϕ,ψ∈L, ϕ∧ψ∈L
* Given ϕ,ψ∈L, ϕ∨ψ∈L
Notice there's no negation or implication. We define the set of judgements
J:=L×L. We write judgements as ϕ⊢ψ ("ψ in the context of ϕ"). A theory is a
subset of J.
Semantics
Given T⊆J, a model of T consists of a compact Polish space X and a mapping
M:L→□X. The latter is required to satisfy:
* M(⊥)=⊥X
* M(⊤)=⊤X
* M(ϕ∧ψ)=M(ϕ)∧M(ψ). Here, we define ∧ of infradistributions as intersection of
the corresponding sets
* M(ϕ∨ψ)=M(ϕ)∨M(ψ). Here, we define ∨ of infradistributions as convex hull of
the corresponding sets
* For any ϕ⊢ψ∈T, M(ϕ)⪯M(ψ)
1-st order
Syntax
We define the language using the usual syntax of 1-st order logic, where the
allowed operators are $\wedge$, $\vee$ and the quantifiers $\forall$ and $\exists$. Variables are labeled by
types from some set $T$. For simplicity, we assume no constants, but it is easy to
introduce them. For any sequence of variables $v = (v_1 \ldots v_n)$, we denote by $L^v$ the set of
formulae whose free variables are a subset of $v_1 \ldots v_n$. We define the set of
judgements $J := \bigcup_v L^v \times L^v$.
Semantics
Given $T \subseteq J$, a model of $T$ consists of
* For every $t \in T$, a compact Polish space $M(t)$
* For every $\phi \in L^v$ where $v_1 \ldots v_n$ have types $t_1 \ldots t_n$, an element $M^v(\phi)$ of $\square X^v$, where
$X^v := \prod_{i=1}^n M(t_i)$
It must satisfy the following:
* $M^v(\bot) = \bot_{X^v}$
* $M^v(\top) = \top_{X^v}$
* $M^v(\phi \wedge \psi) = M^v(\phi) \wedge M^v(\psi)$
* $M^v(\phi \vee \psi) = M^v(\phi) \vee M^v(\psi)$
* Consider variables $u_1 \ldots u_n$ of types $t_1 \ldots t_n$ and variables $v_1 \ldots v_m$ of types $s_1 \ldots s_m$.
Consider also some $\sigma: \{1 \ldots m\} \to \{1 \ldots n\}$ s.t. $s_i = t_{\sigma(i)}$. Given $\phi \in L^v$, we can form the
substitution $\psi := \phi[v_i = u_{\sigma(i)}] \in L^u$. We also have a mapping $f_\sigma: X^u \to X^v$ given by
$f_\sigma(x_1 \ldots x_n) = (x_{\sigma(1)} \ldots x_{\sigma(m)})$. We require $M^u(\psi) = f^*(M^v(\phi))$
* Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection
mapping. We require Mv∖vi(∃vi:ϕ)=pr∗

1Vanessa Kosoy1y

There is a special type of crisp infradistributions that I call "affine
infradistributions": those that, represented as sets, are closed not only under
convex linear combinations but also under affine linear combinations. In other
words, they are intersections between the space of distributions and some closed
affine subspace of the space of signed measures. Conjecture: in 0-th order logic
of affine infradistributions, consistency is polynomial-time decidable (whereas
for classical logic it is ofc NP-hard).
To produce some evidence for the conjecture, let's consider a slightly different
problem. Specifically, introduce a new semantics in which □X is replaced by the
set of linear subspaces of some finite dimensional vector space V. A model M is
required to satisfy:
* M(⊥)=0
* M(⊤)=V
* M(ϕ∧ψ)=M(ϕ)∩M(ψ)
* M(ϕ∨ψ)=M(ϕ)+M(ψ)
* For any ϕ⊢ψ∈T, M(ϕ)⊆M(ψ)
If you wish, this is "non-unitary quantum logic". In this setting, I have a
candidate polynomial-time algorithm for deciding consistency. First, we
transform T into an equivalent theory s.t. all judgments are of the following
forms:
* a=⊥
* a=⊤
* a⊢b
* Pairs of the form c=a∧b, d=a∨b.
Here, a,b,c,d∈A are propositional variables and "ϕ=ψ" is a shorthand for the
pair of judgments ϕ⊢ψ and ψ⊢ϕ.
Second, we make sure that our T also satisfies the following "closure"
properties:
* If a⊢b and b⊢c are in T then so is a⊢c
* If c=a∧b is in T then so are c⊢a and c⊢b
* If c=a∨b is in T then so are a⊢c and b⊢c
* If c=a∧b, d⊢a and d⊢b are in T then so is d⊢c
* If c=a∨b, a⊢d and b⊢d are in T then so is c⊢d
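
The closure step can be sketched as a simple fixed-point computation. The encoding below is an assumption made for illustration: an entailment a⊢b is the pair `(a, b)`, and a meet c=a∧b or join c=a∨b is the triple `(c, a, b)`. Since there are at most |A|² entailments, the loop stops after polynomially many rounds.

```python
def saturate(entailments, meets, joins):
    """Close a set of entailments under the five closure rules listed above."""
    closed = set(entailments)
    changed = True
    while changed:
        new = set()
        # Transitivity: a⊢b and b⊢c give a⊢c.
        for (a, b) in closed:
            for (b2, c) in closed:
                if b == b2:
                    new.add((a, c))
        for (c, a, b) in meets:
            # c=a∧b gives c⊢a and c⊢b.
            new.add((c, a))
            new.add((c, b))
            # d⊢a and d⊢b give d⊢c.
            for (d, x) in closed:
                if x == a and (d, b) in closed:
                    new.add((d, c))
        for (c, a, b) in joins:
            # c=a∨b gives a⊢c and b⊢c.
            new.add((a, c))
            new.add((b, c))
            # a⊢d and b⊢d give c⊢d.
            for (x, d) in closed:
                if x == a and (b, d) in closed:
                    new.add((c, d))
        changed = not new <= closed
        closed |= new
    return closed


if __name__ == "__main__":
    result = saturate({("d", "a"), ("d", "b")}, meets=[("c", "a", "b")], joins=[])
    print(sorted(result))
```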
Third, we assign to each $a \in A$ a real-valued variable $x_a$. Then we construct a
linear program for these variables consisting of the following constraints:
* For any $a \in A$: $0 \leq x_a \leq 1$
* For any $a \vdash b$ in $T$: $x_a \leq x_b$
* For any pair $c = a \wedge b$ and $d = a \vee b$ in $T$: $x_c + x_d = x_a + x_b$
* For any $a = \bot$: $x_a = 0$
* For any $a = \top$: $x_a = 1$
Conjecture: the theory is consistent if and only if the linear program has a
solution. To see why it might be so, notice tha

1Vanessa Kosoy2y

When using infra-Bayesian logic to define a simplicity prior, it is natural to
use "axiom circuits" rather than plain formulae. That is, when we write the
axioms defining our hypothesis, we are allowed to introduce "shorthand" symbols
for repeating terms. This doesn't affect the expressiveness, but it does affect
the description length. Indeed, eliminating all the shorthand symbols can
increase the length exponentially.

1Vanessa Kosoy2y

Instead of introducing all the "algebrator" logical symbols, we can define T as
the quotient by the equivalence relation defined by the algebraic laws. We then
need only two extra logical atomic terms:
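* For any $n \in \mathbb{N}$ and $\sigma \in S_n$ (permutation), denote $n := \sum_{i=1}^n 1$ and require $\sigma^+ \in F_{n \to n}$
* For any $n \in \mathbb{N}$ and $\sigma \in S_n$, $\sigma^\times_\alpha \in F_{\alpha^n \to \alpha^n}$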
However, if we do this then it's not clear whether deciding that an expression
is a well-formed term can be done in polynomial time. Because, to check that the
types match, we need to test the identity of algebraic expressions and opening
all parentheses might result in something exponentially long.

1Vanessa Kosoy2y

Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just
imagine that types are variables over Q, and start from testing the identity of
the types appearing inside parentheses), so we can validate expressions in
randomized polynomial time (and, given standard conjectures, in deterministic
polynomial time as well).
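
A rough sketch of this kind of randomized identity test follows. It departs from the comment in one labeled way: instead of rationals it evaluates modulo a large prime, which is a common variant of the same idea. Types are encoded as nested tuples (an encoding invented for this example), and a predicate type (α) is given a random value keyed on the value of α, so that inner types are compared first.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime; all arithmetic is done modulo PRIME


def evaluate(type_expr, atom_values, pred_values, rng):
    """Evaluate a type expression at a random point.

    type_expr is one of: ("zero",), ("one",), ("atom", name),
    ("sum", s, t), ("prod", s, t), ("pred", s).
    """
    kind = type_expr[0]
    if kind == "zero":
        return 0
    if kind == "one":
        return 1
    if kind == "atom":
        if type_expr[1] not in atom_values:
            atom_values[type_expr[1]] = rng.randrange(PRIME)
        return atom_values[type_expr[1]]
    if kind == "sum":
        left = evaluate(type_expr[1], atom_values, pred_values, rng)
        right = evaluate(type_expr[2], atom_values, pred_values, rng)
        return (left + right) % PRIME
    if kind == "prod":
        left = evaluate(type_expr[1], atom_values, pred_values, rng)
        right = evaluate(type_expr[2], atom_values, pred_values, rng)
        return (left * right) % PRIME
    if kind == "pred":
        # Treat (alpha) as a fresh variable determined by the value of alpha.
        inner = evaluate(type_expr[1], atom_values, pred_values, rng)
        if inner not in pred_values:
            pred_values[inner] = rng.randrange(PRIME)
        return pred_values[inner]
    raise ValueError(f"unknown type constructor: {kind}")


def probably_equal(s, t, trials=20, seed=0):
    """Return False if some random evaluation separates s and t, else True.

    A True answer is only probabilistic; a False answer is certain.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        atom_values, pred_values = {}, {}
        if evaluate(s, atom_values, pred_values, rng) != evaluate(t, atom_values, pred_values, rng):
            return False
    return True


if __name__ == "__main__":
    a, b, c = ("atom", "a"), ("atom", "b"), ("atom", "c")
    lhs = ("prod", a, ("sum", b, c))
    rhs = ("sum", ("prod", a, b), ("prod", a, c))
    print(probably_equal(lhs, rhs))
    print(probably_equal(("prod", a, b), ("sum", a, b)))
```

Evaluation never expands parentheses, so the cost is linear in the size of the expressions per trial.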

Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" we should expect the agent to acquire.

Does mathematics have finite information content?

First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of those new facts are essentially random noise, rather than "meaningful" information?

Is there a difference of principle between "noise" and "meaningful content"? It is not obvious, but the answer is "yes": in algorithmic statistics there is the notion of "sophistication" which measures how much "non-random" information is contained in some data. In our setting, the question can be operationalized as follows: is it possible to have an algorithm A plus an infinite sequence of bits R, s.t. R is random in some formal sense (e.g. Martin-Löf) and A can decide the output of any finite computation if it's also given access to R?

Wikipedia claims
[https://en.wikipedia.org/wiki/Algorithmically_random_sequence#Properties_and_examples_of_Martin-L%C3%B6f_random_sequences]
that every sequence is Turing reducible to a random one, giving a positive
answer to the non-resource-bounded version of any question of this form. There
might be a resource-bounded version of this result as well, but I'm not sure.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it with a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati

I gave a talk on Dialogic Reinforcement Learning at the AI Safety Discussion
Day, and there is a recording
[https://drive.google.com/file/d/1zKs3uOcR32nTMJ5YNOMZkcL7R_mzi2t6/view?usp=sharing].

1Vanessa Kosoy3y

A variant of Dialogic RL with improved corrigibility. Suppose that the AI's
prior allows a small probability for "universe W" whose semantics are, roughly
speaking, "all my assumptions are wrong, need to shut down immediately". In
other words, this is a universe where all our prior shaping is replaced by the
single axiom that shutting down is much higher utility than anything else.
Moreover, we add into the prior the assumption that the formal question "W?" is
understood perfectly by the user even without any annotation. This means that,
whenever the AI assigns a higher-than-threshold probability to the user
answering "yes" if asked "W?" at any uncorrupt point in the future, the AI will
shut down immediately. We should also shape the prior s.t. corrupt futures also
favor shutdown: this is reasonable in itself, but will also ensure that the AI
won't arrive at believing too many futures to be corrupt and thereby avoid the
imperative to shut down in response to a confirmation of W.
Now, this won't help if the user only resolves to confirm W after something
catastrophic already occurred, such as the AI releasing malign subagents into
the wild. But, something of the sort is true for any corrigibility scheme:
corrigibility is about allowing the user to make changes in the AI on eir own
initiative, which can always be too late. This method doesn't ensure safety in
itself, just hardens a system that is supposed to be already close to safe.
It would be nice if we could replace "shutdown" by "undo everything you did and
then shut down" but that gets us into thorny specification issues. Perhaps it's
possible to tackle those issues by one of the approaches to "low impact".

1Vanessa Kosoy3y

Universe W should still be governed by a simplicity prior. This means that
whenever the agent detects a salient pattern that contradicts the assumptions of
its prior shaping, the probability of W increases, leading to shutdown. This
serves as an additional "sanity test" precaution.

A major impediment in applying RL theory to any realistic scenario is that even the control problem^{[1]} is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:

In real life, processes can often be modeled as made of independent co-existing parts. For example, if I need to decide on my exercise routine for the next month and also on my research goals for the next month, the two

Epistemic status: most elements are not new, but the synthesis seems useful.

Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).

Arguably the biggest concern with naive debate^{[1]} is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterarguments" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quanti...
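
For readers who haven't met quantilization before, here is a minimal sketch of a generic quantilizer: rather than picking the single best option by proxy utility, it samples at random from the top q-fraction of options drawn from a base distribution. This is not the AQD protocol itself (whose details are not reproduced here); the function name `quantilize`, the finite list `candidates` standing in for samples from the base distribution, and the parameter name `q` are all assumptions made for illustration.

```python
import math
import random


def quantilize(candidates, proxy_utility, q, rng=None):
    """Sample uniformly from the top q-fraction of `candidates` by proxy utility.

    `candidates` stands in for i.i.d. samples from the base distribution;
    q = 1 reproduces the base distribution, q -> 0 approaches pure optimization.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = rng or random.Random()
    ranked = sorted(candidates, key=proxy_utility, reverse=True)
    keep = max(1, math.ceil(q * len(ranked)))  # always keep at least one option
    return rng.choice(ranked[:keep])
```

The intuition for the debate setting is that if "Lovecraftian" utterances are rare under the base distribution, then for q that is not too small they remain rare among the retained options, whereas a pure maximizer (the q -> 0 limit) could select them systematically.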

I'm not sure this attacks goodharting directly enough. Optimizing a system for
proxy utility moves its state out-of-distribution where proxy utility
generalizes training utility incorrectly. This probably holds for debate
optimized towards intended objectives as much as for more concrete framings with
state and utility.
Dithering across the border of goodharting (of scope of a proxy utility) with
quantilization is actionable, but isn't about defining the border or formulating
legible strategies for what to do about optimization when approaching the
border. For example, one might try for shutdown, interrupt-for-oversight, or
getting-back-inside-the-borders when optimization pushes the system outside,
which is not quantilization. (Getting-back-inside-the-borders might even have
weird-x-risk prevention as a convergent drive, but will oppose corrigibility.
Some version of oversight/amplification might facilitate corrigibility.)
Debate seems more useful for amplification, extrapolating concepts in a way
humans would, in order to become acceptable proxies in wider scopes, so that
more and more debates become non-lovecraftian. This is a different concern from
setting up optimization that works with some fixed proxy concepts as given.

2Vanessa Kosoy1y

I don't understand what you're saying here.
For debate, goodharting means producing an answer which can be defended
successfully in front of the judge, even in the face of an opponent pointing out
all the flaws, but which is nevertheless bad. My assumption here is: it's harder
to produce such an answer than producing a genuinely good (and defensible)
answer. If this assumption holds, then there is a range of quantilization
parameters which yields good answers.
For the question of "what is a good plan to solve AI risk", the assumption seems
solid enough since we're not worried about coming across such deceptive plans on
our own, and it's hard to imagine humans producing one even on purpose. To the
extent our search for plans relies mostly on our ability to evaluate arguments
and find counterarguments, it seems like the difference between the former and
the latter is not great anyway. This argument is especially strong if we use
human debaters as the baseline distribution, although in this case we are
vulnerable to the same competitiveness problem as amplified-imitation, namely
that reliably predicting rich outputs might be infeasible.
For the question of "should we continue changing the quantilization parameter",
the assumption still holds because the debater arguing to stop at the given
point can win by presenting a plan to solve AI risk which is superior to
continuing to change the parameter.

1Vladimir Nesov1y

Goodharting is about what happens in situations where "good" is undefined or
uncertain or contentious, but still gets used for optimization. There are
situations where it's better-defined, and situations where it's ill-defined, and
an anti-goodharting agent strives to optimize only within scope of where it's
better-defined. I took "lovecraftian" as a proxy for situations where it's
ill-defined, and base distribution of quantilization that's intended to oppose
goodharting acts as a quantitative description of where it's taken as
better-defined, so for this purpose base distribution captures non-lovecraftian
situations. Of the options you listed for debate, the distribution from
imitation learning seems OK for this purpose, if amended by some anti-weirdness
filters to exclude debates that can't be reliably judged.
The main issues with anti-goodharting that I see are the difficulty of defining
proxy utility and base distribution, the difficulty of making it corrigible, not
locking in a fixed proxy utility and base distribution, and the question of
what to do about optimization that points out of scope.
My point is that if anti-goodharting and not development of quantilization is
taken as a goal, then calibration of quantilization is not the kind of thing
that helps, it doesn't address the main issues. Like, even for quantilization,
fiddling with base distribution and proxy utility is a more natural framing
that's strictly more general than fiddling with the quantilization parameter. If
we are to pick a single number to improve, why privilege the quantilization
parameter instead of some other parameter that influences base distribution and
proxy utility?
The use of debates for amplification in this framing is for corrigibility part
of anti-goodharting, a way to redefine utility proxy and expand the base
distribution, learning from how the debates at the boundary of the previous base
distribution go. Quantilization seems like a fine building block for this,
sampling

3Vanessa Kosoy1y

The proxy utility in debate is perfectly well-defined: it is the ruling of the
human judge. For the base distribution I also made some concrete proposals
(which certainly might be improvable but are not obviously bad). As to
corrigibility, I think it's an ill-posed concept
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=5Rxgkzqr8XsBwcEQB#romyHyuhq6nPH5uJb].
I'm not sure how you imagine corrigibility in this case: AQD is a series of
discrete "transactions" (debates), and nothing prevents you from modifying the
AI between one and another. Even inside a debate, there is no incentive in the
outer loop to resist modifications, whereas daemons would be impeded by
quantilization. The "out of scope" case is also dodged by quantilization, if I
understand what you mean by "out of scope".
Why is it strictly more general? I don't see it. It seems false, since for
an extreme value of the quantilization parameter we get optimization which is
deterministic and hence cannot be equivalent to quantilization with a different
proxy and distribution.
The reason to pick the quantilization parameter is that it's hard to
determine, as opposed to the proxy and base distribution[1] for which there are
concrete proposals with more-or-less clear motivation.
I don't understand which "main issues" you think this doesn't address. Can you
describe a concrete attack vector?
--------------------------------------------------------------------------------
1. If the base distribution is a bounded simplicity prior then it will have
some parameters, and this is truly a weakness of the protocol. Still, I
suspect that safety is less sensitive to these parameters and it is more
tractable to determine them by connecting our ultimate theories of AI with
brain science (i.e. looking for parameters which would mimic the
computational bounds of human cognition). ↩︎

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know of is Shalizi 2009; however, it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, mea

Learning theory starts from formulating natural desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.

Learning theory naturally allows analyzing computational complexity whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the onset.

Learning theory focuses on objects that are observable o

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max

My takeaway from this is that if we're doing policy selection in an environment
that contains predictors, instead of applying the counterfactual belief that the
predictor is always right, we can assume that we get rewarded if the predictor
is wrong, and then take maximin.
How would you handle Agent Simulates Predictor? Is that what TRL is for?
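
The "reward the agent if the predictor is wrong, then take maximin" idea can be sketched on a one-shot Newcomb problem. This is a toy illustration only: the dollar payoffs are the conventional ones and are an assumption here, as are the names `NIRVANA`, `PAYOFF`, `murphy_value` and `maximin_action`. The learning aspect of the quasi-Bayesian agent is not modelled at all.

```python
import math

NIRVANA = math.inf  # reward assigned whenever the predictor is wrong

# Conventional Newcomb payoffs (assumed), indexed by (action, prediction).
# The mismatched entries are listed only to show what the trick overrides.
PAYOFF = {
    ("one-box", "one-box"): 1_000_000,
    ("one-box", "two-box"): 0,
    ("two-box", "one-box"): 1_001_000,
    ("two-box", "two-box"): 1_000,
}
ACTIONS = ("one-box", "two-box")


def murphy_value(action):
    """Worst case over Murphy's choice of prediction, with errors sent to Nirvana."""
    outcomes = []
    for prediction in ACTIONS:
        if prediction == action:
            outcomes.append(PAYOFF[(action, prediction)])
        else:
            outcomes.append(NIRVANA)  # Murphy never benefits from a wrong prediction
    return min(outcomes)


def maximin_action():
    """Pick the action whose worst-case value is largest."""
    return max(ACTIONS, key=murphy_value)
```

Because a wrong prediction yields infinite reward, Murphy (the minimizer) always chooses the correct prediction, so the agent is effectively choosing between 1,000,000 for one-boxing and 1,000 for two-boxing.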

2Vanessa Kosoy3y

That's about right. The key point is, "applying the counterfactual belief that
the predictor is always right" is not really well-defined (that's why people
have been struggling with TDT/UDT/FDT for so long) while the thing I'm doing is
perfectly well-defined. I describe agents that are able to learn which
predictors exist in their environment and respond rationally ("rationally"
according to the FDT philosophy).
TRL is for many things to do with rational use of computational resources, such
as (i) doing multi-level modelling
[https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#vAtz6tfscsALGPr32]
in order to make optimal use of "thinking time" and "interacting with
environment time" (i.e. simultaneously optimize sample and computational
complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian
daemons (iv) preventing thought crimes. But, yes, it also provides a solution to
ASP
[https://www.alignmentforum.org/posts/S3W4Xrmp6AL7nxRHd/formalising-decision-theory-is-hard#FXt6z9ycAio9jFAtW].
TRL agents can learn whether it's better to be predictable or predicting.

1Chris_Leong3y

"The key point is, "applying the counterfactual belief that the predictor is
always right" is not really well-defined" - What do you mean here?
I'm curious whether you're referring to the same as or similar to the issue I
was referencing in Counterfactuals for Perfect Predictors
[https://www.lesswrong.com/posts/AKkFh3zKGzcYBiPo7/counterfactuals-for-perfect-predictors].
The TLDR is that I was worried that it would be inconsistent for an agent that
never pays in Parfit's Hitchhiker to end up in town if the predictor is
perfect, so that it wouldn't actually be well-defined what the predictor was
predicting. And the way I ended up resolving this was by imagining it as an
agent that takes input and asking what it would output if given that
inconsistent input. But not sure if you were referencing this kind of concern or
something else.

2Vanessa Kosoy3y

It is not a mere "concern", it's the crux of the problem really. What people in
the AI alignment community have been trying to do is to start with some factual
and "objective" description of the universe (such as a program or a mathematical
formula) and derive counterfactuals. The way it's supposed to work is, the
agent needs to locate all copies of itself or things "logically correlated" with
itself (whatever that means) in the program, and imagine it is controlling this
part. But a rigorous definition of this that solves all standard decision
theoretic scenarios was never found.
Instead of doing that, I suggest a solution of a different nature. In
quasi-Bayesian RL, the agent never arrives at a factual and objective
description of the universe. Instead, it arrives at a subjective description
which already includes counterfactuals. I then proceed to show that, in
Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the
same expected utility promised by UDT).

1Chris_Leong3y

Yeah, I agree that the objective descriptions can leave out vital information,
such as how the information you know was acquired, which seems important for
determining the counterfactuals.

1Vladimir Slepnev3y

But in Newcomb's problem, the agent's reward in case of wrong prediction is
already defined. For example, if the agent one-boxes but the predictor predicted
two-boxing, the reward should be zero. If you change that to +infinity, aren't
you open to the charge of formalizing the wrong problem?

1Vanessa Kosoy3y

The point is, if you put this "quasi-Bayesian" agent into an iterated
Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward
associated with FDT). So, if you're judging it from the side, you will have to
concede it behaves rationally, regardless of its internal representation of
reality.
Philosophically, my point of view is, it is an error to think that
counterfactuals have objective, observer-independent, meaning. Instead, we can
talk about some sort of consistency conditions between the different points of
view. From the agent's point of view, it would reach Nirvana if it dodged the
predictor. From Omega's point of view, if Omega two-boxed and the agent
one-boxed, the agent's reward would be zero (and the agent would learn its
beliefs were wrong). From a third-person point of view, the counterfactual
"Omega makes an error of prediction" is ill-defined, it's conditioning on an
event of probability 0.

1Vladimir Slepnev3y

Yeah, I think I can make peace with that. Another way to think of it is that we
can keep the reward structure of the original Newcomb's problem, but instead of
saying "Omega is almost always right" we add another person Bob (maybe the mad
scientist who built Omega) who's willing to pay you a billion dollars if you
prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess
the remaining question is why minimaxing is the right thing to do. And if
randomizing is allowed, the idea of Omega predicting how you'll randomize seems
a bit dodgy as well.

3Vanessa Kosoy3y

Another explanation why maximin is a natural decision rule: when we apply
maximin to fuzzy beliefs
[https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v],
the requirement to learn a particular class of fuzzy hypotheses is a very
general way to formulate asymptotic performance desiderata for RL agents. So
general that it seems to cover more or less anything you might want. Indeed, the
definition directly leads to capturing any desideratum of the form
$$\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq f(\mu)$$
Here, $f$ doesn't have to be concave: the concavity condition in the definition of
fuzzy beliefs is there because we can always assume it without loss of
generality. This is because the left hand side is linear in $\mu$, so any $\pi$ that
satisfies this will also satisfy it for the concave hull of $f$.
What if instead of maximin we want to apply the minimax-regret decision rule?
Then the desideratum is
$$\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq V(\mu,\gamma)-f(\mu)$$
But, it has the same form! Therefore we can consider it as a special case of
applying maximin (more precisely, it requires allowing the fuzzy belief to
depend on $\gamma$, but this is not a problem for the basics of the formalism).
What if we want our policy to be at least as good as some fixed policy $\pi'_0$? Then
the desideratum is
$$\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq\mathrm{E}_{\mu\pi'_0}[U(\gamma)]$$
It still has the same form!
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata
of the form:
$$\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq f(\pi,\mu)$$
To achieve this, we postulate a predictor that guesses the policy, producing the
guess $\hat{\pi}$, and define the fuzzy belief using the function
$\mathrm{E}_{h\sim\mu}[f(\hat{\pi}(h),\mu)]$ (we assume the guess is not influenced by
the agent's actions so we don't need $\pi$ in the expected value). Using the
Nirvana trick, we effectively force the guess to be accurate.
In particular, this captures self-referential desiderata of the type "the policy
cannot be improved by changing it in this particular way". These are of the
form:
$$\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq\mathrm{E}_{\mu F(\pi)}[U(\gamma)]$$
It also allow

1Vanessa Kosoy3y

Well, I think that maximin is the right thing to do because it leads to
reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think
of incomplete models as properties that the environment might satisfy. It is
necessary to speak of properties instead of complete models since the
environment might be too complex to understand in full (for example because it
contains Omega, but also for more prosaic reasons), but we can hope it at least
has properties/patterns the agent can understand. A quasi-Bayesian agent has the
guarantee that, whenever the environment satisfies one of the properties in its
prior, the expected utility will converge at least to the maximin for this
property. In other words, such an agent is able to exploit any true property of
the environment it can understand. Maybe a more "philosophical" defense of
maximin is possible, analogous to VNM / complete class theorems, but I don't
know (I actually saw some papers in that vein but haven't read them in detail.)
If the agent has random bits that Omega doesn't see, and Omega is predicting the
probabilities of the agent's actions, then I think we can still solve it with
quasi-Bayesian agents but it requires considering more complicated models and I
haven't worked out the details. Specifically, I think that we can define some
function X that depends on the agent's actions and Omega's predictions so far (a
measure of Omega's apparent inaccuracy), s.t. if Omega is an accurate predictor,
then the supremum of X over time is finite with probability 1. Then, we
consider a family of models, where model number n says that X<n for all
times. Since at least one of these models is true, the agent will learn it, and
will converge to behaving appropriately.
EDIT 1: I think X should be something like, how much money would a gambler
following a particular strategy win, betting against Omega.
EDIT 2: Here is the solution. In the case of original Newcomb, consider a
gambler that bets against Om

1Linda Linsefors3y

I agree that you can assign whatever belief you want (e.g. whatever is useful
for the agent's decision-making process) for what happens in the
counterfactual when Omega is wrong, in decision problems where Omega is assumed
to be a perfect predictor. However, if you want to generalise to cases where
Omega is an imperfect predictor (as you do mention), then I think you will (in
general) have to put in the correct reward for Omega being wrong, because this
is something that might actually be observed.

1Vanessa Kosoy3y

The method should work for imperfect predictors as well. In the simplest case,
the agent can model the imperfect predictor as a perfect predictor + random noise.
So, it definitely knows the correct reward for Omega being wrong. It still
believes in Nirvana if "idealized Omega" is wrong.

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...}. At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re...

I think that in embedded settings (with a bounded version of Solomonoff
induction) convergence may never occur, even in the limit as the amount of
compute that is used for executing the agent goes to infinity. Suppose the
observation history contains sensory data that reveals the probability
distribution that the agent had, in the last time step, for the next number it's
going to see in the target sequence. Now consider the program that says: "if the
last number was predicted by the agent to be 0 with probability larger than
$1-2^{-10^{10}}$ then the next number is 1; otherwise it is 0." Since it takes much
less than $10^{10}$ bits to write that program, the agent will never predict two
times in a row that the next number is 0 with probability larger than
$1-2^{-10^{10}}$ (after observing only 0s so far).

There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.

We want algorithms that learn domain D while gaining as little knowledge as possible about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space X, label space Y, distribution μ∈Δ(X×Y) and loss function L:Y×Y→R. Similarly, domain E is represented by inst...

The above threat model seems too paranoid: it is defending against an adversary
that sees the trained model and knows the training algorithm. In our
application, the model itself is either dangerous or not, independently of the
training algorithm that produced it.
Let $\epsilon>0$ be our accuracy requirement for the target domain. That is, we want
$f:X\to Y$ s.t.
$$\mathrm{E}_{xy\sim\mu}[L(y,f(x))]\leq\min_{f':X\to Y}\mathrm{E}_{xy\sim\mu}[L(y,f'(x))]+\epsilon$$
Given any $f:X\to Y$, denote $\zeta_{f,\epsilon}$ to be $\zeta$ conditioned on the
inequality above, where $\mu$ is regarded as a random variable. Define
$B_{f,\epsilon}:(Z\times W)^*\times Z\to W$ by
$$B_{f,\epsilon}(T,z):=\operatorname*{argmin}_{w\in W}\mathrm{E}_{\nu\sim\zeta_{f,\epsilon},\,T'z'w'\sim\nu^{|T|+1}}[M(w',w)\mid T'=T,\,z'=z]$$
That is, $B_{f,\epsilon}$ is the Bayes-optimal learning algorithm for domain E w.r.t.
prior $\zeta_{f,\epsilon}$.
Now, consider some $A:(X\times Y)^*\times(Z\times W)^*\times X\to Y$. We regard $A$ as a
learning algorithm for domain D which undergoes "antitraining" for domain E: we
provide it with a dataset for domain E that tells it what not to learn. We
require that $A$ achieves asymptotic accuracy $\epsilon$[1], i.e. that if $\mu$ is
sampled from $\zeta$ then with probability 1
$$\lim_{n\to\infty}\sup_{T\in(Z\times W)^*}\mathrm{E}_{Sxy\sim\mu^{n+1}}[L(y,A(S,T,x))]\leq\min_{f:X\to Y}\mathrm{E}_{xy\sim\mu}[L(y,f(x))]+\epsilon$$
Under this constraint, we want $A$ to be as ignorant as possible about domain E,
which we formalize as maximizing $\mathrm{IG}^A$ defined by
$$\mathrm{IG}^A_{nm}:=\mathrm{E}_{\mu\nu\sim\zeta,\,S\sim\mu^n,\,Tzw\sim\nu^{m+1}}[M(w,B_{A(S,T),\epsilon}(T,z))]$$
It is actually important to consider $m>0$ because in order to exploit the
knowledge of the model about domain E, an adversary needs to find the right
embedding of this domain into the model's "internal language". For $m=0$ we can
get high IG despite the model actually knowing domain E because the adversary $B$
doesn't know the embedding, but for $m>0$ it should be able to learn the embedding
much faster than learning domain E from scratch.
We can imagine a toy example where $X=Z=\mathbb{R}^d$, the projections of $\mu$ and
$\nu$ to $X$ and $Z$ respectively are distributions concentrated around two affine
subspaces, $Y=W=\{-1,+1\}$ and the labels are determined by the sign of a
polynomial which is the same for $\mu$ and $\nu$ up to a linear transformation
$\alpha:\mathbb{R}^d\to\mathbb{R}^d$ which is a ran

Epistemic status: no claims to novelty, just (possibly) useful terminology.

[EDIT: I increased all the class numbers by 1 in order to admit a new definition of "class I", see child comment.]

I propose a classification of AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the "potential" class and the latter the "effective" class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems that are effectively class 0 (or at least I-II).

Class II: Systems that only ever receive synthetic data that has nothing to do with the real world

Examples:

AI that is trained to learn Go by self-play

AI that is trained to prove random mathematical statements

AI that is trained to make rapid predictions of future cell states in the game of life for random initial conditions

AI that is trained to find regularities in sequences corresponding to random programs on some natural universal Turing machin

The idea comes from this
[https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty?commentId=sabukDmYbLw2WNeEv]
comment of Eliezer.
Class II or higher systems might admit an attack vector by daemons that infer
the universe from the agent's source code. That is, we can imagine a malign
hypothesis that makes a treacherous turn after observing enough past actions to
infer information about the system's own source code and infer the physical
universe from that. (For example, in a TRL setting it can match the actions to
the output of a particular program for envelope.) Such daemons are not as
powerful as malign simulation hypotheses
[https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/],
since their prior probability is not especially large (compared to the true
hypothesis), but might still be non-negligible. Moreover, it is not clear
whether the source code can realistically have enough information to enable an
attack, but the opposite is not entirely obvious.
To account for this I propose to designate as class I those systems which don't
admit this attack vector. For the potential sense, it means that either (i) the
system's design is too simple to enable inferring much about the physical
universe, or (ii) there is no access to past actions (including opponent actions
for self-play) or (iii) the label space is small, which means an attack requires
making many distinct errors, and such errors are penalized quickly. And ofc it
requires no direct access to the source code.
We can maybe imagine an attack vector even for class I systems, if most
metacosmologically
[https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=N8oamtFAhWKEbyCBq]
plausible universes are sufficiently similar, but this is not very likely.
Nevertheless, we can reserve the label class 0 for systems that explicitly rule
out even such attacks.

One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history h∈(A×O)∗ corresponds

There is a deficiency in this "dynamically subjective" regret bound (also can be
called "realizable misalignment" bound) as a candidate formalization of
alignment. It is not robust to scaling down
[https://www.alignmentforum.org/posts/bBdfbWfWxHN9Chjcq/robustness-to-scale]. If
the AI's prior allows it to accurately model the user's beliefs (realizability
assumption), then the criterion seems correct. But, imagine that the user's
beliefs are too complex and an accurate model is not possible. Then the
realizability assumption is violated and the regret bound guarantees nothing.
More precisely, the AI may use incomplete models
[https://www.alignmentforum.org/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda]
to capture some properties of the user's beliefs and exploit them, but this
might not be good enough. Therefore, such an AI might fall into a dangerous zone
when it is powerful enough to cause catastrophic damage but not powerful enough
to know it shouldn't do it.
To fix this problem, we need to introduce another criterion which has to hold
simultaneously with the misalignment bound. We need that for any reality that
satisfies the basic assumptions built into the prior (such as, the baseline
policy is fairly safe, most questions are fairly safe, human beliefs don't
change too fast etc.), the agent will not fail catastrophically. (It would be way
too much to ask that it converge to optimality, since that would violate
no-free-lunch.) In order to formalize "not fail catastrophically" I propose the
following definition.
Let's start with the case when the user's preferences and beliefs are
dynamically consistent. Consider some AI-observable event S that might happen in
the world. Consider a candidate learning algorithm πlearn and two auxiliary
policies. The policy πbase→S follows the baseline policy until S happens, at
which time it switches to the subjectively optimal policy. The policy πlearn→S
follows the candidate learning algorithm until

1Alex Turner3y

This seems quite close (or even identical) to attainable utility preservation
[https://arxiv.org/abs/1902.09725]; if I understand correctly, this echoes
arguments I've made
[https://www.lesswrong.com/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure#wXHJArzDPoYejHuz2]
for why AUP has a good shot of avoiding catastrophes and thereby getting you
something which feels similar to corrigibility.

1Vanessa Kosoy3y

There is some similarity, but there are also major differences. They don't even
have the same type signature. The dangerousness bound is a desideratum that any
given algorithm can either satisfy or not. On the other hand, AUP is a specific
heuristic how to tweak Q-learning. I guess you can consider some kind of regret
bound w.r.t. the AUP reward function, but they will still be very different
conditions.
The reason I pointed out the relation to corrigibility is not because I think
that's the main justification for the dangerousness bound. The motivation for
the dangerousness bound is quite straightforward and self-contained: it is a
formalization of the condition that "if you run this AI, this won't make things
worse than not running the AI", no more and no less. Rather, I pointed the
relation out to help readers compare it with other ways of thinking they might
be familiar with.
From my perspective, the main question is whether satisfying this desideratum is
feasible. I gave some arguments why it might be, but there are also opposite
arguments. Specifically, if you believe that debate is a necessary component of
Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can
become certain that the user would respond in a particular way to a query, but
it cannot become (worst-case) certain that the user would not change eir
response when faced with some rebuttal. You can't (empirically and in the
worst-case) prove a negative.

1Vanessa Kosoy3y

Dialogic RL assumes that the user has beliefs about the AI's ontology. This
includes the environment(fn1) from the AI's perspective. In other words, the
user needs to have beliefs about the AI's counterfactuals (the things that would
happen if the AI chooses different possible actions). But, what are the
semantics of the AI's counterfactuals from the user's perspective? This is more
or less the same question that was studied by the MIRI-sphere for a while,
starting from Newcomb's paradox, TDT et cetera. Luckily, I now have an answer
[https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/shortform#TzkG7veQAMMRNh3Pg]
based on the incomplete models formalism. This answer can be applied in this
case also, quite naturally.
Specifically, we assume that there is a sense, meaningful to the user, in which
ey select the AI policy (program the AI). Therefore, from the user's
perspective, the AI policy is a user action. Again from the user's perspective,
the AI's actions and observations are all part of the outcome. The user's
beliefs about the user's counterfactuals can therefore be expressed as
σ:Π→Δ(A×O)^ω(fn2), where Π is the space of AI policies(fn3). We assume that for
every π∈Π, σ(π) is consistent with π in the natural sense. Such a belief can be
transformed into an incomplete model from the AI's perspective, using the same
technique we used to solve Newcomb-like decision problems, with σ playing the
role of Omega. For a deterministic AI, this model looks like (i) at first,
"Murphy" makes a guess that the AI's policy is π=πguess (ii) The environment
behaves according to the conditional measures of σ(πguess) (iii) If the AI's
policy ever deviates from πguess, the AI immediately enters an eternal "Nirvana"
state with maximal reward. For a stochastic AI, we need to apply the technique
with statistical tests and multiple models alluded to in the link. This can also
be generalized to the setting where the user's beliefs are already an incomplete
model, by adding another step wh
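
The deterministic construction in steps (i)-(iii) above can be sketched in code. This is only an illustration under simplifying assumptions that are mine, not the original's: a finite horizon, average rather than discounted reward, a deterministic stand-in `sigma_step` for the conditional measures of σ(πguess), and Nirvana modeled as a flag that yields reward 1 forever. All function and type names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

History = Tuple[Tuple[int, int], ...]   # sequence of (action, observation) pairs
Policy = Callable[[History], int]       # deterministic AI policy
# Hypothetical stand-in for the conditional measures of sigma(pi_guess):
# next observation given the guessed policy, the history so far and the action.
SigmaStep = Callable[[Policy, History, int], int]
Reward = Callable[[History], float]     # rewards in [0, 1]


def run_model(ai_policy: Policy, pi_guess: Policy, sigma_step: SigmaStep,
              reward: Reward, horizon: int) -> List[float]:
    """Roll out the model in which Murphy guessed that the AI's policy is pi_guess."""
    h: History = ()
    in_nirvana = False
    rewards: List[float] = []
    for _ in range(horizon):
        if not in_nirvana:
            a = ai_policy(h)
            if a != pi_guess(h):
                # (iii) Deviation from the guess: eternal Nirvana, maximal reward.
                in_nirvana = True
            else:
                # (ii) Otherwise the environment follows sigma(pi_guess).
                o = sigma_step(pi_guess, h, a)
                h = h + ((a, o),)
        rewards.append(1.0 if in_nirvana else reward(h))
    return rewards


def murphy_value(ai_policy: Policy, guesses: Iterable[Policy], sigma_step: SigmaStep,
                 reward: Reward, horizon: int) -> float:
    """(i) Murphy picks the guess that is worst for the AI (average reward here)."""
    return min(
        sum(run_model(ai_policy, g, sigma_step, reward, horizon)) / horizon
        for g in guesses
    )
```

Because a wrong guess sends the AI to Nirvana, Murphy's minimizing choice in this sketch is a guess that matches the AI's actual policy, so the AI is effectively evaluated on σ applied to its own policy.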

1Vanessa Kosoy3y

Another notable feature of this approach is its resistance to "attacks from the
future", as opposed to approaches based on forecasting. In the latter, the AI
has to predict some future observation, for example what the user will write
after working on some problem for a long time. In particular, this is how the
distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster
might sample a future in which a UFAI has been instantiated and this UFAI will
exploit this to infiltrate the present. This might result in a self-fulfilling
prophecy, but even if the forecasting is counterfactual (and thus immune to
self-fulfilling prophecies) it can be attacked by a UFAI that came to be for
unrelated reasons. We can ameliorate this by making the forecasting recursive
(i.e. apply multiple distillation & amplification steps) or use some other
technique to compress a lot of "thinking time" into a small interval of physical
time. However, this is still vulnerable to UFAIs that might arise already at
present with a small probability rate (these are likely to exist since our
putative FAI is deployed at a time when technology progressed enough to make
competing AGI projects a real possibility).
Now, compare this to Dialogic RL, as defined via the framework of dynamically
inconsistent beliefs. Dialogic RL might also employ forecasting to sample the
future, presumably more accurate, beliefs of the user. However, if the user is
aware of the possibility of a future attack, this possibility is reflected in
eir beliefs, and the AI will automatically take it into account and deflect it
as much as possible.

1Vanessa Kosoy3y

This approach also obviates the need for an explicit commitment mechanism.
Instead, the AI uses the current user's beliefs about the quality of future user
beliefs to decide whether it should wait for user's beliefs to improve or commit
to an irreversible course of action. Sometimes it can also predict the future
user beliefs instead of waiting (predict according to current user beliefs
updated by the AI's observations).

In my previous shortform, I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.

In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating

A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but related. Therefore, it is desirable to not only solve each separately, but also to have an elegant synthesis of the solutions.

Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).

Currently I only have speculations about the solution. But, I have a few desiderata for it:

It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL, the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements in a set. The formalization of an "ontology" should probably be a POMDP or a suitable


An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms^{[1]} might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert... (read more)

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider a (some extension of) delegative IRL setting (i.e. there is a single set of input/output channels the control of which can be toggled between the user and the AI by the AI). Let π^u_υ be the user's policy in universe υ and π^a the AI policy. Let T be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability 1 for any policy. Let V_υ be the value of a state from the user's subjective POV, in universe υ. Let μ_υ be the environment in universe υ. Finally, let ζ be the AI's prior over universes and ϵ... (read more)

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions A and a set of observations O; a policy is a mapping π:(A×O)∗→ΔA, a reward function is a mapping r:(A×O)∗→[0,1], and the utility function is a time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy π in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows π, or the prior can believe that behavior not according to π leads to some terrible outcome.
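
To make the vacuity concrete, here is a minimal sketch of the first construction: given any policy, build a reward function under which that policy is trivially optimal. It is an illustration only, and it assumes deterministic policies, integer actions and observations, and finite rollouts; the names `make_contrived_reward`, `discounted_utility` and `rollout` are hypothetical.

```python
from typing import Callable, Tuple

History = Tuple[Tuple[int, int], ...]   # sequence of (action, observation) pairs
Policy = Callable[[History], int]       # deterministic policy (an assumption)
Reward = Callable[[History], float]     # r : (A x O)* -> [0, 1]
Env = Callable[[History, int], int]     # deterministic environment (an assumption)


def make_contrived_reward(pi: Policy) -> Reward:
    """Reward 1 iff every action in the history is the one pi would have taken."""
    def reward(h: History) -> float:
        for t, (a, _o) in enumerate(h):
            if pi(h[:t]) != a:
                return 0.0
        return 1.0
    return reward


def discounted_utility(r: Reward, h: History, gamma: float) -> float:
    """Normalized time-discounted sum of rewards along a finite history."""
    return (1 - gamma) * sum(gamma ** t * r(h[: t + 1]) for t in range(len(h)))


def rollout(pi: Policy, env: Env, horizon: int) -> History:
    """Run a policy against an environment for a fixed number of steps."""
    h: History = ()
    for _ in range(horizon):
        a = pi(h)
        h = h + ((a, env(h, a)),)
    return h
```

Under `make_contrived_reward(pi)`, the policy `pi` collects reward 1 at every step in any environment, so it is optimal for every prior, no matter how arbitrary its behavior looks from the outside.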

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript... (read more)

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.

A possible counterargument is, we don't need to depart far from Bayesianis... (read more)

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ... (read more)

Master post for alignment protocols.

Other relevant shortforms:

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for "malign AI is attacking", but that still leaves us open to surprise takeovers in which there is no chance to press the button.

Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a "groundhog day" scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than... (read more)

Master post for ideas about infra-Bayesianism.

Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" should we expect the agent to acquire.

Does mathematics have finite information content?

First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of those new facts are essentially random noise, rather than "meaningful" information?

Is there a difference of principle between "noise" and "meaningful content"? It is not obvious, but the answer is "yes": in algorithmic statistics there is the notion of "sophistication" which measures how much "non-random" information is contained in some data. In our setting, the question can be operationalized as follows: is it possible to have an algorithm A plus an infinite sequence of bits R, s.t. R is random in some formal sense (e.g. Martin-Löf) and A can decide the output of any finite computation if it's also given access to R?

The answer to th... (read more)

Some thoughts about embedded agency.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows:

What kind of agent, and in what conditions, can effectively plan for events after its own death?

For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent^{[1]}. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology... (read more)

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati... (read more)

A major impediment in applying RL theory to any realistic scenario is that even the control problem^{[1]} is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:

Epistemic status: most elements are not new, but the synthesis seems useful.

Here is an alignment protocol that I call "autocalibrated quantilized debate" (AQD).

Arguably the biggest concern with naive debate^{[1]} is that perhaps a superintelligent AI can attack a human brain in a manner that takes it out of the regime of quasi-rational reasoning altogether, in which case the framing of "arguments and counterargument" doesn't make sense anymore. Let's call utterances that have this property "Lovecraftian". To counter this, I suggest using quantilization. Quanti... (read more)

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know is Shalizi 2009, however it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.
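
For the offline case, the realizable/agnostic distinction can be illustrated with a toy example. The sketch below is my own and not part of the original argument: it assumes a hypothesis class of threshold classifiers on [0, 1], 0-1 loss, and empirical risk minimization over a finite grid of thresholds. In the realizable dataset the labels come from a threshold in the class; in the agnostic one they come from an interval, which no threshold represents.

```python
import random
from typing import Callable, List, Tuple

Sample = List[Tuple[float, int]]


def threshold(theta: float) -> Callable[[float], int]:
    """Hypothesis h_theta(x) = 1 if x >= theta else 0."""
    return lambda x: int(x >= theta)


def empirical_risk(h: Callable[[float], int], sample: Sample) -> float:
    """Fraction of sample points that h misclassifies (0-1 loss)."""
    return sum(h(x) != y for x, y in sample) / len(sample)


def erm(thetas: List[float], sample: Sample) -> float:
    """Empirical risk minimization over a finite grid of thresholds."""
    return min(thetas, key=lambda th: empirical_risk(threshold(th), sample))


rng = random.Random(0)
xs = [rng.random() for _ in range(500)]
thetas = [i / 20 for i in range(21)]

# Realizable: the labeling function (threshold at 0.35) is in the class.
realizable = [(x, int(x >= 0.35)) for x in xs]
# Agnostic: labels come from an interval, which no threshold can express.
agnostic = [(x, int(0.3 <= x < 0.7)) for x in xs]

realizable_risk = empirical_risk(threshold(erm(thetas, realizable)), realizable)
agnostic_risk = empirical_risk(threshold(erm(thetas, agnostic)), agnostic)
print("realizable ERM risk:", realizable_risk)
print("agnostic ERM risk (best approximation in class):", agnostic_risk)
```

In the realizable case ERM can drive the empirical risk to zero; in the agnostic case the best it can do is match the best approximation in the class, whose risk stays bounded away from zero.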

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B... (read more)

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.
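
As a toy illustration of why traps are troublesome for learning (my own example, with hypothetical names, not taken from the original), consider two candidate environments that differ only in which of two first actions is a trap. The trap yields reward 0 forever and the other action yields reward 1 forever, so the agent gets no chance to learn which environment it is in before committing.

```python
def value(first_action: int, trap_action: int,
          gamma: float = 0.99, horizon: int = 2000) -> float:
    """Normalized discounted reward: 1 per step, unless the first action was the trap."""
    per_step = 0.0 if first_action == trap_action else 1.0
    return (1 - gamma) * sum(gamma ** t * per_step for t in range(horizon))


def worst_case_regret(first_action: int,
                      gamma: float = 0.99, horizon: int = 2000) -> float:
    """Largest regret of a first action over the two candidate environments."""
    regrets = []
    for trap_action in (0, 1):
        optimal = max(value(a, trap_action, gamma, horizon) for a in (0, 1))
        regrets.append(optimal - value(first_action, trap_action, gamma, horizon))
    return max(regrets)


for a in (0, 1):
    print("first action", a, "worst-case regret", worst_case_regret(a))
```

Whichever first action is chosen, the regret in one of the two environments is close to the maximum and does not shrink as γ→1, which is the sense in which no amount of later learning can undo the mistake.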

In general, a prior that contains traps will be unlearnable, mea... (read more)

In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular in MIRI (see also discussion):

desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max... (read more)

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...} At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re... (read more)

There have been some arguments coming from MIRI that we should be designing AIs that are good at e.g. engineering while not knowing much about humans, so that the AI cannot manipulate or deceive us. Here is an attempt at a formal model of the problem.

We want algorithms that learn domain D while gaining as little knowledge as possible about domain E. For simplicity, let's assume the offline learning setting. Domain D is represented by instance space X, label space Y, distribution μ∈Δ(X×Y) and loss function L:Y×Y→R. Similarly, domain E is represented by inst... (read more)

Epistemic status: no claims to novelty, just (possibly) useful terminology.

[EDIT: I increased all the class numbers by 1 in order to admit a new definition of "class I", see child comment.]

I propose a classification on AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the "potential" class and the latter the "effective" class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems that are effectively class 0 (or at least I-II).

Class II: Systems that only ever receive synthetic data that has nothing to do with the real world

Examples:
