My Current Take on Counterfactuals

I lean towards some kind of finitism or constructivism, and am skeptical of utility functions which involve unbounded quantifiers. But also, how does LI help with the procrastination paradox? I don't think I've seen this result.

My Current Take on Counterfactuals

Yes, I'm pretty sure we have that kind of completeness. Obviously representing all hypotheses in this opaque form would give you poor sample and computational complexity, but you can do something midway: use black-box programs as components in your hypothesis but also have some explicit/transparent structure.

Updating the Lottery Ticket Hypothesis

IIUC, here's a simple way to test this hypothesis: initialize a random neural network, and then find the minimal loss point *in the tangent space*. Since the tangent space is linear, this is easy to do (i.e. doesn't require heuristic gradient descent): for square loss it's just solving a large linear system once, for many other losses it should amount to *convex* optimization for which we have provable efficient algorithms. And, I guess it's underdetermined so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I'm guessing some of the linked papers might have done something like this?
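A minimal sketch of this test on a toy 1-D regression problem (my construction, not from any of the linked papers): linearize a tiny randomly initialized MLP around its init via a finite-difference Jacobian, then fit in the tangent space with a single least-squares solve. Since the system is underdetermined, the minimum-norm solution stands in for the regularization mentioned above (the ridge limit λ → 0).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data.
X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = np.sin(3 * X).ravel()

# Tiny randomly-initialized MLP f(x; theta) with one hidden layer.
H = 16  # hidden width

def unpack(theta):
    W1 = theta[:H].reshape(H, 1)
    b1 = theta[H:2 * H]
    W2 = theta[2 * H:3 * H].reshape(1, H)
    b2 = theta[3 * H]
    return W1, b1, W2, b2

def f(theta, X):
    W1, b1, W2, b2 = unpack(theta)
    h = np.tanh(X @ W1.T + b1)
    return (h @ W2.T).ravel() + b2

theta0 = rng.normal(size=3 * H + 1)

# Jacobian of f w.r.t. theta at theta0, by central differences.
eps = 1e-5
J = np.zeros((len(X), len(theta0)))
for j in range(len(theta0)):
    d = np.zeros_like(theta0)
    d[j] = eps
    J[:, j] = (f(theta0 + d, X) - f(theta0 - d, X)) / (2 * eps)

# Tangent-space model: f(theta0) + J @ (theta - theta0).  For square
# loss, minimizing over the tangent space is one linear solve; lstsq
# returns the minimum-norm solution of the underdetermined system.
r = y - f(theta0, X)
dtheta, *_ = np.linalg.lstsq(J, r, rcond=None)

mse = np.mean((f(theta0, X) + J @ dtheta - y) ** 2)
```

Comparing `mse` against what gradient descent in the actual parameter space achieves on the same data would be the test proposed above.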

This basically matches my current understanding. (Though I'm not strongly
confident in my current understanding.) I believe the GP results are basically
equivalent to this, but I haven't read up on the topic enough to be sure.

My Current Take on Counterfactuals

So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters.

I'm not convinced this is the right desideratum for that purpose. Why should we care about exploitability by traders if making such trades is not actually possible given the environment and the utility function? IMO epistemic rationality is subservient to instrumental rationality, so our desiderata should be derived from the latter.


It's clear that you understand logical induction pretty well, so while I feel
like you're missing something, I'm not clear on what that could be.
I think maybe the more fruitful branch of this conversation (as opposed to me
trying to provide an instrumental justification for radical probabilism, though
I'm still interested in that) is the question of describing the human utility
function.
The logical induction picture isn't strictly at odds with a platonic utility
function, I think, since we can consider the limit. (I only claim that this
isn't the best way to think about it in general, since Nature didn't decide a
platonic utility function for us and then design us such that our reasoning has
the appropriate limit.)
For example, one case which to my mind argues in favor of the logical induction
approach to preferences: the procrastination paradox. All you want to do is
ensure that the button is pressed at some point. This isn't a particularly
complex or unrealistic preference for an agent to have. Yet, it's unclear how to
make computable beliefs think about this appropriately. Logical induction
provides a theory about how to think about this kind of goal. (I haven't thought
much about how TRL would handle it.)
Agree or disagree: agents can sensibly pursue Δ_2 objectives? And, do you think
that question is cruxy for you?
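To make the procrastination paradox mentioned above concrete, here is a toy simulation (my illustration, not from the comment): the goal is only that the button gets pressed at *some* point, and at every step the agent believes "if I defer now, I will surely press later", so deferring never looks costly.

```python
# Toy procrastination paradox.  The objective is the Sigma_1 condition
# "the button is pressed at some point".
def choose(believes_future_press=True):
    value_press_now = 1.0                        # goal achieved for sure
    value_defer = 1.0 if believes_future_press else 0.0
    # Deferring loses nothing in expectation, so the agent defers.
    return "press" if value_press_now > value_defer else "defer"

history = [choose() for _ in range(10_000)]
pressed = "press" in history
# Every individual decision looked fine to the agent, yet in the limit
# the button is never pressed: the belief "it will be pressed
# eventually" is never cashed out.
```

The difficulty, as stated above, is getting computable beliefs to handle this kind of limit-goal appropriately rather than endorsing eternal deferral.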

So, one point is that the InfraBayes picture still gives epistemics an important
role: the kind of guarantee arrived at is a guarantee that you won't do too much
worse than the most useful partial model expects. So, we can think about
generalized partial models which update by thinking longer in addition to taking
in sense-data.
I suppose TRL can model this by observing what those computations would say, in
a given situation, and using partial models which only "trust computation X"
rather than having any content of their own. Is this "complete" in an
appropriate sense? Can we always model a would-be radical-infrabayesian as a TRL
agent observing what that radical-infrabayesian would think?
Even if true, there may be a significant computational complexity gap between
just doing the thing vs modeling it in this way.

This does make sense to me, and I view it as a weakness of the idea. However,
the productivity of dutch-book type thinking in terms of implying properties
which seem appealing for other reasons speaks heavily in favor of it, in my
mind. A formal connection to more pragmatic criteria would be great.
But also, maybe I can articulate a radical-probabilist position without any
recourse to dutch books... I'll have to think more about that.
I'm not sure how to double crux with this intuition, unfortunately. When I
imagine the perspective you describe, I feel like it's rolling all dynamic
inconsistency into time-preference and ignoring the role of deliberation.
My claim is that there is a type of change-over-time which is due to
boundedness, and which looks like "dynamic inconsistency" from a classical
bayesian perspective, but which isn't inherently dynamically inconsistent. EG,
if you "sleep on it" and wake up with a different, firmer-feeling perspective,
without any articulable thing you updated on. (My point isn't to dogmatically
insist that you haven't updated on anything, but rather, to point out that it's
useful to have the perspective where we don't need to suppose there was evidence
which justifies the update as Bayesian, in order for it to be rational.)

My Current Take on Counterfactuals

I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don't know what would result.

From a radical-probabilist perspective, the complaint would be that Turing RL still uses the InfraBayesian update rule, which might not always be necessary to be rational (the same way Bayesian updates aren't always necessary).

Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).
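A minimal numpy illustration of "radical = Bayesian with secret evidence" (my construction): put a joint prior over a physics hypothesis and a computation outcome; conditioning on the computation moves the physics marginal even though no physical observation was made, so the physics-only view sees a "radical" update.

```python
import numpy as np

# Joint prior over (physics hypothesis P, computation outcome C),
# each binary.  Rows index P, columns index C; they are correlated.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

physics_prior = joint.sum(axis=1)        # marginal over P: [0.5, 0.5]

# "Secret evidence": the agent finishes the computation and sees C = 1.
physics_posterior = joint[:, 1] / joint[:, 1].sum()

# No physical observation occurred, yet the physics marginal moved
# from [0.5, 0.5] to [0.2, 0.8]: restricted to its beliefs about
# physics, the agent updated "radically"; viewed on the joint space,
# it was an ordinary Bayesian update.
```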

I agree that radical probabilism can be thought of as
bayesian-with-a-side-channel, but it's nice to have a more general
characterization where the side channel is black-box, rather than an explicit
side-channel which we explicitly update on. This gives us a picture of the space
of rational updates. EG, the logical induction criterion allows for a large
space of things to count as rational. We get to argue for constraints on
rational behavior by pointing to the existence of traders which enforce those
constraints, while being agnostic about what's going on inside a logical
inductor. So we have this nice picture, where rationality is characterized by
non-exploitability wrt a specific class of potential exploiters.
Here's an argument for why this is an important dimension to consider:
1. Human value-uncertainty is not particularly well-captured by Bayesian
uncertainty, as I imagine you'll agree. One particular complaint is
realizability: we have no particular reason to assume that human preferences
are within any particular space of hypotheses we can write down.
2. One aspect of this can be captured by InfraBayes: it allows us to eliminate
the realizability assumption, instead only assuming that human preferences
fall within some set of constraints which we can describe.
3. However, there is another aspect to human preference-uncertainty: human
preferences change over time. Some of this is irrational, but some of it is
legitimate philosophical deliberation.
4. And, somewhat in the spirit of logical induction, humans do tend to
eventually address the most egregious irrationalities.
5. Therefore, I tend to think that toy models of alignment (such as CIRL, DRL,
DIRL) should model the human as a radical probabilist; not because it's a
perfect model, but because it constitutes a major incremental improvement
wrt modeling what kind of uncertainty humans have over our own preferences.
Recognizing preferences as a thing whic

My Current Take on Counterfactuals

I only skimmed this post for now, but a few quick comments on links to infra-Bayesianism:

InfraBayes doesn’t seem to have that worry, since it applies to non-realizable cases. (Or does it? Is there some kind of non-oscillation guarantee? Or is non-oscillation part of what it means for a set of environments to be learnable -- IE it can oscillate in some cases?)... AFAIK the conditions for learnability in the InfraBayes case are still pretty wide open.

It's true that these questions still need work, but I think it's rather clear that something like "there ... (read more)

Is there a way to operationalize "respecting logic"? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

"Respect logic" means either (a) assigning probability one to tautologies (at least, to those which can be proved in some bounded proof-length, or something along those lines), or, (b) assigning probability zero to contradictions (again, modulo boundedness). These two properties should be basically equivalent (ie, imply each other) provided the proof system is consistent. If it's inconsistent, they i... (read more)
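A brute-force sketch of condition (a), ignoring the proof-length bound (my illustration, with formulas encoded as Python predicates): a belief "respects logic" in this sense if it assigns probability one to every formula that comes out true under all assignments.

```python
from itertools import product

def is_tautology(formula, n_vars):
    """Brute-force check: true under every truth assignment."""
    return all(formula(*vals) for vals in product([False, True], repeat=n_vars))

# Excluded middle is a tautology; a bare atom is not; and a formula is
# a contradiction exactly when its negation is a tautology.
lem = lambda p: p or not p
atom = lambda p: p
contradiction = lambda p: p and not p

lem_is_taut = is_tautology(lem, 1)
atom_is_taut = is_tautology(atom, 1)
contra_refuted = is_tautology(lambda p: not contradiction(p), 1)
# Sense (a): P(phi) = 1 whenever is_tautology(phi) holds.
# Sense (b) then forces P(phi) = 0 whenever phi is a contradiction,
# matching the claimed equivalence (given consistency).
```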

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

From your reply to Paul, I understand your argument to be something like the following:

- Any solution to single-single alignment will involve a tradeoff between alignment and capability.
- If AIs systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
- If AI systems *are* designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
- Given the technical knowledge to design c

Formal Solution to the Inner Alignment Problem

I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual "unipolar" sense. These management assistants, DAOs etc *are not aligned to the goals of their respective, individual users/owners*.

I do see two reasons why multipolar scenarios might require more technical research:

- Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient wa

I don't mean to say this post warrants a new kind of AI alignment research, and
I don't think I said that, but perhaps I'm missing some kind of subtext I'm
inadvertently sending?
I would say this post warrants research on multi-agent RL and/or AI social
choice and/or fairness and/or transparency, none of which are "new kinds" of
research (I promoted them heavily in my preceding post), and none of which I
would call "alignment research" (though I'll respect your decision to call all
these topics "alignment" if you consider them that).
I would say, and I did say:
I do hope that the RAAP concept can serve as a handle for noticing structure in
multi-agent systems, but again I don't consider this a "new kind of research",
only an important/necessary/neglected kind of research for the purposes of
existential safety. Apologies if I seemed more revolutionary than intended.
Perhaps it's uncommon to take a strong position of the form "X is
necessary/important/neglected for human survival" without also saying "X is a
fundamentally new type of thinking that no one has done before", but that is
indeed my stance for X ∈ {a variety of non-alignment AI research areas
[https://www.lesswrong.com/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1]}.

How are you inferring this? From the fact that a negative outcome eventually
obtained? Or from particular misaligned decisions each system made? It would be
helpful if you could point to a particular single-agent decision in one of the
stories that you view as evidence of that single agent being highly misaligned
with its user or creator. I can then reply with how I envision that decision
being made even with high single-agent alignment.
Yes, this^.

Formal Solution to the Inner Alignment Problem

Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

Yes, you're right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the "bridge rules" by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here's the sketch of a proposal how to solve this. Let's construct our prior to be ... (read more)

I broadly think of this approach as "try to write down the 'right' universal
prior." I don't think the bridge rules / importance-weighting consideration is
the only way in which our universal prior is predictably bad. There are also
issues like anthropic update and philosophical considerations about what kind of
"programming language" to use and so on.
I'm kind of scared of this approach because I feel unless you really nail
everything there is going to be a gap that an attacker can exploit. I guess you
just need to get close enough that εδ is manageable but I think I still find it
scary (and don't totally remember all my sources of concern).
I think of this in contrast with my approach based on epistemic competitiveness,
where the idea is not necessarily to identify these considerations in
advance, but to be epistemically competitive with an attacker (inside one of
your hypotheses) who has noticed an improvement over your prior. That is, if
someone inside one of our hypotheses has noticed that e.g. a certain class of
decisions is more important and so they will simulate only those situations,
then we should also notice this and by the same token care more about our
decision if we are in one of those situations (rather than using a universal
prior without importance weighting). My sense is that without competitiveness we
are in trouble anyway on other fronts, and so it is probably also reasonable to
think of as a first-line defense against this kind of issue.
This is very similar to what I first thought about when going down this line. My
instantiation runs into trouble with "giant" universes that do all the possible
computations you would want, and then using the "free" complexity in the bridge
rules to pick which of the computations you actually wanted. I am not sure if
the DFA proposal gets around this kind of problem though it sounds like it would
be pretty similar.

Vanessa Kosoy's Shortform

So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is and your quantilization fraction is then the AI's probability of corrupt... (read more)
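A sketch of the quantilization bound being discussed (my illustration; the symbol names ε and δ are my stand-ins for the quantities elided above): sample from the top δ-fraction of the base (user) distribution ranked by the proxy utility, and even if an adversary places all ε-mass of corrupt actions at the top of the ranking, the quantilizer's corruption probability is at most ε/δ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Base distribution: the user's policy, here uniform over N actions.
N = 10_000
proxy_utility = rng.normal(size=N)   # learned/reported proxy utility

# An eps-fraction of actions is "corrupt" (proxy loves them, true
# outcome is bad).  Worst case: they top the proxy ranking.
eps = 0.001
corrupt = np.zeros(N, dtype=bool)
corrupt[np.argsort(proxy_utility)[-int(eps * N):]] = True

# Quantilizer: uniform over the top delta-fraction of the base
# distribution, ranked by proxy utility.
delta = 0.01
support = np.argsort(proxy_utility)[-int(delta * N):]

p_corrupt = corrupt[support].mean()  # corruption probability = 0.1
# Guarantee: p_corrupt <= eps / delta, however the corrupt actions
# are placed -- hence wanting delta much larger than eps ("much more
# than twice").
```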


Inframeasures and Domain Theory

Virtually all the credit for this post goes to Alex, I think the proof of Proposition 1 was more or less my only contribution.

Vanessa Kosoy's Shortform

The distribution is the user's policy, and the utility function for this purpose is the *eventual success probability* estimated by the user (as part of the timeline report), in the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it, for example I did it for MDPs.

Oh, right, that makes a lot of sense.
So is the general idea that we quantilize such that we're choosing in
expectation an action that doesn't have corrupted utility (by intuitively having
something like more than twice as many actions in the quantilization than we
expect to be corrupted), so that we guarantee the probability of following the
manipulation of the learned user report is small?
I also wonder if using the user policy to sample actions isn't limiting, because
then we can only take actions that the user would take. Or do you assume by
default that the support of the user policy is the full action space, so every
action is possible for the AI?

Introduction To The Infra-Bayesianism Sequence

IIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it's not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if distribution is inside set then we have some lower bound on expected utility (and if it's not then we don't promise anything). On the other hand non-crisp gives a lower bound that is *variable* with the true distribution. We can think of non-crisp infradistributions as being *fuzzy* pr... (read more)
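A tiny numpy illustration of the crisp case (my construction): represent a crisp infradistribution by finitely many extreme points of a convex set of distributions; the claim "the true distribution is in the set" buys exactly a worst-case (lower) expectation, and the non-crisp generalization would let that bound vary with the true distribution.

```python
import numpy as np

# Extreme points of a convex set of distributions over 3 outcomes:
# a crisp infradistribution (for this sketch).
credal_set = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
])
utility = np.array([1.0, 0.0, -1.0])

# The guaranteed lower bound on expected utility: worst case over the set.
lower_expectation = (credal_set @ utility).min()

# Convexity: any mixture of members achieves at least this bound.
mix = credal_set.mean(axis=0)
mix_expectation = mix @ utility
```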

Introduction To The Infra-Bayesianism Sequence

There is some truth in that, in the sense that, your beliefs must take a form that is *learnable* rather than just a god-given system of logical relationships.

Introduction To The Infra-Bayesianism Sequence

Am I right though that in the case of e.g. Newcomb's problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)?

Yes


I guess my question is more like: shouldn't there be some aspect of reality that
determines what my set of a-measures is? It feels like here we're finding a set
of a-measures that rationalizes my behavior, as opposed to choosing a set of
a-measures based on the "facts" of the situation and then seeing what behavior
that implies.
I feel like we agree on what the technical math says, and I'm confused about the
philosophical implications. Maybe we should just leave the philosophy alone for
a while.

Introduction To The Infra-Bayesianism Sequence

it's basically trying to think about the statistics of environments rather than their internals

That's not really true because the structure of infra-environments reflects the structure of those Newcombian scenarios. This means that the *sample complexity* of learning them will likely scale with their intrinsic complexity (e.g. some analogue of VC dimension). This is different from treating the environment as a black-box and converging to optimal behavior by pure trial and error, which would yield much worse sample complexity.

I agree that infra-bayesianism isn't just thinking about sampling properties,
and maybe 'statistics' is a bad word for that. But the failure on transparent
Newcomb without kind of hacky changes to me suggests a focus on "what actions
look good thru-out the probability distribution" rather than on "what
logically-causes this program to succeed".

Introduction To The Infra-Bayesianism Sequence

The central problem of embedded agency (see "Embedded Agents") is that there is no clean separation between an agent and its environment...

That's certainly one way to motivate IB, however I'd like to note that even if there *was* a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity (in particular this must be the case if the environment contains other agents of similar or greater complexity).

... (read more)The contribution of infra-Baye

Yeah, agreed. I'm intentionally going for a simplified summary that sacrifices
details like this for the sake of cleaner narrative.
Ah, whoops. Live and learn.
Okay, that part makes sense. Am I right though that in the case of e.g.
Newcomb's problem, if you use the anti-Nirvana trick (getting -infinity reward
if the prediction is wrong), then you would still recover the same behavior
(EDIT: if you also use best-case reasoning instead of worst-case reasoning)? (I
think I was a bit too focused on the specific UDT / Nirvana trick ideas.)
Yeah... I'm a bit confused about this. If you imagine choosing any concave
expectation functional, then I agree that can model basically any type of risk
aversion. But it feels like your infra-distribution should "reflect reality" or
something along those lines, which is an extra constraint. If there's a "reflect
reality" constraint and a "risk aversion" constraint and these are completely
orthogonal, then it seems like you can't necessarily satisfy both constraints at
the same time.
On the other hand, maybe if I thought about it for longer, I'd realize that the
things we think of as "risk aversion" are actually identical to the "reflect
reality" constraint when we are allowed to have Knightian uncertainty over some
properties of the environment. In that case I would no longer have my objection.
To be a bit more concrete: imagine that you know that the even bits in an
infinite bitsequence come from a fair coin, but the odd bits come from some
other agent, where you can't model them exactly but you have some suspicion that
they are a bit more likely to choose 1 over 0. Risk aversion might involve
making a small bet that you'd see a 1 rather than a 0 in some specific odd bit
(smaller than what EU maximization / Bayesian decision theory would recommend),
but "reflecting reality" might recommend having Knightian uncertainty about the
output of the agent which would mean never making a bet on the outputs of the
odd bits.
I am curious

Formal Solution to the Inner Alignment Problem

It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn't mean that I think reality probably works that way. So I don't see how to salvage this kind of argument.

I think it works differently. What you should get is an infra-Bayesian hypothesis which models only those parts of reality that can be modeled within the given computing resources. More generally, if you don't end... (read more)

Formal Solution to the Inner Alignment Problem

Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. ~~In the latter case, shouldn't be large.~~ In the former case, it means that we are overwhelming likely to actually be inside a malign simulation. But, then AI risk is the least of our troubles. (In particular... (read more)

It seems like the simplest algorithm that makes good predictions and runs on
your computer is going to involve e.g. reasoning about what aspects of reality
are important to making good predictions and then attending to those. But that
doesn't mean that I think reality probably works that way. So I don't see how to
salvage this kind of argument.
It seems to me like this requires a very strong match between the priors we
write down and our real priors. I'm kind of skeptical about that a priori, but
then in particular we can see lots of ways in which attackers will be exploiting
failures in the prior we write down (e.g. failure to update on logical facts
they observe during evolution, failure to make the proper anthropic update, and
our lack of philosophical sophistication meaning that we write down some
obviously "wrong" universal prior).
Do we have any idea how to write down such an algorithm though? Even granting
that the malign hypothesis does so it's not clear how we would (short of being
fully epistemically competitive); but moreover it's not clear to me the malign
hypothesis faces a similar version of this problem since it's just thinking
about a small list of hypotheses rather than trying to maintain a broad enough
distribution to find all of them, and beyond that it may just be reasoning
deductively about properties of the space of hypotheses rather than using a
simple algorithm we can write down.

Vanessa Kosoy's Shortform

I retracted part of that, see the edit.

Vanessa Kosoy's Shortform

*Probably not too original but I haven't seen it clearly written anywhere*.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

**Amplifying by objective time:** The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to in... (read more)

More observations about this attack vector ("attack from counterfactuals"). I
focus on "amplifying by subjective time".
* The harder the takeoff the more dangerous this attack vector: During every
simulation cycle, ability to defend against simulated malign AI depends on
the power of the defense system in the beginning of the cycle[1]. On the
other hand, the capability of the attacker
depends on its power in the end of the cycle. Therefore, if power grows very
fast this is bad news for the defender. On the other hand, if power grows
very slowly, the defender wins because defense is probably easier.
* Inner control of anchor makes system safer: Given a fixed objective time
anchor, the more times the simulation goes through a cycle, the higher its
exposure to the threat, since improving the defense system takes time even if
you have all the knowledge. Therefore, we want to give the simulated user the
ability to move the objective time anchor forward in time, in order to
benefit from improvements in the defense system.
* Additional information about the external world makes system safer: Suppose
our subjective probability of a malign AI successfully attacking during a
single simulation cycle is 0.1%. Then performing 1000 cycles superficially
yields a failure probability of about 63%. However, this is only the case if
attacks on different cycles are independent random variables. If the AI knows
as much about the possibility of such an attack as we do, this is a justified
assumption. However, if the AI knows more, then, from our subjective
perspective, these variables are positively correlated. In particular, if the
AI knows with certainty whether a successful attack would occur or not the
failure probability over 1000 cycles goes back to 0.1%. Therefore, it is
beneficial to give the AI extra information about the world, just so it can
estimate the probability of attack with m
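As a sanity check on the arithmetic in the third bullet (a 0.1% per-cycle attack probability over 1000 cycles):

```python
p_single = 0.001   # subjective chance of a successful attack per cycle
n_cycles = 1000

# Independent cycles: risk compounds, 1 - 0.999^1000, roughly 1 - 1/e.
p_fail_independent = 1 - (1 - p_single) ** n_cycles

# Perfectly correlated attacks (the AI already knows whether the attack
# would succeed): one draw decides every cycle, so the failure
# probability collapses back to the single-cycle value.
p_fail_correlated = p_single
```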

I think this would make a good top-level post. I have the feeling I’ll want to
link to it later.

HCH Speculation Post #2A

[**EDIT**: After thinking about this some more, I realized that malign AI leakage is a bigger problem than I thought when writing the parent comment, because the way I imagined it can be overcome doesn't work that well.]

I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow.

I don't think t... (read more)

Intermittent Distillations #1

I think the Armstrong and Mindermann NFL result is very weak. Obviously inferring values requires assuming that the planning algorithm is trying to maximize those values *in some sense*, and they don't have such an assumption. IMO my AIT definition of intelligence shows a clear path to solving this. That said I'm not at all sure this is enough to get alignment without full access to the human policy.

HCH Speculation Post #2A

IMO the strongest argument in favor of imitation-based solutions is: if there is any way to solve the alignment problem which we can plausibly come up with, then a sufficiently reliable and amplified imitation of us will also find this solution. So, if the imitation is doomed to produce a bad solution or end up in a strange attractor, then our own research is also doomed in the same way anyway. Any counter-argument to this would probably have to be based on one of the following:

- Maybe the imitation is different from how we do alignment research in importa

Yeah, I agree with this. It's certainly possible to see normal human passage
through time as a process with probable attractors. I think the biggest
differences are that HCH is a psychological "monoculture," HCH has tiny
bottlenecks through which to pass messages compared to the information I can
pass to my future self, and there's some presumption that the output will be "an
answer" whereas I have no such demands on the brain-state I pass to tomorrow.
If we imagine actual human imitations I think all of these problems have fairly
obvious solutions, but I think the problems are harder to solve if you want IDA
approximations of HCH. I'm not totally sure what you meant by the confidence
thresholds link - was it related to this?
The monoculture problem seems like it should increase the size ("size" meaning
attraction basin, not measure of the equilibrium set), lifetime, and weirdness
of attractors, while the restrictions and expectations on message-passing seem
like they might shift the distribution away from "normal" human results.
But yeah, in theory we could use imitation humans to do any research we could
do ourselves. I think that gets into the relative difficulty of super-speed
imitations of humans doing alignment research versus transformative AI, which
I'm not really an expert in.

Vanessa Kosoy's Shortform

I propose a new formal desideratum for alignment: the *Hippocratic principle*. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the *user's* beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as... (read more)
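One way to write the informal principle as an inequality (my own gloss, not necessarily the intended formalization; the symbols below are illustrative):

```latex
% Hippocratic principle, informal gloss:
%   \xi            -- the user's beliefs about the environment
%   U              -- the user's utility function
%   \pi_{\mathrm{AI}}   -- what happens if the user runs the AI
%   \pi_{\mathrm{user}} -- what happens if the user acts unassisted
\mathbb{E}_{\xi}\!\left[U \mid \pi_{\mathrm{AI}}\right]
\;\ge\;
\mathbb{E}_{\xi}\!\left[U \mid \pi_{\mathrm{user}}\right]
```

The key feature is that the expectation is taken with respect to the *user's* beliefs, not the AI's or the true environment's.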

12moI don't understand what you mean here by quantilizing. The meaning I know is to
sample a random action from among the top α fraction of actions under a given
base distribution. But I don't see a distribution here, or even a clear
ordering over actions (given that we don't have access to the utility function).
I'm probably missing something obvious, but more details would really help.
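For reference, the standard construction being described, sampling from the top-α slice of a base distribution ranked by utility, can be sketched as follows (function and parameter names are my own, not from the thread):

```python
import random

def quantilize(actions, base_probs, utility, alpha, rng=random):
    """Sample an action from the top-alpha fraction (by utility) of the
    base distribution, renormalizing the base probabilities over that set."""
    # Rank actions from best to worst according to the utility function.
    ranked = sorted(zip(actions, base_probs),
                    key=lambda ap: utility(ap[0]), reverse=True)
    # Keep adding actions until we have accumulated alpha base-probability mass.
    kept, mass = [], 0.0
    for action, p in ranked:
        kept.append((action, p))
        mass += p
        if mass >= alpha:
            break
    # Renormalize over the kept set and sample.
    total = sum(p for _, p in kept)
    return rng.choices([a for a, _ in kept],
                       weights=[p / total for _, p in kept])[0]
```

With α = 1 this degenerates to sampling the base distribution; as α shrinks it approaches argmax, which is what makes the ordering and the base distribution both load-bearing.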

"Beliefs" vs. "Notions"

Hmm. I didn't encounter this terminology before, but, given a graph and a factor you can consider the convex hull of all probability distributions compatible with this graph and factor (i.e. all probability distributions obtained by assigning other factors to the other cliques in the graph). This is a crisp infradistribution. So, in this sense you can say factors are a special case of infradistributions (although I don't know how much information this transformation loses).

It's more natural to consider, instead of a factor, either the marginal probability ... (read more)
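The construction described above (fix one factor, vary all the others, take the convex hull of the resulting distributions) can be approximated numerically. A toy sketch under my own simplified setup, two binary variables with a fixed pairwise factor and random positive unary factors:

```python
import itertools
import random

def credal_set_samples(phi, n_samples=1000, rng=random):
    """Approximate the set of joint distributions over two binary variables
    X, Y compatible with a fixed pairwise factor phi[x][y]: sample random
    positive unary factors for X and Y, multiply, and normalize.  The convex
    hull of these points approximates the crisp infradistribution."""
    points = []
    for _ in range(n_samples):
        psi_x = [rng.random() + 1e-6 for _ in range(2)]
        psi_y = [rng.random() + 1e-6 for _ in range(2)]
        unnorm = {(x, y): phi[x][y] * psi_x[x] * psi_y[y]
                  for x, y in itertools.product(range(2), repeat=2)}
        z = sum(unnorm.values())
        points.append({xy: w / z for xy, w in unnorm.items()})
    return points
```

Each sampled point is a bona fide joint distribution consistent with the fixed factor; how much of the original factor's information the hull retains is exactly the open question raised above.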

"Beliefs" vs. "Notions"

The concept of infradistribution was defined here (Definition 7) although for the current purpose it's sufficient to use crisp infradistributions (Definition 9 here, it's just a compact convex set of probability distributions). Sharp infradistributions (Definition 10 here) are the special case of "pure (2)". I also talked about the connection to formal logic here.

12moThanks! Quick question: how do you think these notions compare to factors in an
undirected graphical model? (This is the closest thing I know of to how I
imagine "notions" being formalized).

"Beliefs" vs. "Notions"

Infradistributions are exactly the natural hybrid of the two. I also associate (2) with formal logic.

12moCool! Can you give a more specific link please?

Formal Solution to the Inner Alignment Problem

When talking about uniform (worst-case) bounds, realizability just means the true environment is in the hypothesis class, but in a Bayesian setting (like in the OP) it means that our bounds scale with the probability of the true environment in the prior. Essentially, it means we can pretend the true environment was sampled from the prior. So, if (by design) training works by sampling environments from the prior, and (by realizability) deployment *also* consists of sampling an environment from the same prior, training and deployment are indistinguishable.
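The way Bayesian bounds scale with the prior mass of the true environment can be seen in a toy simulation (my own illustrative setup, not from the thread): the Bayes mixture's excess cumulative log-loss over the true environment is bounded by −log of the true environment's prior probability.

```python
import math
import random

def bayes_log_loss_bound_demo(seed=0, T=200):
    """Toy check that the Bayes mixture's excess cumulative log-loss over the
    true environment is at most -log(prior mass of the true environment).
    Hypotheses: coins with the biases below; the true coin is sampled from
    the prior, mirroring the 'deployment samples from the same prior' point."""
    rng = random.Random(seed)
    biases = [0.2, 0.5, 0.8]
    prior = [0.1, 0.6, 0.3]
    true_idx = rng.choices(range(3), weights=prior)[0]
    posterior = prior[:]
    excess = 0.0
    for _ in range(T):
        x = 1 if rng.random() < biases[true_idx] else 0
        probs = [b if x else 1 - b for b in biases]
        # Mixture probability of the observed bit, then excess log-loss.
        mix = sum(w * p for w, p in zip(posterior, probs))
        excess += math.log(probs[true_idx]) - math.log(mix)
        # Bayes update.
        posterior = [w * p / mix for w, p in zip(posterior, probs)]
    return excess, -math.log(prior[true_idx])
```

The bound follows because the mixture likelihood dominates the true environment's likelihood weighted by its prior; lower prior mass on the truth means a weaker (larger) bound, which is the scaling referred to above.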

22moSure—by that definition of realizability, I agree that's where the difficulty
is. Though I would seriously question the practical applicability of such an
assumption.

Formal Solution to the Inner Alignment Problem

Okay, but why? I think that the reason you have this intuition is, the realizability assumption is false. But then you should concede that the main weakness of the OP is the realizability assumption rather than the difference between deep learning and Bayesianism.

22moPerhaps I just totally don't understand what you mean by realizability, but I
fail to see how realizability is relevant here. As I understand it,
realizability just says that the true model has some non-zero prior
probability—but that doesn't matter (at least for the MAP, which I think is a
better model than the full posterior for how SGD actually works) as long as
there's some deceptive model with greater prior probability that's
indistinguishable on the training distribution, as in my simple toy model from
earlier.

Formal Solution to the Inner Alignment Problem

I understand how this model explains why agents become unaligned under distributional shift. That's something I never disputed. However, I don't understand how this model applies to my proposal. In my proposal, there is *no* distributional shift, because (thanks to the realizability assumption) the real environment is sampled from the same prior that is used for training. The model can't choose to act deceptively during training, because it can't distinguish between training and deployment. Moreover, the objective I described is *not* complicated.

22moYeah, that's a fair objection—my response to that is just that I think that
preventing a model from being able to distinguish training and deployment is
likely to be impossible for anything competitive.

Formal Solution to the Inner Alignment Problem

The reward function I was sketching earlier is not complex. Moreover, if you can encode $U_{deceptive}$ then you can encode $U_{aligned}$, no reason why the latter should have much greater complexity. If you can't encode $U_{aligned}$ but can only encode something that *produces* $U_{aligned}$, then by the same token you can encode something that produces $U_{deceptive}$ (I don't think that's even a meaningful distinction tbh). I think it would help if you could construct a simple mathematical toy model of your reasoning here?

22moHere's a simple toy model. Suppose you have two agents that internally compute
their actions as follows (perhaps with actual argmax replaced with some smarter
search algorithm, but still basically structured as below):
$M_{deceptive}(x) = \operatorname{argmax}_a \mathbb{E}\left[\sum_i U_{deceptive}(s_i) \mid a\right]$
$M_{aligned}(x) = \operatorname{argmax}_a \mathbb{E}\left[\sum_i U_{aligned}(s_i) \mid a\right]$
Then, comparing the K-complexity of the two models, we get
$K(M_{aligned}) - K(M_{deceptive}) \approx K(U_{aligned}) - K(U_{deceptive})$
and the problem becomes that both Mdeceptive and Maligned will produce behavior
that looks aligned on the training distribution, but Ualigned has to be much
more complex. To see this, note that essentially any Udeceptive will yield good
training performance because the model will choose to act deceptively during
training, whereas if you want to get good training performance without
deception, then Ualigned has to actually encode the full objective, which is
likely to make it quite complicated.

Formal Solution to the Inner Alignment Problem

Why? It seems like $M$ would have all the knowledge required for achieving good rewards of the good type, so simulating $M$ should not be more difficult than achieving good rewards of the good type.

22moIt's not that simulating M is difficult, but that encoding a utility function
for some complex goal is difficult, whereas encoding one for a random, simple
goal is easy.

Formal Solution to the Inner Alignment Problem

IIUC, you're saying something like: suppose trained-$A$ computes the source code of the complex core of $B$ and then runs it. But then, define $B'$ as: compute the source code of the complex core of $B$ (in the same way $A$ does it) and use it to implement $B$. $B'$ is equivalent to $B$ and has about the same complexity as trained-$A$.

Or, from a slightly different angle: if "think up good strategies for achieving $X$" is powerful enough to come up with $B$, then "think up good strategies for achieving [reward of the type I defined earlier]" is powerful enough to come up with $B$ .... (read more)

22moI think that “think up good strategies for achieving [reward of the type I
defined earlier]” is likely to be much, much more complex (making it much more
difficult to achieve with a local search process) than an arbitrary goal X for
most sorts of rewards that we would actually be happy with AIs achieving.

Formal Solution to the Inner Alignment Problem

Hmm, sorry, I'm not following. What exactly do you mean by "inference-time" and "derived"? By assumption, when you run $A$ on some sequence it effectively simulates $B$, which runs the complex core of $B$. So, $A$ trained on that sequence effectively contains the complex core of $B$ as a subroutine.

22moA's structure can just be “think up good strategies for achieving X, then do
those,” with no explicit subroutine that you can find anywhere in A's weights
that you can copy over to B.

Formal Solution to the Inner Alignment Problem

Sure, but then I think B is likely to be significantly more complex and harder for a local search process to find than A.

The search process is sufficiently powerful to select $A$, which *contains* the complex part of $B$. It seems rather implausible that an algorithm of the same power cannot select $B$.

if we look at actually successful current architectures, like CNNs or transformers, they're designed to work well on specific types of data and relationships that are common in our world—but not necessarily at all just in a general simplicity prior.

CNNs *are* specific in some wa... (read more)

32moA's weights do not contain the complex part of B—deception is an inference-time
phenomenon. It's very possible for complex instrumental goals to be derived from
a simple structure such that a search process is capable of finding that simple
structure that yields those complex instrumental goals without being able to
find a model with those complex instrumental goals hard-coded as terminal goals.

Formal Solution to the Inner Alignment Problem

If we let A = SGD or A = evolution, your first claim becomes “if SGD/evolution finds a malign model, it must understand that it's malign on some level,” which seems just straightforwardly incorrect.

To me it seems straightforwardly correct! Suppose you're running evolution in order to predict a sequence. You end up evolving a mind M that is a superintelligent malign consequentialist: it makes good predictions on purpose, in order to survive, and then produces a malign false prediction at a critical moment. So, M is *part of the state of your algorithm* A. ... (read more)

32moSure, but then I think B is likely to be significantly more complex and harder
for a local search process to find than A.
I definitely don't think this, unless you have a very strong (and likely
unachievable imo) definition of “good.”
I guess I'm skeptical that you can do all that much in the fully generic setting
of just trying to predict a simplicity prior. For example, if we look at
actually successful current architectures, like CNNs or transformers, they're
designed to work well on specific types of data and relationships that are
common in our world—but not necessarily at all just in a general simplicity
prior.

Formal Solution to the Inner Alignment Problem

Hmm, maybe you misunderstood my proposal? I suggested to train the model by meta-learning on purely synthetic data, sampled from some kind of simplicity prior, without any interface to the world. Maybe you just think this wouldn't be competitive? If so, why? Is the argument just that there are no existing systems like this? But then it's weak evidence at best. On the contrary, even from a purely capability standpoint, meta-learning with synthetic data might be a promising strategy to lower deep learning's sample complexity.

Here's why I think it *will* be com... (read more)
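The kind of purely synthetic training data being proposed could be generated along these lines. A sketch under my own assumptions: random small finite-state sequence generators, weighted toward fewer states, stand in for a bounded simplicity prior.

```python
import random

def sample_synthetic_environment(rng, max_states=8):
    """Sample a synthetic environment from a crude bounded 'simplicity prior':
    environments are random deterministic finite-state bit-sequence generators,
    and fewer states (shorter descriptions) are exponentially more likely."""
    # P(n states) proportional to 2^-n.
    weights = [2.0 ** -n for n in range(1, max_states + 1)]
    n = rng.choices(range(1, max_states + 1), weights=weights)[0]
    transition = [[rng.randrange(n) for _ in range(2)] for _ in range(n)]
    emission = [rng.randrange(2) for _ in range(n)]

    def generate(length, state=0):
        """Deterministically emit a bit sequence from the sampled machine."""
        out = []
        for _ in range(length):
            bit = emission[state]
            out.append(bit)
            state = transition[state][bit]
        return out

    return generate
```

A meta-learning loop would then train the predictor across many such sampled environments, with no interface to the real world during training.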

22moThis seems very sketchy to me. If we let A = SGD or A = evolution, your first
claim becomes “if SGD/evolution finds a malign model, it must understand that
it's malign on some level,” which seems just straightforwardly incorrect. The
last claim also seems pretty likely to be wrong if you let C = SGD or C =
evolution.
Moreover, it definitely seems like training on data sampled from a simplicity
prior (if that's even possible—it should be uncomputable in general) is unlikely
to help at all. I think there's essentially no realistic way that training on
synthetic data like that will be sufficient to produce a model which is capable
of accomplishing things in the world. At best, that sort of approach might give
you better inductive biases in terms of incentivizing the right sort of
behavior, but in practice I expect any impact there to basically just be swamped
by the impact of fine-tuning on real-world data.

Formal Solution to the Inner Alignment Problem

I expect it to be extremely difficult if not impossible to prevent a model that you want to be able to operate in the real world from being able to determine at what point in training/deployment it's in.

Why do you expect it? During training, it finds itself in a random universe. During deployment, it finds itself in another universe drawn from the same prior (the realizability assumption). How would it determine the difference?

42moBecause that's never what machine learning actually looks like in
practice—essentially any access to text from the internet (let alone actual
ability to interface with the world, both of which seem necessary for
competitive systems) will let it determine things like the year, whether
RSA-2048 has been factored, or other various pieces of information that are
relevant to what stage in training/testing/deployment it's in, how powerful the
feedback mechanisms keeping it in check are, whether other models are defecting,
etc. that can let it figure out when to defect.

Formal Solution to the Inner Alignment Problem

I think that adversarial training working so well that it can find exponentially rare failures is an unnecessarily strong desideratum. We need to drive the probability of catastrophic failures to something very low, but not actually zero. If a system is going to work for time $T$ during deployment, then running it in $N$ different synthetic environments for time $T$ during training is enough to drive the probability of failure down to $O(1/N)$. Now, this is prohibitively costly, but it's not *exponentially* prohibitively costly. And this doesn't use adversarial traini... (read more)
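The counting argument here, that surviving training on many synthetic environments bounds the rate of *random* defection, can be checked with a toy simulation. The setup is illustrative (my own, with a uniform prior over per-step defection rates):

```python
import random

def random_defector(defect_prob, rng):
    """A model that defects independently at each time step with fixed probability."""
    return lambda: rng.random() < defect_prob

def survives_training(model, n_envs, steps_per_env):
    """The model passes training iff it never defects across n_envs synthetic runs."""
    return not any(model() for _ in range(n_envs * steps_per_env))

def deployment_failure_rate(n_envs, steps, trials=4000, seed=0):
    """Among models that survive training, estimate the chance of defecting
    during a deployment period as long as one training environment."""
    rng = random.Random(seed)
    failures = survivors = 0
    for _ in range(trials):
        p = rng.random()  # unknown per-model defection rate (uniform prior)
        model = random_defector(p, rng)
        if survives_training(model, n_envs, steps):
            survivors += 1
            if any(model() for _ in range(steps)):
                failures += 1
    return failures / max(survivors, 1)
```

Analytically the surviving population fails at rate roughly $1/(N+1)$ for $N$ training environments, matching the $O(1/N)$ scaling; note this covers random defection only, not the conditional defection raised in the reply below.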

32moI agree that random defection can potentially be worked around—but the RSA-2048
problem is about conditional defection, which can't be dealt with in the same
way. More generally, I expect it to be extremely difficult if not impossible to
prevent a model that you want to be able to operate in the real world from being
able to determine at what point in training/deployment it's in.

Formal Solution to the Inner Alignment Problem

We're doing *meta-learning*. During training, the network is not learning about the real world, it's learning how to be a safe predictor. It's interacting with a *synthetic* environment, so a misprediction doesn't have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.

Formal Solution to the Inner Alignment Problem

Right, but you can look at the performance of your model in training, compare it to the theoretical optimum (and to the baseline of making no predictions at all), and get lots of evidence about safety from that. You can even add some adversarial training of the synthetic environment in order to get tighter bounds. If on the vast majority of synthetic environments your model makes virtually no mispredictions, then, under the realizability assumption, it is very unlikely to make mispredictions in deployment. Ofc the realizability assumption should also be questioned: but that's true in the OP as well, so it's not a difference between Bayesianism and deep learning.

22moNote that adversarial training doesn't work on deceptive models due to the
RSA-2048 problem
[https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d]; also see
more detail here
[https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment#Hard_cases]
.
I think realizability basically doesn't help here—as long as there's a deceptive
model which is easier to find according to your inductive biases, the fact that
somewhere in the model space there exists a correct model, but not one that your
local search process finds by default, is cold comfort.

12moWhat's the distinction between training and deployment when the model can always
query for more data?

Formal Solution to the Inner Alignment Problem

You have no guarantees, sure, but that's a problem with deep learning in general and not just inner alignment. The point is, if your model is *not* optimizing that reward function then *its performance during training will be suboptimal*. To the extent your algorithm is able to approximate the true optimum during training, it will behave safely during deployment.

23moI think you are understanding inner alignment very differently than we define it
in Risks from Learned Optimization
[https://www.alignmentforum.org/s/r9tYkB2a8Fp4DN8yB], where we introduced the
term.
This is not true for deceptively aligned models, which is the situation I'm most
concerned about, and—as we argue extensively in Risks from Learned Optimization
[https://www.alignmentforum.org/s/r9tYkB2a8Fp4DN8yB]—there are a lot of reasons
why a model might end up pursuing a simpler/faster/easier-to-find proxy even if
that proxy yields suboptimal training performance.

Formal Solution to the Inner Alignment Problem

Here's another way you can try implementing this approach with deep learning. Train the predictor using meta-learning on synthetically generated environments (sampled from some reasonable prior such as bounded Solomonoff or maybe ANNs with random weights). The reward for making a prediction depends on the probability the predictor assigned to the realized outcome and on the true probability of that outcome, with a couple of tunable parameters; the reward for making no prediction (i.e. querying the user) is a fixed parameter.

This particular proposal is probably not quite right, ... (read more)
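One concrete shape such a reward could take (my guess at the structure, with hypothetical parameters; a log-scoring reward for committing to a prediction, plus a fixed abstention payoff):

```python
import math

def prediction_reward(predicted_prob, outcome, alpha=1.0):
    """Log-scoring reward for committing to a prediction of a binary outcome:
    near zero when the realized outcome got high probability, very negative
    when it got probability near zero.  alpha (hypothetical) scales the term."""
    p = predicted_prob if outcome == 1 else 1.0 - predicted_prob
    return alpha * math.log(max(p, 1e-12))

# Hypothetical fixed payoff for abstaining: scores like a 50/50 guess.
ABSTAIN_REWARD = math.log(0.5)

def step_reward(predicted_prob, outcome):
    """Committing beats abstaining only when the predictor is confident
    enough; passing None means 'query the user' and earns the fixed payoff."""
    if predicted_prob is None:
        return ABSTAIN_REWARD
    return prediction_reward(predicted_prob, outcome)
```

Under this shape, the optimal policy predicts only when its confidence makes the expected log-score beat the abstention payoff, which is the safe-predictor behavior the meta-training is meant to select for.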

33moSure, but you have no guarantee that the model you learn is actually going to be
optimizing anything like that reward function—that's the whole point of the
inner alignment problem. What's nice about the approach in the original paper is
that it keeps a bunch of different models around, keeps track of their
posterior, and only acts on consensus, ensuring that the true model always has
to approve. But if you just train a single model on some reward function like
that with deep learning, you get no such guarantees.

Formal Solution to the Inner Alignment Problem

Hmm, added to reading list, thank you.

Formal Solution to the Inner Alignment Problem

Here's the sketch of a solution to the query complexity problem.

**Simplifying technical assumptions:**

- The action space is $\{0,1\}$
- All hypotheses are deterministic
- Predictors output maximum likelihood predictions instead of sampling the posterior

I'm pretty sure removing those is mostly just a technical complication.

**Safety assumptions:**

- The real hypothesis has prior probability lower bounded by some known quantity $\delta$, so we discard all hypotheses of prior probability less than $\delta$ from the outset.
- Malign hypotheses have total prior probability mass upper bounded by some $\epsilon$
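A sketch of the resulting consensus rule (my own rendering, following the notation of the replies below, where δ lower-bounds the real hypothesis' prior and ε upper-bounds the total malign mass):

```python
def consensus_action(hypotheses, delta, epsilon):
    """Consensus decision rule sketch.  Each hypothesis is (prior_weight,
    action) with action in {0, 1}.  Hypotheses with prior below delta are
    discarded; an action is output only if its surviving weight exceeds the
    alternative's by more than epsilon (the bound on malign mass), so malign
    hypotheses alone can never swing the output.  None means 'defer/query'."""
    surviving = [(w, a) for w, a in hypotheses if w >= delta]
    p0 = sum(w for w, a in surviving if a == 0)
    p1 = sum(w for w, a in surviving if a == 1)
    if p0 > p1 + epsilon:
        return 0
    if p1 > p0 + epsilon:
        return 1
    return None  # no epsilon-robust consensus: fall back to the user
```

The safety argument is that whenever the rule commits to an action, the benign hypotheses alone already favored it, so the output matches the malign-free setting.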

62moI agree that this settles the query complexity question for Bayesian predictors
and deterministic humans.
I expect it can be generalized to have complexity $O(\epsilon^2/\delta^2)$ in
the case with stochastic humans where treacherous behavior can take the form of
small stochastic shifts.
I think that the big open problems for this kind of approach to inner alignment
are:
* Is $\epsilon/\delta$ bounded? I assign significant probability to it being
$2^{100}$ or more, as mentioned in the other thread between me and Michael
Cohen, in which case we'd have trouble. (I believe this is also Eliezer's
view.)
* It feels like searching for a neural network is analogous to searching for a
MAP estimate, and that more generally efficient algorithms are likely to just
run one hypothesis most of the time rather than running them all. Then this
algorithm increases the cost of inference by $\epsilon/\delta$, which could
be a big problem.
* As you mention, is it safe to wait and defer or are we likely to have a
correlated failure in which all the aligned systems block simultaneously?
(e.g. as described here
[https://ordinaryideas.wordpress.com/2015/11/30/driving-fast-in-the-counterfactual-loop/]
)

33moThis is very nice and short!
And to state what you left implicit:
If $p_0 > p_1 + \epsilon$, then in the setting with no malign hypotheses (which
you assume to be safe), 0 is definitely the output, since the malign models can
only shift the outcome by $\epsilon$, so we assume it is safe to output 0. And
likewise with outputting 1.
One general worry I have about assuming that the deterministic case extends
easily to the stochastic case is that a sequence of probabilities that tends to
0 can still have an infinite sum, which is not true when probabilities must be
in $\{0,1\}$, and this sometimes causes trouble. I'm not sure this would raise
any issues here--just registering a slightly differing intuition.

Formal Solution to the Inner Alignment Problem

I mean, there's obviously a lot more work to do, but this is progress. Specifically, if SGD is MAP then it seems plausible that e.g. SGD + random initial conditions or simulated annealing would give you something like the top N posterior models. You can also extract confidence from the NNGP.
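In practice the "SGD + random initial conditions as approximate top-N posterior models" idea corresponds to deep ensembles. A minimal sketch of using ensemble agreement as a confidence signal (illustrative names, not a specific method from the thread):

```python
import random

def train_ensemble(fit, data, n_models=5, seed=0):
    """Train n_models copies of the same learner from different random
    initializations, as a crude stand-in for the top-N posterior models.
    `fit` is any callable (data, rng) -> predictor."""
    return [fit(data, random.Random(seed + i)) for i in range(n_models)]

def ensemble_predict(models, x):
    """Return (majority prediction, agreement fraction as confidence).
    Low agreement plays the role of low posterior consensus: a natural
    trigger for deferring instead of committing to a prediction."""
    votes = [m(x) for m in models]
    top = max(set(votes), key=votes.count)
    return top, votes.count(top) / len(votes)
```

The reply's point about low variance is exactly the failure mode of this sketch: if the ensemble members are highly correlated, agreement overstates confidence.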

33moI agree that this is progress (now that I understand it better), though:
I think there is strong evidence that the behavior of models trained via the
same basic training process is likely to be highly correlated. This sort of
correlation is related to low variance in the bias-variance tradeoff sense, and
there is evidence [https://arxiv.org/abs/2002.11328] that not only do massive
neural networks tend to have pretty low variance, but that this variance is
likely to continue to decrease as networks become larger.

Yes, I think TRL captures this notion. You have some Knightian uncertainty about the world, and some Knightian uncertainty about the result of a computation, and the two are entangled.