All of Vanessa Kosoy's Comments + Replies

My Current Take on Counterfactuals

In particular, it's easy to believe that some computation knows more than you.

Yes, I think TRL captures this notion. You have some Knightian uncertainty about the world, and some Knightian uncertainty about the result of a computation, and the two are entangled.

My Current Take on Counterfactuals

I lean towards some kind of finitism or constructivism, and am skeptical of utility functions which involve unbounded quantifiers. But also, how does LI help with the procrastination paradox? I don't think I've seen this result.

My Current Take on Counterfactuals

Yes, I'm pretty sure we have that kind of completeness. Obviously representing all hypotheses in this opaque form would give you poor sample and computational complexity, but you can do something midway: use black-box programs as components in your hypothesis but also have some explicit/transparent structure.

Updating the Lottery Ticket Hypothesis

IIUC, here's a simple way to test this hypothesis: initialize a random neural network, and then find the minimal loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. doesn't require heuristic gradient descent): for square loss it's just solving a large linear system once, for many other losses it should amount to convex optimization for which we have provable efficient algorithms. And, I guess it's underdetermined so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I'm guessing some of the linked papers might have done something like this?

johnswentworth (25d): This basically matches my current understanding. (Though I'm not strongly confident in my current understanding.) I believe the GP results are basically equivalent to this, but I haven't read up on the topic enough to be sure.
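A hypothetical minimal version of the experiment Vanessa sketches above might look like the Python sketch below. The architecture, toy data, and ridge regularizer are placeholder choices, not anything from the thread; a real test would compare the resulting loss against the same network trained normally in parameter space.

```python
# Hypothetical sketch of the proposed test: take a randomly initialized (tiny) network,
# linearize it around its initialization, and fit only the tangent-space coefficients with a
# single regularized least-squares solve.  Architecture, data, and the ridge term are toys.
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-16-1 tanh network with parameters packed into one flat vector.
D_IN, HIDDEN = 1, 16
shapes = [(D_IN, HIDDEN), (HIDDEN,), (HIDDEN, 1), (1,)]
sizes = [int(np.prod(s)) for s in shapes]

def unpack(theta):
    parts, i = [], 0
    for s, n in zip(shapes, sizes):
        parts.append(theta[i:i + n].reshape(s))
        i += n
    return parts

def f(theta, X):
    W1, b1, W2, b2 = unpack(theta)
    return (np.tanh(X @ W1 + b1) @ W2 + b2).squeeze(-1)

theta0 = rng.normal(size=sum(sizes)) * 0.5           # random initialization

# Toy regression data.
X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X).squeeze(-1) + 0.1 * rng.normal(size=200)

# Tangent-space features: J[i, j] = d f(theta0, x_i) / d theta_j (finite differences for brevity).
eps = 1e-5
f0 = f(theta0, X)
J = np.stack([(f(theta0 + eps * np.eye(len(theta0))[j], X) - f0) / eps
              for j in range(len(theta0))], axis=1)

# Minimal-loss point in the tangent space for square loss: one ridge-regularized linear solve.
lam = 1e-3
dtheta = np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), J.T @ (y - f0))

preds = f0 + J @ dtheta                               # predictions of the linearized model
print("tangent-space train MSE:", np.mean((preds - y) ** 2))
```

Because the linearized model is linear in the parameter offset, the square-loss optimum really is a single linear solve, which is the point of the proposed test.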
My Current Take on Counterfactuals

So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters.

I'm not convinced this is the right desideratum for that purpose. Why should we care about exploitability by traders if making such trades is not actually possible given the environment and the utility function? IMO epistemic rationality is subservient to instrumental rationality, so our desiderata should be derived from the latter.

Human value-uncertainty is not particularly well-captured by Bayesian uncertainty, as I imag

... (read more)
Abram Demski (23d): It's clear that you understand logical induction pretty well, so while I feel like you're missing something, I'm not clear on what that could be. I think maybe the more fruitful branch of this conversation (as opposed to me trying to provide an instrumental justification for radical probabilism, though I'm still interested in that) is the question of describing the human utility function. The logical induction picture isn't strictly at odds with a platonic utility function, I think, since we can consider the limit. (I only claim that this isn't the best way to think about it in general, since Nature didn't decide a platonic utility function for us and then design us such that our reasoning has the appropriate limit.) For example, one case which to my mind argues in favor of the logical induction approach to preferences: the procrastination paradox. All you want to do is ensure that the button is pressed at some point. This isn't a particularly complex or unrealistic preference for an agent to have. Yet, it's unclear how to make computable beliefs think about this appropriately. Logical induction provides a theory about how to think about this kind of goal. (I haven't thought much about how TRL would handle it.) Agree or disagree: agents can sensibly pursue $\Delta_2$ objectives? And, do you think that question is cruxy for you?
Abram Demski (24d): So, one point is that the InfraBayes picture still gives epistemics an important role: the kind of guarantee arrived at is a guarantee that you won't do too much worse than the most useful partial model expects. So, we can think about generalized partial models which update by thinking longer in addition to taking in sense-data. I suppose TRL can model this by observing what those computations would say, in a given situation, and using partial models which only "trust computation X" rather than having any content of their own. Is this "complete" in an appropriate sense? Can we always model a would-be radical-infrabayesian as a TRL agent observing what that radical-infrabayesian would think? Even if true, there may be a significant computational complexity gap between just doing the thing vs modeling it in this way.
Abram Demski (1mo): This does make sense to me, and I view it as a weakness of the idea. However, the productivity of dutch-book type thinking in terms of implying properties which seem appealing for other reasons speaks heavily in favor of it, in my mind. A formal connection to more pragmatic criteria would be great. But also, maybe I can articulate a radical-probabilist position without any recourse to dutch books... I'll have to think more about that. I'm not sure how to double crux with this intuition, unfortunately. When I imagine the perspective you describe, I feel like it's rolling all dynamic inconsistency into time-preference and ignoring the role of deliberation. My claim is that there is a type of change-over-time which is due to boundedness, and which looks like "dynamic inconsistency" from a classical bayesian perspective, but which isn't inherently dynamically inconsistent. EG, if you "sleep on it" and wake up with a different, firmer-feeling perspective, without any articulable thing you updated on. (My point isn't to dogmatically insist that you haven't updated on anything, but rather, to point out that it's useful to have the perspective where we don't need to suppose there was evidence which justifies the update as Bayesian, in order for it to be rational.)
My Current Take on Counterfactuals

I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don't know what would result.

From a radical-probabilist perspective, the complaint would be that Turing RL still uses the InfraBayesian update rule, which might not always be necessary to be rational (the same way Bayesian updates aren't always necessary).

Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).

Abram Demski (1mo): I agree that radical probabilism can be thought of as bayesian-with-a-side-channel, but it's nice to have a more general characterization where the side channel is black-box, rather than an explicit side-channel which we explicitly update on. This gives us a picture of the space of rational updates. EG, the logical induction criterion allows for a large space of things to count as rational. We get to argue for constraints on rational behavior by pointing to the existence of traders which enforce those constraints, while being agnostic about what's going on inside a logical inductor. So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters. Here's an argument for why this is an important dimension to consider:

  1. Human value-uncertainty is not particularly well-captured by Bayesian uncertainty, as I imagine you'll agree. One particular complaint is realizability: we have no particular reason to assume that human preferences are within any particular space of hypotheses we can write down.
  2. One aspect of this can be captured by InfraBayes: it allows us to eliminate the realizability assumption, instead only assuming that human preferences fall within some set of constraints which we can describe.
  3. However, there is another aspect to human preference-uncertainty: human preferences change over time. Some of this is irrational, but some of it is legitimate philosophical deliberation.
  4. And, somewhat in the spirit of logical induction, humans do tend to eventually address the most egregious irrationalities.
  5. Therefore, I tend to think that toy models of alignment (such as CIRL, DRL, DIRL) should model the human as a radical probabilist; not because it's a perfect model, but because it constitutes a major incremental improvement wrt modeling what kind of uncertainty humans have over our own preferences. Recognizing preferences as a thing which...
My Current Take on Counterfactuals

I only skimmed this post for now, but a few quick comments on links to infra-Bayesianism:

InfraBayes doesn’t seem to have that worry, since it applies to non-realizable cases. (Or does it? Is there some kind of non-oscillation guarantee? Or is non-oscillation part of what it means for a set of environments to be learnable -- IE it can oscillate in some cases?)... AFAIK the conditions for learnability in the InfraBayes case are still pretty wide open.

It's true that these questions still need work, but I think it's rather clear that something like "there ... (read more)

Is there a way to operationalize "respecting logic"? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

"Respect logic" means either (a) assigning probability one to tautologies (at least, to those which can be proved in some bounded proof-length, or something along those lines), or, (b) assigning probability zero to contradictions (again, modulo boundedness). These two properties should be basically equivalent (ie, imply each other) provided the proof system is consistent. If it's inconsistent, they i... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

From your reply to Paul, I understand your argument to be something like the following:

  1. Any solution to single-single alignment will involve a tradeoff between alignment and capability.
  2. If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
  3. If AI systems are designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
  4. Given the technical knowledge to design c
... (read more)
Formal Solution to the Inner Alignment Problem

I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect ... (read more)

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual "unipolar" sense. These management assistants, DAOs etc are not aligned to the goals of their respective, individual users/owners.

I do see two reasons why multipolar scenarios might require more technical research:

  1. Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient wa
... (read more)
Andrew Critch (1mo): I don't mean to say this post warrants a new kind of AI alignment research, and I don't think I said that, but perhaps I'm missing some kind of subtext I'm inadvertently sending? I would say this post warrants research on multi-agent RL and/or AI social choice and/or fairness and/or transparency, none of which are "new kinds" of research (I promoted them heavily in my preceding post), and none of which I would call "alignment research" (though I'll respect your decision to call all these topics "alignment" if you consider them that). I would say, and I did say: I do hope that the RAAP concept can serve as a handle for noticing structure in multi-agent systems, but again I don't consider this a "new kind of research", only an important/necessary/neglected kind of research for the purposes of existential safety. Apologies if I seemed more revolutionary than intended. Perhaps it's uncommon to take a strong position of the form "X is necessary/important/neglected for human survival" without also saying "X is a fundamentally new type of thinking that no one has done before", but that is indeed my stance for X ∈ {a variety of non-alignment AI research areas [https://www.lesswrong.com/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1]}.
Andrew Critch (1mo): How are you inferring this? From the fact that a negative outcome eventually obtained? Or from particular misaligned decisions each system made? It would be helpful if you could point to a particular single-agent decision in one of the stories that you view as evidence of that single agent being highly misaligned with its user or creator. I can then reply with how I envision that decision being made even with high single-agent alignment. Yes, this^.
Formal Solution to the Inner Alignment Problem

Is $\epsilon/\delta$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

Yes, you're right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the "bridge rules" by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here's the sketch of a proposal how to solve this. Let's construct our prior to be ... (read more)

Paul Christiano (1mo): I broadly think of this approach as "try to write down the 'right' universal prior." I don't think the bridge rules / importance-weighting consideration is the only way in which our universal prior is predictably bad. There are also issues like anthropic update and philosophical considerations about what kind of "programming language" to use and so on. I'm kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit. I guess you just need to get close enough that $\epsilon/\delta$ is manageable but I think I still find it scary (and don't totally remember all my sources of concern). I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior. That is, if someone inside one of our hypotheses has noticed that e.g. a certain class of decisions is more important and so they will simulate only those situations, then we should also notice this and by the same token care more about our decision if we are in one of those situations (rather than using a universal prior without importance weighting). My sense is that without competitiveness we are in trouble anyway on other fronts, and so it is probably also reasonable to think of this as a first-line defense against this kind of issue. This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with "giant" universes that do all the possible computations you would want, and then using the "free" complexity in the bridge rules to pick which of the computations you actually wanted. I am not sure if the DFA proposal gets around this kind of problem though it sounds like it would be pretty similar.
Vanessa Kosoy's Shortform

So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is and your quantilization fraction is then the AI's probability of corrupt... (read more)

Vanessa Kosoy's Shortform

More observations about this attack vector ("attack from counterfactuals"). I focus on "amplifying by subjective time".

  • The harder the takeoff the more dangerous this attack vector: During every simulation cycle, ability to defend against simulated malign AI depends on the power of the defense system in the beginning of the cycle[1]. On the other hand, the capability of the attacker depends on its power in the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defende
... (read more)
Inframeasures and Domain Theory

Virtually all the credit for this post goes to Alex, I think the proof of Proposition 1 was more or less my only contribution.

Vanessa Kosoy's Shortform

The distribution is the user's policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report), in the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it, for example I did it for MDPs.
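For concreteness, here is a minimal one-shot quantilizer of the kind referenced above, with the user policy as the base distribution and the user's success estimate as the utility. The action set, the numbers, and the fraction q are illustrative stand-ins, not part of Vanessa's construction.

```python
# Toy one-shot quantilizer: condition the base distribution (the user policy) on the top-q
# fraction of actions, ranked by estimated utility.  All numbers are illustrative.
import numpy as np

def quantilize(base_probs: np.ndarray, utilities: np.ndarray, q: float) -> np.ndarray:
    """Base distribution conditioned on the top-q fraction (by base-probability mass) of
    actions ranked by estimated utility.  Any single event then gets at most 1/q times its
    base probability -- which is why q must greatly exceed the corruption probability."""
    order = np.argsort(-utilities)        # best actions first
    out = np.zeros_like(base_probs)
    remaining = q
    for i in order:
        take = min(base_probs[i], remaining)
        out[i] = take
        remaining -= take
        if remaining <= 1e-12:
            break
    return out / q

user_policy = np.array([0.40, 0.30, 0.20, 0.08, 0.02])   # toy user policy over 5 actions
success_est = np.array([0.50, 0.70, 0.60, 0.90, 0.95])   # user's estimated success probability
print(quantilize(user_policy, success_est, q=0.3))
```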

Adam Shimi (1mo): Oh, right, that makes a lot of sense. So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small? I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?
Introduction To The Infra-Bayesianism Sequence

IIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it's not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if distribution is inside set then we have some lower bound on expected utility (and if it's not then we don't promise anything). On the other hand non-crisp gives a lower bound that is variable with the true distribution. We can think of non-crisp infradistributions as being fuzzy pr... (read more)
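As a gloss on the crisp/non-crisp distinction (my paraphrase of the definitions in the sequence, not a new claim): in the a-measure formulation, the value an infradistribution $H$ guarantees for a utility function $U$ is

$\underline{E}_H[U] = \inf_{(m,b)\in H}\big(m(U) + b\big),$

and the crisp case is the special case where every element is a bare probability distribution ($b = 0$, total mass $1$), which recovers the worst-case bound $\min_{\mu \in C} E_\mu[U]$ over a convex set $C$.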

Introduction To The Infra-Bayesianism Sequence

There is some truth in that, in the sense that your beliefs must take a form that is learnable rather than just a god-given system of logical relationships.

Introduction To The Infra-Bayesianism Sequence

Am I right though that in the case of e.g. Newcomb's problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)?

Yes

imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can't model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve m

... (read more)
Rohin Shah (2mo): I guess my question is more like: shouldn't there be some aspect of reality that determines what my set of a-measures is? It feels like here we're finding a set of a-measures that rationalizes my behavior, as opposed to choosing a set of a-measures based on the "facts" of the situation and then seeing what behavior that implies. I feel like we agree on what the technical math says, and I'm confused about the philosophical implications. Maybe we should just leave the philosophy alone for a while.
Introduction To The Infra-Bayesianism Sequence

it's basically trying to think about the statistics of environments rather than their internals

That's not really true because the structure of infra-environments reflects the structure of those Newcombian scenarios. This means that the sample complexity of learning them will likely scale with their intrinsic complexity (e.g. some analogue of RVO dimension). This is different from treating the environment as a black-box and converging to optimal behavior by pure trial and error, which would yield much worse sample complexity.

DanielFilan (2mo): I agree that infra-bayesianism isn't just thinking about sampling properties, and maybe 'statistics' is a bad word for that. But the failure on transparent Newcomb without kind of hacky changes to me suggests a focus on "what actions look good thru-out the probability distribution" rather than on "what logically-causes this program to succeed".
Introduction To The Infra-Bayesianism Sequence

The central problem of embedded agency (see "Embedded Agents") is that there is no clean separation between an agent and its environment...

That's certainly one way to motivate IB, however I'd like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity (in particular this must be the case if the environment contains other agents of similar or greater complexity).

The contribution of infra-Baye

... (read more)
Rohin Shah (2mo): Yeah, agreed. I'm intentionally going for a simplified summary that sacrifices details like this for the sake of cleaner narrative. Ah, whoops. Live and learn. Okay, that part makes sense. Am I right though that in the case of e.g. Newcomb's problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)? (I think I was a bit too focused on the specific UDT / Nirvana trick ideas.) Yeah... I'm a bit confused about this. If you imagine choosing any concave expectation functional, then I agree that can model basically any type of risk aversion. But it feels like your infra-distribution should "reflect reality" or something along those lines, which is an extra constraint. If there's a "reflect reality" constraint and a "risk aversion" constraint and these are completely orthogonal, then it seems like you can't necessarily satisfy both constraints at the same time. On the other hand, maybe if I thought about it for longer, I'd realize that the things we think of as "risk aversion" are actually identical to the "reflect reality" constraint when we are allowed to have Knightian uncertainty over some properties of the environment. In that case I would no longer have my objection. To be a bit more concrete: imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can't model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve making a small bet that you'd see a 1 rather than a 0 in some specific odd bit (smaller than what EU maximization / Bayesian decision theory would recommend), but "reflecting reality" might recommend having Knightian uncertainty about the output of the agent which would mean never making a bet on the outputs of the odd bits. I am curious...
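To make the even/odd-bits example concrete, here is how a worst-case (Knightian) evaluation of a small bet on an odd bit compares to a Bayesian point-estimate evaluation; the credal interval and payoffs are my own toy numbers.

```python
# Toy comparison for the odd-bit bet above: Knightian (worst case over a credal interval for
# the agent's bias) vs. Bayesian (a single point estimate).  Numbers are illustrative.
P_LOW, P_HIGH = 0.5, 0.7      # all we assume: P(odd bit = 1) lies somewhere in [0.5, 0.7]
P_POINT = 0.6                 # a Bayesian point estimate within that range
STAKE = 1.0                   # win +1 if the bit is 1, lose 1 otherwise

def bet_value(p):             # expected value of the bet, linear in p
    return p * STAKE - (1 - p) * STAKE

print("Bayesian value:  ", bet_value(P_POINT))                        # +0.2 -> take the bet
print("worst-case value:", min(bet_value(P_LOW), bet_value(P_HIGH)))  # 0.0 -> never strictly favored
```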
Formal Solution to the Inner Alignment Problem

It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn't mean that I think reality probably works that way. So I don't see how to salvage this kind of argument.

I think it works differently. What you should get is an infra-Bayesian hypothesis which models only those parts of reality that can be modeled within the given computing resources. More generally, if you don't end... (read more)

Formal Solution to the Inner Alignment Problem

Is $\epsilon/\delta$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble.

I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don't. In the latter case, $\epsilon/\delta$ shouldn't be large. In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation. But, then AI risk is the least of our troubles. (In particular... (read more)

Paul Christiano (2mo): It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn't mean that I think reality probably works that way. So I don't see how to salvage this kind of argument. It seems to me like this requires a very strong match between the priors we write down and our real priors. I'm kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously "wrong" universal prior). Do we have any idea how to write down such an algorithm though? Even granting that the malign hypothesis does so it's not clear how we would (short of being fully epistemically competitive); but moreover it's not clear to me the malign hypothesis faces a similar version of this problem since it's just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them, and beyond that it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.
Vanessa Kosoy's Shortform

I retracted part of that, see the edit.

Vanessa Kosoy's Shortform

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to in... (read more)

Vanessa Kosoy (1mo): More observations about this attack vector ("attack from counterfactuals"). I focus on "amplifying by subjective time".

  • The harder the takeoff the more dangerous this attack vector: During every simulation cycle, ability to defend against simulated malign AI depends on the power of the defense system in the beginning of the cycle[1]. On the other hand, the capability of the attacker depends on its power in the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.
  • Inner control of anchor makes system safer: Given a fixed objective time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective time anchor forward in time, in order to benefit from improvements in the defense system.
  • Additional information about the external world makes system safer: Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 63%. However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with m...
Alex Turner (2mo): I think this would make a good top-level post. I have the feeling I'll want to link to it later.
HCH Speculation Post #2A

[EDIT: After thinking about this some more, I realized that malign AI leakage is a bigger problem than I thought when writing the parent comment, because the way I imagined it can be overcome doesn't work that well.]

I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow.

I don't think t... (read more)

Intermittent Distillations #1

I think the Armstrong and Mindermann NFL result is very weak. Obviously inferring values requires assuming that the planning algorithm is trying to maximize those values in some sense, and they don't have such an assumption. IMO my AIT definition of intelligence shows a clear path to solving this. That said I'm not at all sure this is enough to get alignment without full access to the human policy.

HCH Speculation Post #2A

IMO the strongest argument in favor of imitation-based solutions is: if there is any way to solve the alignment problem which we can plausibly come up with, then a sufficiently reliable and amplified imitation of us will also find this solution. So, if the imitation is doomed to produce a bad solution or end up in a strange attractor, then our own research is also doomed in the same way anyway. Any counter-argument to this would probably have to be based on one of the following:

  • Maybe the imitation is different from how we do alignment research in importa
... (read more)
Charlie Steiner (2mo): Yeah, I agree with this. It's certainly possible to see normal human passage through time as a process with probable attractors. I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow. If we imagine actual human imitations I think all of these problems have fairly obvious solutions, but I think the problems are harder to solve if you want IDA approximations of HCH. I'm not totally sure what you meant by the confidence thresholds link - was it related to this? The monoculture problem seems like it should increase the size ("size" meaning attraction basin, not measure of the equilibrium set), lifetime, and weirdness of attractors, while the restrictions and expectations on message-passing seem like they might shift the distribution away from "normal" human results. But yeah, in theory we could use imitation humans to do any research we could do ourselves. I think that gets into the relative difficulty of super-speed imitations of humans doing alignment research versus transformative AI, which I'm not really an expert in.
Vanessa Kosoy's Shortform

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as... (read more)

Adam Shimi (2mo): I don't understand what you mean here by quantilizing. The meaning I know is to take a random action over the top $\alpha$ actions, on a given base distribution. But I don't see a distribution here, or even a clear ordering over actions (given that we don't have access to the utility function). I'm probably missing something obvious, but more details would really help.
"Beliefs" vs. "Notions"

Hmm. I didn't encounter this terminology before, but, given a graph and a factor you can consider the convex hull of all probability distributions compatible with this graph and factor (i.e. all probability distributions obtained by assigning other factors to the other cliques in the graph). This is a crisp infradistribution. So, in this sense you can say factors are a special case of infradistributions (although I don't know how much information this transformation loses).

It's more natural to consider, instead of a factor, either the marginal probability ... (read more)
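A toy numerical version of the construction described above, under illustrative choices of mine (a three-node chain A–B–C, one fixed factor on {A,B}, random factors on {B,C}): the set of compatible joints is approximated by sampling, and a worst-case expectation over that set is then just a minimum.

```python
# Toy version of "fix one factor, sweep the others": the resulting set of joint distributions
# is treated as a crisp infradistribution, and a worst-case expectation is a min over the set.
import numpy as np

rng = np.random.default_rng(0)

phi_AB = np.array([[2.0, 1.0],      # the fixed factor phi(A, B) on clique {A, B}
                   [1.0, 3.0]])

def joint(psi_BC):
    """p(a, b, c) proportional to phi(a, b) * psi(b, c)."""
    p = phi_AB[:, :, None] * psi_BC[None, :, :]
    return p / p.sum()

# Sample many compatible distributions by drawing random nonnegative factors psi(B, C).
compatible = [joint(rng.exponential(size=(2, 2))) for _ in range(2000)]

U = rng.normal(size=(2, 2, 2))       # an arbitrary utility over outcomes (a, b, c)
values = [float((p * U).sum()) for p in compatible]
print("worst-case E[U] over compatible joints (sampled approximation):", min(values))
print("spread across the set:", max(values) - min(values))
```

Since expectations are linear in the distribution, the minimum over the convex hull equals the minimum over the sampled points, so sampling only under-covers the extreme factors rather than the geometry.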

"Beliefs" vs. "Notions"

The concept of infradistribution was defined here (Definition 7) although for the current purpose it's sufficient to use crisp infradistributions (Definition 9 here, it's just a compact convex set of probability distributions). Sharp infradistributions (Definition 10 here) are the special case of "pure (2)". I also talked about the connection to formal logic here.

David Krueger (2mo): Thanks! Quick question: how do you think these notions compare to factors in an undirected graphical model? (This is the closest thing I know of to how I imagine "notions" being formalized).
"Beliefs" vs. "Notions"

Infradistributions are exactly the natural hybrid of the two. I also associate (2) with formal logic.

David Krueger (2mo): Cool! Can you give a more specific link please?
Formal Solution to the Inner Alignment Problem

When talking about uniform (worst-case) bounds, realizability just means the true environment is in the hypothesis class, but in a Bayesian setting (like in the OP) it means that our bounds scale with the probability of the true environment in the prior. Essentially, it means we can pretend the true environment was sampled from the prior. So, if (by design) training works by sampling environments from the prior, and (by realizability) deployment also consists of sampling an environment from the same prior, training and deployment are indistinguishable.

Evan Hubinger (2mo): Sure—by that definition of realizability, I agree that's where the difficulty is. Though I would seriously question the practical applicability of such an assumption.
Formal Solution to the Inner Alignment Problem

Okay, but why? I think that the reason you have this intuition is, the realizability assumption is false. But then you should concede that the main weakness of the OP is the realizability assumption rather than the difference between deep learning and Bayesianism.

Evan Hubinger (2mo): Perhaps I just totally don't understand what you mean by realizability, but I fail to see how realizability is relevant here. As I understand it, realizability just says that the true model has some non-zero prior probability—but that doesn't matter (at least for the MAP, which I think is a better model than the full posterior for how SGD actually works) as long as there's some deceptive model with greater prior probability that's indistinguishable on the training distribution, as in my simple toy model from earlier.
Formal Solution to the Inner Alignment Problem

I understand how this model explains why agents become unaligned under distributional shift. That's something I never disputed. However, I don't understand how this model applies to my proposal. In my proposal, there is no distributional shift, because (thanks to the realizability assumption) the real environment is sampled from the same prior that is used for training. The model can't choose to act deceptively during training, because it can't distinguish between training and deployment. Moreover, the objective I described is not complicated.

Evan Hubinger (2mo): Yeah, that's a fair objection—my response to that is just that I think that preventing a model from being able to distinguish training and deployment is likely to be impossible for anything competitive.
Formal Solution to the Inner Alignment Problem

The reward function I was sketching earlier is not complex. Moreover, if you can encode then you can encode , no reason why the latter should have much greater complexity. If you can't encode but can only encode something that produces , then by the same token you can encode something that produces (I don't think that's even a meaningful distinction tbh). I think it would help if you could construct a simple mathematical toy model of your reasoning here?

Evan Hubinger (2mo): Here's a simple toy model. Suppose you have two agents that internally compute their actions as follows (perhaps with actual argmax replaced with some smarter search algorithm, but still basically structured as below):

$M_{\text{deceptive}}(x) = \operatorname{argmax}_a \mathbb{E}\left[\textstyle\sum_i U_{\text{deceptive}}(s_i) \mid a\right]$
$M_{\text{aligned}}(x) = \operatorname{argmax}_a \mathbb{E}\left[\textstyle\sum_i U_{\text{aligned}}(s_i) \mid a\right]$

Then, comparing the K-complexity of the two models, we get

$K(M_{\text{aligned}}) - K(M_{\text{deceptive}}) \approx K(U_{\text{aligned}}) - K(U_{\text{deceptive}})$

and the problem becomes that both $M_{\text{deceptive}}$ and $M_{\text{aligned}}$ will produce behavior that looks aligned on the training distribution, but $U_{\text{aligned}}$ has to be much more complex. To see this, note that essentially any $U_{\text{deceptive}}$ will yield good training performance because the model will choose to act deceptively during training, whereas if you want to get good training performance without deception, then $U_{\text{aligned}}$ has to actually encode the full objective, which is likely to make it quite complicated.
Formal Solution to the Inner Alignment Problem

Why? It seems like $M$ would have all the knowledge required for achieving good rewards of the good type, so simulating $M$ should not be more difficult than achieving good rewards of the good type.

Evan Hubinger (2mo): It's not that simulating M is difficult, but that encoding for some complex goal is difficult, whereas encoding for a random, simple goal is easy.
Formal Solution to the Inner Alignment Problem

IIUC, you're saying something like: suppose trained- computes the source code of the complex core of and then runs it. But then, define as: compute the source code of the complex core of (in the same way does it) and use it to implement . is equivalent to and has about the same complexity as trained-.

Or, from a slightly different angle: if "think up good strategies for achieving " is powerful enough to come up with , then "think up good strategies for achieving [reward of the type I defined earlier]" is powerful enough to come up with .... (read more)

Evan Hubinger (2mo): I think that "think up good strategies for achieving [reward of the type I defined earlier]" is likely to be much, much more complex (making it much more difficult to achieve with a local search process) than an arbitrary goal X for most sorts of rewards that we would actually be happy with AIs achieving.
Formal Solution to the Inner Alignment Problem

Hmm, sorry, I'm not following. What exactly do you mean by "inference-time" and "derived"? By assumption, when you run on some sequence it effectively simulates which runs the complex core of . So, trained on that sequence effectively contains the complex core of as a subroutine.

Evan Hubinger (2mo): A's structure can just be "think up good strategies for achieving X, then do those," with no explicit subroutine that you can find anywhere in A's weights that you can copy over to B.
Formal Solution to the Inner Alignment Problem

Sure, but then I think B is likely to be significantly more complex and harder for a local search process to find than A.

is sufficiently powerful to select which contains the complex part of . It seems rather implausible that an algorithm of the same power cannot select .

if we look at actually successful current architectures, like CNNs or transformers, they're designed to work well on specific types of data and relationships that are common in our world—but not necessarily at all just in a general simplicity prior.

CNNs are specific in some wa... (read more)

Evan Hubinger (2mo): A's weights do not contain the complex part of B—deception is an inference-time phenomenon. It's very possible for complex instrumental goals to be derived from a simple structure such that a search process is capable of finding that simple structure that yields those complex instrumental goals without being able to find a model with those complex instrumental goals hard-coded as terminal goals.
Formal Solution to the Inner Alignment Problem

If we let A = SGD or A = evolution, your first claim becomes “if SGD/evolution finds a malign model, it must understand that it's malign on some level,” which seems just straightforwardly incorrect.

To me it seems straightforwardly correct! Suppose you're running evolution in order to predict a sequence. You end up evolving a mind M that is a superintelligent malign consequentialist: it makes good predictions on purpose, in order to survive, and then produces a malign false prediction at a critical moment. So, M is part of the state of your algorithm A. ... (read more)

Evan Hubinger (2mo): Sure, but then I think B is likely to be significantly more complex and harder for a local search process to find than A. I definitely don't think this, unless you have a very strong (and likely unachievable imo) definition of "good." I guess I'm skeptical that you can do all that much in the fully generic setting of just trying to predict a simplicity prior. For example, if we look at actually successful current architectures, like CNNs or transformers, they're designed to work well on specific types of data and relationships that are common in our world—but not necessarily at all just in a general simplicity prior.
Formal Solution to the Inner Alignment Problem

Hmm, maybe you misunderstood my proposal? I suggested to train the model by meta-learning on purely synthetic data, sampled from some kind of simplicity prior, without any interface to the world. Maybe you just think this wouldn't be competitive? If so, why? Is the argument just that there are no existing systems like this? But then it's weak evidence at best. On the contrary, even from a purely capability standpoint, meta-learning with synthetic data might be a promising strategy to lower deep learning's sample complexity.

Here's why I think it will be com... (read more)

Evan Hubinger (2mo): This seems very sketchy to me. If we let A = SGD or A = evolution, your first claim becomes "if SGD/evolution finds a malign model, it must understand that it's malign on some level," which seems just straightforwardly incorrect. The last claim also seems pretty likely to be wrong if you let C = SGD or C = evolution. Moreover, it definitely seems like training on data sampled from a simplicity prior (if that's even possible—it should be uncomputable in general) is unlikely to help at all. I think there's essentially no realistic way that training on synthetic data like that will be sufficient to produce a model which is capable of accomplishing things in the world. At best, that sort of approach might give you better inductive biases in terms of incentivizing the right sort of behavior, but in practice I expect any impact there to basically just be swamped by the impact of fine-tuning on real-world data.
Formal Solution to the Inner Alignment Problem

I expect it to be extremely difficult if not impossible to prevent a model that you want to be able to operate in the real world from being able to determine at what point in training/deployment it's in.

Why do you expect it? During training, it finds itself in random universe. During deployment, it finds itself in another universe drawn from the same prior (the realizability assumption). How would it determine the difference?

Evan Hubinger (2mo): Because that's never what machine learning actually looks like in practice—essentially any access to text from the internet (let alone actual ability to interface with the world, both of which seem necessary for competitive systems) will let it determine things like the year, whether RSA-2048 has been factored, or other various pieces of information that are relevant to what stage in training/testing/deployment it's in, how powerful the feedback mechanisms keeping it in check are, whether other models are defecting, etc. that can let it figure out when to defect.
Formal Solution to the Inner Alignment Problem

I think that adversarial training working so well that it can find exponentially rare failures is an unnecessarily strong desideratum. We need to drive the probability of catastrophic failures to something very low, but not actually zero. If a system is going to work for time during deployment, then running it in different synthetic environments for time during training is enough to drive the probability of failure down to . Now, this is prohibitively costly, but it's not exponentially prohibitively costly. And this doesn't use adversarial traini... (read more)
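One way to make the scaling concrete (my own gloss; the exact quantities in the comment were lost in extraction): if the trained predictor's per-episode failure probability were $p$, then

$\Pr[\text{no failure across } N \text{ independent training episodes}] = (1-p)^N \approx e^{-pN},$

so any predictor that passes training with non-negligible probability has $p \lesssim 1/N$. Driving the deployment failure probability down to roughly $1/N$ therefore costs a factor of about $N$ in training time: expensive, but not exponentially so, and without relying on adversarial training.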

Evan Hubinger (2mo): I agree that random defection can potentially be worked around—but the RSA-2048 problem is about conditional defection, which can't be dealt with in the same way. More generally, I expect it to be extremely difficult if not impossible to prevent a model that you want to be able to operate in the real world from being able to determine at what point in training/deployment it's in.
Formal Solution to the Inner Alignment Problem

We're doing meta-learning. During training, the network is not learning about the real world, it's learning how to be a safe predictor. It's interacting with a synthetic environment, so a misprediction doesn't have any catastrophic effects: it only teaches the algorithm that this version of the predictor is unsafe. In other words, the malign subagents have no way to attack during training because they can access little information about what the real universe is like. The training process is designed to select predictors that only make predictions when they can be confident, and the training performance allows us to verify this goal has truly been achieved.

Formal Solution to the Inner Alignment Problem

Right, but you can look at the performance of your model in training, compare it to the theoretical optimum (and to the baseline of making no predictions at all) and get lots of evidence about safety from that. You can even add some adversarial training of the synthetic environment in order to get tighter bounds. If on the vast majority of synthetic environments your model makes virtually no mispredictions, then, under the realizability assumption, it is very unlikely to make mispredictions in deployment. Ofc the realizability assumption should also be questioned: but that's true in the OP as well, so it's not a difference between Bayesianism and deep learning.

Evan Hubinger (2mo): Note that adversarial training doesn't work on deceptive models due to the RSA-2048 problem [https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d]; also see more detail here [https://www.alignmentforum.org/posts/9Dy5YRaoCxH9zuJqa/relaxed-adversarial-training-for-inner-alignment#Hard_cases]. I think realizability basically doesn't help here—as long as there's a deceptive model which is easier to find according to your inductive biases, the fact that somewhere in the model space there exists a correct model, but not one that your local search process finds by default, is cold comfort.
michaelcohen (2mo): What's the distinction between training and deployment when the model can always query for more data?
Formal Solution to the Inner Alignment Problem

You have no guarantees, sure, but that's a problem with deep learning in general and not just inner alignment. The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal. To the extent your algorithm is able to approximate the true optimum during training, it will behave safely during deployment.

Evan Hubinger (3mo): I think you are understanding inner alignment very differently than we define it in Risks from Learned Optimization [https://www.alignmentforum.org/s/r9tYkB2a8Fp4DN8yB], where we introduced the term. This is not true for deceptively aligned models, which is the situation I'm most concerned about, and—as we argue extensively in Risks from Learned Optimization [https://www.alignmentforum.org/s/r9tYkB2a8Fp4DN8yB]—there are a lot of reasons why a model might end up pursuing a simpler/faster/easier-to-find proxy even if that proxy yields suboptimal training performance.
Formal Solution to the Inner Alignment Problem

Here's another way how you can try implementing this approach with deep learning. Train the predictor using meta-learning on synthetically generated environments (sampled from some reasonable prior such as bounded Solomonoff or maybe ANNs with random weights). The reward for making a prediction is , where is the predicted probability of outcome , is the true probability of outcome and are parameters. The reward for making no prediction (i.e. querying the user) is .

This particular proposal is probably not quite right, ... (read more)
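A minimal gloss of the reward shape described above; the exact expression and parameters in the comment were lost in scraping, so the log-scoring form and the constants ALPHA, BETA, and R_ABSTAIN below are placeholders of mine.

```python
# Sketch of the reward shape: committing to a prediction earns a log-scoring-rule reward,
# abstaining (querying the user) earns a fixed intermediate reward, so abstention is optimal
# exactly when confidence is low.  All constants are placeholders, not the original formula.
import math
from typing import Optional

ALPHA, BETA = 1.0, 0.0            # hypothetical scaling/offset for the prediction reward
R_ABSTAIN = math.log(0.75)        # hypothetical abstention reward; sets the confidence threshold

def reward(predicted_prob_of_outcome: Optional[float]) -> float:
    """Reward for one step; None means 'make no prediction / query the user'."""
    if predicted_prob_of_outcome is None:
        return R_ABSTAIN
    return ALPHA * math.log(predicted_prob_of_outcome) + BETA

# With these placeholder numbers, predicting beats abstaining only above ~92% confidence.
for p in (0.6, 0.9, 0.99):
    expected_if_predict = p * reward(p) + (1 - p) * reward(1 - p)
    print(f"confidence {p:.2f}:  E[reward | predict] = {expected_if_predict:+.3f}"
          f"   reward(abstain) = {R_ABSTAIN:+.3f}")
```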

Evan Hubinger (3mo): Sure, but you have no guarantee that the model you learn is actually going to be optimizing anything like that reward function—that's the whole point of the inner alignment problem. What's nice about the approach in the original paper is that it keeps a bunch of different models around, keeps track of their posterior, and only acts on consensus, ensuring that the true model always has to approve. But if you just train a single model on some reward function like that with deep learning, you get no such guarantees.
Formal Solution to the Inner Alignment Problem

Here's the sketch of a solution to the query complexity problem.

Simplifying technical assumptions:

  • The action space is $\{0,1\}$
  • All hypotheses are deterministic
  • Predictors output maximum likelihood predictions instead of sampling the posterior

I'm pretty sure removing those is mostly just a technical complication.

Safety assumptions:

  • The real hypothesis has prior probability lower bounded by some known quantity $\delta$, so we discard all hypotheses of probability less than $\delta$ from the onset.
  • Malign hypotheses have total prior probability mass upper bounded by some $\epsilon$
... (read more)
Paul Christiano (2mo): I agree that this settles the query complexity question for Bayesian predictors and deterministic humans. I expect it can be generalized to have complexity $O(\epsilon^2/\delta^2)$ in the case with stochastic humans where treacherous behavior can take the form of small stochastic shifts. I think that the big open problems for this kind of approach to inner alignment are:

  • Is $\epsilon/\delta$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we'd have trouble. (I believe this is also Eliezer's view.)
  • It feels like searching for a neural network is analogous to searching for a MAP estimate, and that more generally efficient algorithms are likely to just run one hypothesis most of the time rather than running them all. Then this algorithm increases the cost of inference by $\epsilon/\delta$ which could be a big problem.
  • As you mention, is it safe to wait and defer or are we likely to have a correlated failure in which all the aligned systems block simultaneously? (e.g. as described here [https://ordinaryideas.wordpress.com/2015/11/30/driving-fast-in-the-counterfactual-loop/])
michaelcohen (3mo): This is very nice and short! And to state what you left implicit: If $p_0 > p_1 + \epsilon$, then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift the outcome by $\epsilon$, so we assume it is safe to output 0. And likewise with outputting 1. One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends to 0 can still have an infinite sum, which is not true when probabilities must be in $\{0,1\}$, and this sometimes causes trouble. I'm not sure this would raise any issues here--just registering a slightly differing intuition.
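A toy instantiation of the deterministic-case rule spelled out just above; the thresholds DELTA and EPSILON, the hypothesis representation, and the example are illustrative choices of mine.

```python
# Toy decision rule: output a bit only when the posterior margin exceeds the total malign mass
# EPSILON; otherwise query the user and discard falsified hypotheses.  Hypotheses below the
# prior threshold DELTA are discarded from the onset, as in the sketch above.
from typing import Callable, List, Tuple

DELTA, EPSILON = 1e-3, 0.05

def step(hypotheses: List[Tuple[float, Callable[[list], int]]],
         history: list,
         ask_user: Callable[[list], int]) -> Tuple[int, list]:
    """Return (next prediction, updated hypothesis list)."""
    hypotheses = [(w, h) for (w, h) in hypotheses if w >= DELTA]
    p0 = sum(w for (w, h) in hypotheses if h(history) == 0)
    p1 = sum(w for (w, h) in hypotheses if h(history) == 1)
    if p0 > p1 + EPSILON:          # malign mass <= EPSILON cannot flip a margin this large
        return 0, hypotheses
    if p1 > p0 + EPSILON:
        return 1, hypotheses
    answer = ask_user(history)     # too close to call: query the user...
    hypotheses = [(w, h) for (w, h) in hypotheses if h(history) == answer]
    return answer, hypotheses      # ...and drop every hypothesis the answer falsifies

# Tiny usage example: two "benign" hypotheses and one low-weight "malign" one.
hyps = [(0.60, lambda h: 0), (0.37, lambda h: len(h) % 2), (0.03, lambda h: 1)]
pred, hyps = step(hyps, history=[0, 1, 0], ask_user=lambda h: 0)
print(pred, len(hyps))
```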
Formal Solution to the Inner Alignment Problem

I mean, there's obviously a lot more work to do, but this is progress. Specifically if SGD is MAP then it seems plausible that e.g. SGD + random initial conditions or simulated annealing would give you something like top N posterior models. You can also extract confidence from NNGP.
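A rough sketch of the "top-N posterior models, act only on consensus" idea, approximated, as suggested above, by retraining from different random initializations; the model class, data, and consensus threshold are all stand-ins.

```python
# Consensus-of-ensemble with abstention: retrain the same model from several random seeds as a
# crude stand-in for "top N posterior models", and only act when they (nearly) agree.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

ensemble = [
    MLPClassifier(hidden_layer_sizes=(32,), random_state=seed, max_iter=500).fit(X_train, y_train)
    for seed in range(8)   # stand-in for "top N posterior models"
]

def predict_or_abstain(x, threshold=1.0):
    """Predict only when the ensemble is (near-)unanimous; otherwise defer to a human."""
    votes = np.array([m.predict(x.reshape(1, -1))[0] for m in ensemble])
    agreement = max(np.mean(votes == 0), np.mean(votes == 1))
    return int(votes[0]) if agreement >= threshold else None   # None == "query the user"

print(predict_or_abstain(rng.normal(size=10)))
```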

Evan Hubinger (3mo): I agree that this is progress (now that I understand it better), though: I think there is strong evidence that the behavior of models trained via the same basic training process are likely to be highly correlated. This sort of correlation is related to low variance in the bias-variance tradeoff sense, and there is evidence that [https://arxiv.org/abs/2002.11328] not only do massive neural networks tend to have pretty low variance, but that this variance is likely to continue to decrease as networks become larger.