# All of Vanessa Kosoy's Comments + Replies

The Reasonable Effectiveness of Mathematics or: AI vs sandwiches

In this post I speculated on why mathematics is so useful so often, and I still stand behind it. The context, though, is the ongoing debate in the AI alignment community between the proponents of heuristic approaches and empirical research[1] ("prosaic alignment") and the proponents of building foundational theory and mathematical analysis (as exemplified in MIRI's "agent foundations" and my own "learning-theoretic" research agendas).

Previous volleys in this debate include Ngo's "realism about rationality" (on the anti-theory side), the pro... (read more)

Clarifying inner alignment terminology

This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.

In the following, I'll try going over some of the definitions and explicating my unde... (read more)

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc ... (read more)

Inaccessible information

This post defines and discusses an informal notion of "inaccessible information" in AI.

AIs are expected to acquire all sorts of knowledge about the world in the course of their training, including knowledge only tangentially related to their training objective. The author proposes to classify this knowledge into "accessible" and "inaccessible" information. In my own words, information inside an AI is "accessible" when there is a straightforward way to set up a training protocol that will incentivize the AI to reliably and accurately communicate this inform... (read more)

The Solomonoff Prior is Malign

Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence.

I think it's just a symptom of not actually knowing metacosmology. Imagine that metacosmology could explain detailed properties of our laws of physics (such as the precise values of certain constants) via the simulation hypothesis for which no other explanation exists.

my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space

The Solomonoff Prior is Malign

It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.

Why? Maybe you're thinking of UDT? In which case, it's sort of true but IBP is precisely a formalization of UDT + extra nuance regarding the input of the utility function.

I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent.

Well, IBP is explained here. I'm not sure what kind of non-IBP agent yo... (read more)

The Solomonoff Prior is Malign

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members.

Suppose your study of metacosmology makes you highly confident of the following: You are in a simulation. If you don't stab your family members, you and your family members will be sent by the simulators into hell. If you do stab your family members, they will come back to life and all of you will be sent to heaven. Yes, it's still counterintuitive to stab them f... (read more)

Steve Byrnes (4 points, 15d): (Warning: thinking out loud.)

Hmm. Good points. Maybe I have a hard time relating to that specific story because it's hard for me to imagine believing any metacosmological or anthropic argument with >95% confidence. Even if within that argument, everything points to "I'm in a simulation etc.", there's a big heap of "is metacosmology really what I should be thinking about?"-type uncertainty on top. At least for me.

I think "people who do counterintuitive things" for religious reasons usually have more direct motivations—maybe they have mental health issues and think they hear God's voice in their head, telling them to do something. Or maybe they want to fit in, or have other such social motivations, etc.

Hmm, I guess this conversation is moving me towards a position like: "If the AGI thinks really hard about the fundamental nature of the universe / metaverse, anthropics, etc., it might come to have weird beliefs, like e.g. the simulation hypothesis, and honestly who the heck knows what it would do. Better try to make sure it doesn't do that kind of (re)thinking, at least not without close supervision and feedback." Your approach (I think) is instead to plow ahead into the weird world of anthropics, and just try to ensure that the AGI reaches conclusions we endorse. I'm kinda pessimistic about that. For example, your physicalism post was interesting, but my assumption is that the programmers won't have such fine-grained control over the AGI's cognition / hypothesis space. For example, I don't think the genome bakes in one formulation of "bridge rules" over another in humans; insofar as we have (implicit or explicit) bridge rules at all, they emerge from a complicated interaction between various learning algorithms and training data and supervisory signals.

(This gets back to things like whether we can get good hypotheses without a learning agent that's searching for good hypotheses, and whether we can get good updates without a learning agent that's searching for
The Solomonoff Prior is Malign

This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it.

Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of rea... (read more)

Steve Byrnes (2 points, 15d): Thanks for that! But I share with OP the intuition that these are weird failure modes that come from weird reasoning. More specifically, it's weird from the perspective of human reasoning. It seems to me that your story is departing from human reasoning when you say "you possess a great desire to help whomever has summoned you into the world". That's one possible motivation, I suppose. But it wouldn't be a typical human motivation.

The human setup is more like: you get a lot of unlabeled observations and assemble them into a predictive world-model, and you also get a lot of labeled examples of "good things to do", one way or another, and you pattern-match them to the concepts in your world-model. So you wind up having a positive association with "helping El'Azar", i.e. "I want to help El'Azar". AND you wind up with a positive association with "helping my summoner", i.e. "I want to help my summoner". AND you have a positive association with "fixing the cosmos", i.e. "I want to fix the cosmos". Etc.

Normally all those motivations point in the same direction: helping El'Azar = helping my summoner = fixing the cosmos. But sometimes these things come apart, a.k.a. model splintering (https://www.alignmentforum.org/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1). Maybe I come to believe that El'Azar is not "my summoner". You wind up feeling conflicted—you start having ideas that seem good in some respects and awful in other respects. (e.g. "help my summoner at the expense of El'Azar".)

In humans, the concrete and vivid tends to win out over the abstruse hypotheticals—I'm pretty confident that there's no metacosmological argument that will motivate me to stab my family members. Why not? Because rewards tend to pattern-match very strongly to "my family member, who is standing right here in front of me", and tend to pattern-match comparatively weakly to abstract mathematical concepts many steps removed from my experience. So my de
The Solomonoff Prior is Malign

It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that the benign hypothesis description is within 30 bits of the good hypothesis), and embeddedness alone isn't enough to get you there.

Why is embeddedness not enough? Once you don't have bridge rules, what is left is the laws of physics. What does the malign hypothesis explain about the laws of physics that the true hypothesis doesn't explain?

I suspect (but ... (read more)
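As an aside, the "30 bits" in the quoted passage is just the log-odds form of that probability. A quick sanity check (my own sketch, not from the original thread):

```python
import math

# "> 99.9999999% probability" for malign hypotheses means odds of roughly 1e9 : 1.
# Under a Solomonoff-style prior, an odds ratio of 2^k corresponds to a
# description-length gap of about k bits.
p_malign = 0.999999999
odds = p_malign / (1 - p_malign)
bits = math.log2(odds)
print(round(bits))  # prints 30
```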

Paul Christiano (2 points, 15d): It seems like any approach that evaluates policies based on their consequences is fine, isn't it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable. I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?
The Solomonoff Prior is Malign

I'd stand by saying that it doesn't appear to make the problem go away.

Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses

I'm not sure I understand what you mean by "decision-theoretic approach". This attack vector has structure similar to acausal bargaining (between the AI and the attacker), so plausibly some decision theories ... (read more)

Paul Christiano (4 points, 16d): It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don't work once malign hypotheses have > 99.9999999% of probability, so you need to ensure that the benign hypothesis description is within 30 bits of the good hypothesis), and embeddedness alone isn't enough to get you there.

I mean that you have some utility function, are choosing actions based on E[utility|action], and perform Solomonoff induction only instrumentally because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of Solomonoff induction applied to your experiences, e.g. by learning a human, then it seems again vulnerable to attack, bridge hypotheses or no).

I agree that the situation is better when Solomonoff induction is something you are reasoning about rather than an approximate description of your reasoning. In that case it's not completely pathological, but it still seems bad in a similar way to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).
The Solomonoff Prior is Malign

I don't have a clear picture of how handling embeddedness or reflection would make this problem go away, though I haven't thought about it carefully.

Infra-Bayesian physicalism does ameliorate the problem by handling "embeddedness". Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn't get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.

that all said I agree that the malignness of the universal prior is unlikely to

Paul Christiano (4 points, 16d): I agree that removing bridge hypotheses removes one of the advantages for malign hypotheses. I didn't mention this because it doesn't seem like the way in which John is using "embeddedness"; for example, it seems orthogonal to the way in which the situation violates the conditions for Solomonoff induction to be eventually correct. I'd stand by saying that it doesn't appear to make the problem go away.

That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses (since otherwise they also get big benefits from the influence update). And then once you've done that in a sensible way it seems like it also addresses any issues with embeddedness (though maybe we just want to say that those are being solved inside the decision theory). If you want to recover the expected behavior of induction as a component of intelligent reasoning (rather than a component of the utility function + an instrumental step in intelligent reasoning) then it seems like you need a more different tack.

If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources. If you do the same induction but just remove the malign hypotheses, then it seems like you are even dumber and the problem is even worse viewed from the competitiveness perspective.
An Orthodox Case Against Utility Functions

In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility.

To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view.

The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possib... (read more)

Infra-Bayesian physicalism: a formal theory of naturalized induction

No, it's not a baseline, it's just an inequality. Let's do a simple example. Suppose the agent is selfish and cares only about (i) the experience of being in a red room and (ii) the experience of being in a green room. And, let's suppose these are the only two possible experiences, it can't experience going from a room in one color to a room in another color or anything like that (for example, because the agent has no memory). Denote the program corresponding to "the agent deciding on an action after it sees a green room" and the program corresponding ... (read more)

Vanessa Kosoy's Shortform

Yes, there is some similarity! You could say that a Hippocratic AI needs to be continuously non-obstructive w.r.t. the set of utility functions and priors the user could plausibly have, given what the AI knows. Where, by "continuously" I mean that we are allowed to compare keeping the AI on or turning off at any given moment.

Infra-Bayesian physicalism: a formal theory of naturalized induction

Should this say "elements are function... They can be thought of as...?"

Yes, the phrasing was confusing, I fixed it, thanks.

Can you make a similar theory/special case with probability theory, or do you really need infra-bayesianism?

We really need infra-Bayesianism. On Bayesian hypotheses, the bridge transform degenerates: it says that, more or less, all programs are always running. And the counterfactuals degenerate too, because selecting most policies would produce "Nirvana".

The idea is, you must have Knightian uncertainty about the result of a pr... (read more)

Infra-Bayesian physicalism: a formal theory of naturalized induction

Could you explain what the monotonicity principle is, without referring to any symbols or operators?

The loss function of a physicalist agent depends on which computational facts are physically manifest (roughly speaking, which computations the universe runs), and on the computational reality itself (the outputs of computations). The monotonicity principle requires it to be non-decreasing w.r.t. the manifesting of fewer facts. Roughly speaking, the more computations the universe runs, the better.
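To make the shape of this condition concrete, here is a toy sketch (my own illustration, not from the post): represent "which computations the universe runs" as a set, and check that a candidate loss is non-increasing as more computations are manifest (equivalently, non-decreasing as fewer are).

```python
from itertools import chain, combinations

# Hypothetical universe with three possible computations.
programs = ["sim_human_A", "sim_human_B", "weather_model"]

def powerset(items):
    return [frozenset(c) for c in
            chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))]

# Toy physicalist loss: the fewer manifest computations, the larger the loss.
def loss(manifest):
    return len(programs) - len(manifest)

# Monotonicity principle (toy form): if s manifests a subset of the facts
# that t manifests, then loss(s) >= loss(t).
assert all(loss(s) >= loss(t)
           for s in powerset(programs) for t in powerset(programs) if s <= t)
print("monotonicity holds for this toy loss")
```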

This is odd, because it implies that the total destruction o... (read more)

Jon Garcia (1 point, 20d): I think this is what I was missing. Thanks.

So, then, the monotonicity principle sets a baseline for the agent's loss function that corresponds to how much less stuff can happen to whatever subset of the universe it cares about, getting worse the fewer opportunities become available, due to death or some other kind of stifling. Then the agent's particular value function over universe-states gets added/subtracted on top of that, correct?
Morality is Scary

I have low confidence about this, but my best guess personal utopia would be something like: A lot of cool and interesting things are happening. Some of them are good, some of them are bad (a world in which nothing bad ever happens would be boring). However, there is a limit on how bad something is allowed to be (for example, true death, permanent crippling of someone's mind and eternal torture are over the line), and overall "happy endings" are more common than "unhappy endings". Moreover, since it's my utopia (according to my understanding of the questio... (read more)

Vanessa Kosoy's Shortform

The above threat model seems too paranoid: it is defending against an adversary that sees the trained model and knows the training algorithm. In our application, the model itself is either dangerous or not, independently of the training algorithm that produced it.

Let be our accuracy requirement for the target domain. That is, we want s.t.

Given any , denote to be conditioned on the inequality above, where is regarded as a random variable. Define by

2021 AI Alignment Literature Review and Charity Comparison

Notice that in MIRI's summary of 2020 they wrote "From our perspective, our most interesting public work this year is Scott Garrabrant’s Cartesian frames model and Vanessa Kosoy’s work on infra-Bayesianism."

2021 AI Alignment Literature Review and Charity Comparison

I noticed that you didn't mention infra-Bayesianism, not in 2020 and not this year. Any particular reason?

Larks (6 points, 1mo):

• I prioritized posts by named organizations.
• Diffractor does not list any institutional affiliations on his user page (https://www.lesswrong.com/users/diffractor).
• No institution I noticed listed the post/sequence on their 'research' page.
• No institution I contacted mentioned the post/sequence.
• No post in the sequence was that high in the list of 2021 Alignment Forum posts, sorted by karma (https://www.alignmentforum.org/allPosts?sortedBy=top&timeframe=yearly).
• Several other filtering methods also did not identify the post.

However, upon reflection it does seem to be MIRI-affiliated, so perhaps it should have been included; if I have time I may review and edit it in later.
Alignment By Default

...the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent

Of course, but this in itself is no consolation, because it can spend its finite influence to make the AI perform an irreversible catastrophic action: for example, self-modifying into something explicitly malign.

In e.g. IDA-type protocols you can defend by using a good prior (such as IB physicalism) plus confidence t... (read more)

Alignment By Default

I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that "human values" themselves are natural abstractions

That's fair, but it's still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity.

...learning a human utility f

johnswentworth (2 points, 1mo): Yup, that's right. I still agree with your general understanding, just wanted to clarify the subtlety.

Yup, I agree with all that. I was specifically talking about IRL approaches which try to learn a utility function, not the more general possibility space.

The distinction there is about whether or not there's an actual agent in the external environment which coordinates acausally with the malign inner agent, or some structure in the environment which allows for self-fulfilling prophecies, or something along those lines. The point is that there has to be some structure in the external environment which allows a malign inner agent to gain influence over time by making accurate predictions. Otherwise, the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent; it will end up with zero influence in the long run.
Alignment By Default

In this post, the author describes a pathway by which AI alignment can succeed even without special research effort. The specific claim that this can happen "by default" is not very important, IMO (the author himself only assigns 10% probability to this). On the other hand, viewed as a technique that can be deliberately used to help with alignment, this pathway is very interesting.

The author's argument can be summarized as follows:

• For anyone trying to predict events happening on Earth, the concept of "human values" is a "natural abstraction", i.e. someth
johnswentworth (5 points, 1mo): One subtlety which approximately 100% of people I've talked to about this post apparently missed: I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that "human values" themselves are natural abstractions; values vary a lot more across cultures than e.g. agreement on "trees" as a natural category.

In the particular section you quoted, I'm explicitly comparing the best-case of abstraction by default to the other two strategies, assuming that the other two work out about as well as they could realistically be expected to work. For instance, learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can't do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions. Obviously alignment by default has analogous assumptions/flaws; much of the OP is spent discussing them. The particular section you quote was just talking about the best-case where those assumptions work out well.

I partially agree with this, though I do think there are good arguments that malign simulation issues will not be a big deal (or to the extent that they are, they'll look more like Dr Nefarious (https://www.lesswrong.com/posts/a7jnbtoKFyvu5qfkd/formal-inner-alignment-prospectus?commentId=fqiEhE99nxC2BEKPe) than pure inner daemons), and by historical accident those arguments have not been circulated in this community to nearly the same extent as the arguments that malign simulations will be a big deal. Some time in the next few weeks I plan to write a review of The Solomonoff Prior Is Malign (https://www.lesswrong.com/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is
Introduction To The Infra-Bayesianism Sequence

Notice that some non-worst-case decision rules are reducible to the worst-case decision rule.

The ground of optimization

In this post, the author proposes a semiformal definition of the concept of "optimization". This is potentially valuable since "optimization" is a word often used in discussions about AI risk, and much confusion can follow from sloppy use of the term or from different people understanding it differently. While the definition given here is a useful perspective, I have some reservations about the claims made about its relevance and applications.

The key paragraph, which summarizes the definition itself, is the following:

An optimizing system is a system that

Vanessa Kosoy's Shortform

The concept of corrigibility was introduced by MIRI, and I don't think that's their motivation? On my model of MIRI's model, we won't have time to poke at a slightly subhuman AI, we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is "we won't know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI in our leisure". Which, sure, but I don't see what it has to do with corrigibility.

Corrigibility is neither ne... (read more)

There is essentially one best-validated theory of cognition.

Hi Terry, can you recommend an introduction for people with mathematics / theoretical computer science background? I glanced at the paper you linked but it doesn't seem to have a single equation, mathematical statement or pseudocode algorithm. There are diagrams, but I have no idea what the boxes and arrows actually represent.


Hi Vanessa,  hmm, very good question.  One possibility is to point you at the ACT-R reference manual http://act-r.psy.cmu.edu/actr7/reference-manual.pdf but that's a ginormous document that also spends lots of time just talking about implementation details, because the reference ACT-R implementation is in Lisp (yes, ACT-R has been around that long!)

So, another option would be this older paper of mine, where I attempted to rewrite ACT-R in Python, and so the paper goes through the math that had to be reimplemented. http://act-r.psy.cmu.edu/wordpre... (read more)

Vanessa Kosoy's Shortform

There's a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).

First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn't contain t... (read more)

Vanessa Kosoy's Shortform

Master post for alignment protocols.

Other relevant shortforms:

Vanessa Kosoy (3 points, 2mo): There's a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD (https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=h3Ww6nyt9fpj7BLyo), which we call autocalibrating quantilized RL (AQRL).

First, suppose that we are able to formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn't contain terms such as "oh btw don't kill people while you're building the nanosystem". However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the "relative difficulty assumption" (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.

We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relative to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relative to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.

Notice the hardenin
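A minimal sketch of the quantilization step and the autocalibration loop described above. All names and numbers are my own illustrative assumptions; the real proposal tracks epistemic vs. aleatoric uncertainty with infra-Bayesian machinery that this toy omits.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantilize(base_sampler, proxy_reward, q, n=4000):
    """Draw n actions from the base policy and return one sampled uniformly
    from the top q-fraction by proxy reward. q is the quantilization
    parameter: smaller q = more optimization pressure, less safety."""
    actions = sorted((base_sampler() for _ in range(n)),
                     key=proxy_reward, reverse=True)
    top = actions[:max(1, int(q * n))]
    return top[rng.integers(len(top))]

def autocalibrate(base_sampler, proxy_reward, performance_goal,
                  qs=(0.5, 0.2, 0.1, 0.05, 0.02)):
    """Gradually decrease q until the quantilized policy's estimated expected
    proxy reward meets the performance goal (the RDA is what guarantees such
    a q exists before the policy crosses into attack territory)."""
    for q in qs:
        mean_reward = np.mean([proxy_reward(quantilize(base_sampler, proxy_reward, q))
                               for _ in range(30)])
        if mean_reward >= performance_goal:
            return q
    return qs[-1]

# Toy task: base policy samples a scalar "effort" level from N(0,1);
# the proxy reward is the effort itself.
q_star = autocalibrate(lambda: rng.normal(), lambda a: a, performance_goal=1.0)
```

The loop starts from the safest (largest) q and only tightens as far as the performance goal requires, mirroring the "tune until the goal is met" logic in the text.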
Considerations on interaction between AI and expected value of the future

My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, Pr[alignment] is not that low and therefore Pr[misalignment | non-extinction] is low, by Bayes.
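Spelling out the Bayes step with made-up illustrative numbers (the specific probabilities are my assumptions, not Vanessa's):

```python
# Pr[alignment] "not that low", Pr[non-extinction | alignment] = 1,
# Pr[non-extinction | misalignment] << 1 -- all numbers below are illustrative.
p_align = 0.3
p_survive_given_align = 1.0
p_survive_given_misalign = 0.01

p_misalign = 1 - p_align
p_survive = (p_align * p_survive_given_align
             + p_misalign * p_survive_given_misalign)
p_misalign_given_survive = p_misalign * p_survive_given_misalign / p_survive
print(f"{p_misalign_given_survive:.3f}")  # prints 0.023: low, as claimed
```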

Vladimir Slepnev (1 point, 2mo): To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.
Morality is Scary

Just to be clear, this isn't in response to something I wrote, right? (I'm definitely not advocating any kind of "utilitarian TAI project" and would be quite scared of such a project myself.)

No! Sorry, if I gave that impression.

So what are you (and them) then? What would your utopia look like?

Wei Dai (2 points, 1mo): Yeah, I mean aside from how much you care about various other people, what concrete things do you want in your utopia?
Morality is Scary

My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers.

I'm not sure what you mean here, but also the process is not iterated: the initial bargaining is deciding the outcome once and for all. At least that's the mathematical ideal we're approximating.

In the end, I think my concern is that we won't get buy-in from a large majority of users: In order to accommodate some proportion with odd moral views it seems likely you'll be throw

Joe_Collman (1 point, 2mo): Ah, I was just being an idiot on the bargaining system w.r.t. small numbers of people being able to hold it to ransom. Oops. Agreed that more majority power isn't desirable.

[Re iteration, I only meant that the bargaining could become iterated if the initial bargaining result were to decide upon iteration (to include more future users). I now don't think this is particularly significant.]

I think my remaining uncertainty (/confusion) is all related to the issue I first mentioned (embedded copy experiences). It strikes me that something like this can also happen where minds grow/merge/overlap. Does this avoid the problem if i's preferences use indirection? It seems to me that a robust pointer to j may be enough: that with a robust pointer it may be possible to implicitly require something like source-code-access without explicitly referencing it. E.g. where i has a preference to "experience j suffering in circumstances where there's strong evidence it's actually j suffering, given that these circumstances were the outcome of this bargaining process". If i can't robustly specify things like this, then I'd guess there'd be significant trouble in specifying quite a few (mutually) desirable situations involving other users too.

IIUC, this would only be any problem for the denosed bargaining to find a good d1: for the second bargaining on the true utility functions there's no need to put anything "out of scope" (right?), so win-wins are easily achieved.
Morality is Scary

This is not a theory that's familiar to me. Why do you think this is true? Have you written more about it somewhere or can link to a more complete explanation?

I've been considering writing about this for a while, but so far I don't feel sufficiently motivated. So, the links I posted upwards in the thread are the best I have, plus vague gesturing in the directions of Hansonian signaling theories, Jaynes' theory of consciousness and Yudkowsky's belief in belief.

Considerations on interaction between AI and expected value of the future

I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the two:

• Goodharting some proxy, such as making the reward signal turn on (getting the human to press the reward button) instead of actually satisfying the human's request. This usually produces a universe without people, since specifying a "person" is fairly complicated and the proxy will not be robustly tied to this concept.
• Allowing a daemon to take over. Daemonic utility functions are probably completely alien and also produce a universe without people. One caveat is: maybe t
Vladimir Slepnev (2mo): These involve extinction, so they don't answer the question of what's the most likely outcome conditional on non-extinction. I think the answer there is a specific kind of near-miss at alignment which is quite scary.
Considerations on interaction between AI and expected value of the future

I'm surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?

Vladimir Slepnev (2mo): I think alignment is finicky, and there's a "deep pit around the peak" as discussed here [https://www.lesswrong.com/posts/3WMscsscLEavkTJXv/s-risks-why-they-are-the-worst-existential-risks-and-how-to#Rm43SocTknAjp8xAo].
Morality is Scary

I want to add a little to my stance on utilitarianism. A utilitarian superintelligence would probably kill me and everyone I love, because we are made of atoms that could be used for minds that are more hedonic[1][2][3]. Given a choice between paperclips and utilitarianism, I would still choose utilitarianism. But, if there was a utilitarian TAI project along with a half-decent chance to do something better (by my lights), I would actively oppose the utilitarian project. From my perspective, such a project is essentially enemy combatants.

1. One way to avo

Wei Dai (2mo): This seems like a reasonable concern about some types of hedonic utilitarianism. To be clear, I'm not aware of any formulation of utilitarianism that doesn't have serious issues, and I'm also not aware of any formulation of any morality that doesn't have serious issues. Just to be clear, this isn't in response to something I wrote, right? (I'm definitely not advocating any kind of "utilitarian TAI project" and would be quite scared of such a project myself.) So what are you (and them) then? What would your utopia look like?
Morality is Scary

Yes, it's not a very satisfactory solution. Some alternative/complementary solutions:

• Somehow use non-transformative AI to do my mind uploading, and then have the TAI learn by inspecting the uploads. Would be great for single-user alignment as well.
• Somehow use non-transformative AI to create perfect lie detectors, and use this to enforce honesty in the mechanism. (But, is it possible to detect self-deception?)
• Have the TAI learn from past data which wasn't affected by the incentives created by the TAI. (But, is there enough information there?)
• Shape t
Morality is Scary

Ah - that's cool if IB physicalism might address this kind of thing

I admit that at this stage it's unclear because physicalism brings in the monotonicity principle that creates bigger problems than what we discuss here. But maybe some variant can work.

For instance, suppose initially 90% of people would like to have an iterated bargaining process that includes future (trans/post)humans as users, once they exist. The other 10% are only willing to accept such a situation if they maintain their bargaining power in future iterations (by whatever mechanism)

Joe_Collman (2mo): Sure, I'm not sure there's a viable alternative either. This kind of approach seems promising - but I want to better understand any downsides. My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers. In retrospect, this is probably silly: if there's a designable-by-us mechanism that better achieves what we want, the first bargaining iteration should find it. If not, then what I'm gesturing at must either be incoherent, or not endorsed by the 10% - so hard-coding it into the initial mechanism wouldn't get the buy-in of the 10% to the extent that they understood the mechanism. In the end, I think my concern is that we won't get buy-in from a large majority of users: In order to accommodate some proportion with odd moral views it seems likely you'll be throwing away huge amounts of expected value in others' views - if I'm correctly interpreting your proposal (please correct me if I'm confused). Is this where you'd want to apply amplified denosing? So, rather than filtering out the undesirable i, for these i you use: (D_i u)(x_i, x_{-i}, y) := max_{x' ∈ ∏_{j≠i} X_j, y' ∈ Y} u(x_i, x', y') [i.e. ignoring y and imagining it's optimal]. However, it's not clear to me how we'd decide who gets strong denosing (clearly not everyone, or we don't pick a y). E.g. if you strong-denose anyone who's too willing to allow bargaining failure [everyone dies] you might end up filtering out altruists who worry about suffering risks. Does that make sense?
Biology-Inspired AGI Timelines: The Trick That Never Works

I didn't ask how much, I asked what does it even mean. I think I understand the principles of Cotra's report. What I don't understand is why should we believe the "neural anchor" when (i) modern algorithms applied to a brain-sized ANN might not produce brain-performance and (ii) the compute cost of future algorithms might behave completely differently. (i.e. I don't understand how Carl's and Mark's arguments in this thread protect the neural anchor from Yudkowsky's criticism.)

Daniel Kokotajlo (2mo): These are three separate things: (a) What is the meaning of "2020-FLOPS-equivalent that TAI needs"? (b) Can you build TAI with 2020 algorithms without some truly astronomical amount of FLOPs? (c) Why should we believe the "neural anchor"? (a) is answered roughly in my linked post and in much more detail and rigor in Ajeya's doc. (b) depends on what you mean by truly astronomical; I think it would probably be doable for 10^35, Ajeya thinks 50% chance. For (c), I actually don't think we should put that much weight on the "neural anchor," and I don't think Ajeya's framework requires that we do (although, it's true, most of her anchors do center on this human-brain-sized ANN scenario, which indeed I think we shouldn't put so much weight on). That said, I think it's a reasonable anchor to use, even if it's not where all of our weight should go. This post [https://www.lesswrong.com/posts/HhWhaSzQr6xmBki8F/birds-brains-planes-and-ai-against-appeals-to-the-complexity] gives some of my intuitions about this. Of course Ajeya's report says a lot more.
Biology-Inspired AGI Timelines: The Trick That Never Works

I don't understand this.

• What is the meaning of "2020-FLOPS-equivalent that TAI needs"? Plausibly you can't build TAI with 2020 algorithms without some truly astronomical amount of FLOPs.
• What is the meaning of "20XX-FLOPS convert to 2020-FLOPS-equivalent"? If 2020 algorithms hit DMR, you can't match a 20XX algorithm with a 2020 algorithm without some truly astronomical amount of FLOPs.

Maybe you're talking about extrapolating the compute-performance curve, assuming that it stays stable across algorithmic paradigms (although, why would it??) However, in ... (read more)

Daniel Kokotajlo (2mo): I think 10^35 would probably be enough. This post [https://www.lesswrong.com/posts/rzqACeBGycZtqCfaX/fun-with-12-ooms-of-compute] gives some intuition as to why, and also goes into more detail about what 2020-FLOPS-equivalent-that-TAI-needs means. If you want even more detail + rigor, see Ajeya's report. If you think it's very unlikely that 10^35 would be enough, I'd love to hear more about why -- what are the blockers? Why would OmegaStar, SkunkWorks, etc. described in the post (and all the easily-accessible variants thereof) fail to be transformative? (Also, same questions for APS-AI [https://docs.google.com/document/d/1smaI1lagHHcrhoi6ohdq3TYIZv0eNWWZMPEy8C8byYg/edit#] or AI-PONR [https://www.lesswrong.com/posts/JPan54R525D68NoEt/the-date-of-ai-takeover-is-not-the-day-the-ai-takes-over] instead of TAI, since I don't really care about TAI)
Morality is Scary

I don't think people determine their values through either process. I think that they already have values, which are to a large extent genetic and immutable. Instead, these processes determine what values they pretend to have for game-theory reasons. So, the big difference between the groups is which "cards" they hold and/or what strategy they pursue, not an intrinsic difference in values.

But also, if we do model values as the result of some long process of reflection, and you're worried about the AI disrupting or insufficiently aiding this process, then t... (read more)

Wei Dai (2mo): This is not a theory that's familiar to me. Why do you think this is true? Have you written more about it somewhere, or can you link to a more complete explanation? This seems reasonable to me. (If this was meant to be an argument against something I said, there may have been another miscommunication, but I'm not sure it's worth tracking that down.)
Morality is Scary

I'm moderately sure what my values are, to some approximation. More importantly, I'm even more sure that, whatever my values are, they are not so extremely different from the values of most people that I should wage some kind of war against the majority instead of trying to arrive at a reasonable compromise. And, in the unlikely event that most people (including me) will turn out to be some kind of utilitarians after all, it's not a problem: value aggregation will then produce a universe which is pretty good for utilitarians.

Wei Dai (2mo): Maybe you're just not part of the target audience of my OP then... but from my perspective, if I determine my values through the kind of process described in the first quote, and most people determine their values through the kind of process described in the second quote, it seems quite likely that the values end up being very different. The kind of solution I have in mind is not "waging war" but, for example, solving metaphilosophy and building an AI that can encourage philosophical reflection in humans or enhance people's philosophical abilities. What if you turn out to be some kind of utilitarian but most people don't (because you're more like the first group in the OP and they're more like the second group), or most people will eventually turn out to be some kind of utilitarian in a world without AI, but in a world with AI, this [https://www.lesswrong.com/posts/y5jAuKqkShdjMNZab/morality-is-scary?commentId=vpQixgFTGLoJ2CiJk] will happen?
Biology-Inspired AGI Timelines: The Trick That Never Works

We already know how much compute we have, so we don't need Moravec's projections for this? If Yudkowsky described Moravec's analysis correctly, then Moravec's threshold was crossed in 2008. Or, by "other extrapolations" you mean other estimates of human brain compute? Cotra's analysis is much more recent and IIUC she puts the "lifetime anchor" (a more conservative approach than Moravec's) at about one order of magnitude above the biggest models currently used.

Now, the seeds take time to sprout, but according to Mark's model this time is quite short. So, it seems like this line of reasoning produces a timeline significantly shorter than the Plattian 30 years.

Morality is Scary

I think that a rigorous treatment of such issues will require some variant of IB physicalism (in which the monotonicity problem has been solved, somehow). I am cautiously optimistic that a denosing operator exists there which dodges these problems. This operator will declare both the manifesting and evaluation of the source codes of other users to be "out of scope" for a given user. Hence, a preference of user i to observe the suffering of user j would be "satisfied" by observing nearly anything, since the maximization can interpret anything as a simulation of j.

Joe_Collman (2mo): Ah - that's cool if IB physicalism might address this kind of thing (still on my to-read list). Agreed that the subjoe thing isn't directly a problem. My worry is mainly whether it's harder to rule out i experiencing a simulation of x_subj suffering, since subj isn't a user. However, if you can avoid the suffering j's by limiting access to information, the same should presumably work for relevant sub-j's. This isn't so clear (to me at least) if: 1. Most, but not all, current users want future people to be treated well. 2. Part of being "treated well" includes being involved in an ongoing bargaining process which decides the AI's/future's trajectory. For instance, suppose initially 90% of people would like to have an iterated bargaining process that includes future (trans/post)humans as users, once they exist. The other 10% are only willing to accept such a situation if they maintain their bargaining power in future iterations (by whatever mechanism). If you iterate this process, the bargaining process ends up dominated by users who won't relinquish any power to future users. 90% of initial users might prefer drift over lock-in, but we get lock-in regardless (the disagreement point also amounting to lock-in). Unless I'm confusing myself, this kind of thing seems like a problem. (Not in terms of reaching some non-terrible lower bound, but in terms of realising potential.) Wherever there's this kind of asymmetry/degradation over bargaining iterations, I think there's an argument for building in a way to avoid it from the start - since anything short of 100% just limits to 0 over time. [It's by no means clear that we do want to make future people users on an equal footing to today's people; it just seems to me that we have to do it at step zero or not at all.]
Morality is Scary

First, you wrote "a part of me is actually more scared of many futures in which alignment is solved, than a future where biological life is simply wiped out by a paperclip maximizer." So, I tried to assuage this fear for a particular class of alignment solutions.

Second... Yes, for a utilitarian this doesn't mean "much". But, tbh, who cares? I am not a utilitarian. The vast majority of people are not utilitarians. Maybe even literally no one is an (honest, not self-deceiving) utilitarian. From my perspective, disappointing the imaginary utilitarian is (in itself) about as upsetting as disappointing the imaginary paperclip maximizer.

Second… Yes, for a utilitarian this doesn’t mean “much”. But, tbh, who cares? I am not a utilitarian. The vast majority of people are not utilitarians. Maybe even literally no one is an (honest, not self-deceiving) utilitarian. From my perspective, disappointing the imaginary utilitarian is (in itself) about as upsetting as disappointing the imaginary paperclip maximizer.

I'm not a utilitarian either, because I don't know what my values are or should be. But I do assign significant credence to the possibility that something in the vicinity of utilitaria... (read more)

Biology-Inspired AGI Timelines: The Trick That Never Works

Hmm... Interesting. So, this model says that algorithmic innovation is so fast that it is not much of a bottleneck: we always manage to find the best algorithm for a given compute budget relatively quickly after this compute becomes available. Moreover, there is some smooth relation between compute and performance assuming the best algorithm for this level of compute. [EDIT: The latter part seems really suspicious though, why would this relation persist across very different algorithms?] Or at least this is true if "best algorithm" is interpreted to mean "best algo... (read more)

Mark Xu (2mo): The way that you would think about NN anchors in my model (caveat that this isn't my whole model):
• You have some distribution over 2020-FLOPS-equivalent that TAI needs.
• Algorithmic progress means that 20XX-FLOPS convert to 2020-FLOPS-equivalent at some 1:N ratio.
• The function from 20XX to the 1:N ratio is relatively predictable, e.g. a "smooth" exponential with respect to time.
• Therefore, even though current algorithms will hit DMR, the transition to the next algorithm that has less DMR is also predictably going to be some constant ratio better at converting current-FLOPS to 2020-FLOPS-equivalent.
E.g. in (some smallish) parts of my view, you take observations like "AGI will use compute more efficiently than human brains" and can ask questions like "but how much is the efficiency of compute->cognition increasing over time?" and draw that graph and try to extrapolate. Of course, the main trouble is in trying to estimate the original distribution of 2020-FLOPS-equivalent needed for TAI, which might go astray in the way a 1950-watt-equivalent needed for TAI would go astray.
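Mark Xu's conversion model can be sketched numerically. All constants below are made-up placeholders (the 10^35 figure is the one discussed elsewhere in this thread; nothing here is from Ajeya's actual report):

```python
# 20XX physical FLOPs convert to 2020-FLOPS-equivalent at a 1:N ratio,
# with N assumed to grow as a smooth exponential (algorithmic progress).
TAI_2020_FLOPS_EQUIV = 1e35       # hypothetical requirement for TAI
COMPUTE_2020 = 1e24               # hypothetical largest training run in 2020
ALGO_DOUBLINGS_PER_YEAR = 0.5     # N doubles every 2 years (assumption)
COMPUTE_DOUBLINGS_PER_YEAR = 1.0  # hardware + spending growth (assumption)

def equivalent_2020_flops(year):
    """Largest feasible training run in `year`, in 2020-FLOPS-equivalent."""
    n_ratio = 2 ** (ALGO_DOUBLINGS_PER_YEAR * (year - 2020))
    physical = COMPUTE_2020 * 2 ** (COMPUTE_DOUBLINGS_PER_YEAR * (year - 2020))
    return physical * n_ratio

year = 2020
while equivalent_2020_flops(year) < TAI_2020_FLOPS_EQUIV:
    year += 1
print(year)  # 2045 under these made-up growth rates
```

The point of the extrapolation is that even if current algorithms hit DMR, a predictable N(year) still lets you convert future compute into 2020-equivalents and read off a timeline.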
gwern (2mo): What Moravec says [https://jetpress.org/volume1/moravec.htm] is merely that $1k human-level compute will become available in the '2020s', and he offers several different trendline extrapolations: only the most aggressive puts us at cheap human-level compute in 2020/2021 (note the units on his graph are in decades). On the other extrapolations, we don't hit cheap human-compute until the end of the decade. He also doesn't commit to how long it takes to turn compute into powerful systems; it's more of a pre-requisite: only once the compute is available can R&D really start, the same way that DL didn't start instantly in 2010 when various levels of compute/$ were hit. Seeds take time to sprout, to use his metaphor.
Morality is Scary

Given some assumptions about the domains of the utility functions, it is possible to do better than what I described in the previous comment. Let X_i be the space of possible experience histories[1] of user i and Y the space of everything else the utility functions depend on (things that nobody can observe directly). Suppose that the domain of the utility functions is ∏_i X_i × Y. Then, we can define the "denosing[2] operator" D_i for user i by

(D_i u)(x_i, x_{-i}, y) := max_{x' ∈ ∏_{j≠i} X_j, y' ∈ Y} u(x_i, x', y')

Here, x_i is the argument of u that ranges in X_i, x_{-i} are the argument... (read more)
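As a concrete toy illustration, the denosing operator can be computed directly when the spaces are finite. Everything below (the two-user setup, the example spaces, the "nosy" utility function) is a hypothetical sketch, not anything from the comment itself:

```python
from itertools import product

USERS = [0, 1]
X = {0: ["joy", "pain"], 1: ["joy", "pain"]}  # experience histories per user
Y = ["y0", "y1"]                              # the unobservable "everything else"

def denose(u, i):
    """Denosing operator for user i:
    (D_i u)(x, y) = max over x' (agreeing with x on coordinate i) and y'
    of u(x', y') -- user i's utility with everyone else's experiences and
    the hidden state replaced by i's best-case imagination."""
    def d_u(x, y):
        best = float("-inf")
        for alt in product(*[X[j] for j in USERS]):
            for y_alt in Y:
                # keep i's own coordinate fixed, let the rest vary
                z = tuple(x[j] if j == i else alt[j] for j in USERS)
                best = max(best, u(z, y_alt))
        return best
    return d_u

# A "nosy" utility: user 0 is happier when user 1 is in pain.
def u0(x, y):
    return (1.0 if x[0] == "joy" else 0.0) + (0.5 if x[1] == "pain" else 0.0)

d_u0 = denose(u0, 0)
# After denosing, user 0's utility no longer depends on user 1's actual
# experience: the max imagines the preferred x_1 regardless.
print(d_u0(("joy", "joy"), "y0"))   # 1.5
print(d_u0(("joy", "pain"), "y0"))  # 1.5
```

Note that d_u0 is constant in user 1's coordinate and in y, which is exactly the "out of scope" behavior the operator is meant to enforce.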

Joe_Collman (2mo): This is very interesting (and "denosing operator" is delightful). Some thoughts: If I understand correctly, I think there can still be a problem where user i wants an experience history such that part of the history is isomorphic to a simulation of user j suffering (i wants to fully experience j suffering in every detail). Here a fixed x_i may entail some fixed x_j for (some copy of) some j. It seems the above approach can't then avoid leaving one of i or j badly off: If i is permitted to freely determine the experience of the embedded j copy, the disagreement point in the second bargaining will bake this in: j may be horrified to see that i wants to experience its copy suffer, but will be powerless to stop it (if i won't budge in the bargaining). Conversely, if the embedded j is treated as a user which i will imagine is exactly to i's liking, but who actually gets what j wants, then the selected μ_0 will be horrible for i (e.g. perhaps i wants to fully experience Hitler suffering, and instead gets to fully experience Hitler's wildest fantasies being realized). I don't think it's possible to do anything like denosing to avoid this. It may seem like this isn't a practical problem, since we could reasonably disallow such embedding. However, I think that's still tricky since there's a less exotic version of the issue: my experiences likely already are a collection of subagents' experiences. Presumably my maximisation over x_joe is permitted to determine all the x_subjoe. It's hard to see how you draw a principled line here: the ideal future for most people may easily be transhumanist to the point where today's users are tomorrow's subpersonalities (and beyond). A case that may have to be ruled out separately is where i wants to become a suffering j. Depending on what I consider 'me', I might be entirely fine with it if 'I' wake up tomorrow as a suffering j (if I'm done living and think j deserves to suffer).
Or perhaps I want to clone myself 10^10 times, and then have all copies convert themselves to suffering j's after
Soares, Tallinn, and Yudkowsky discuss AGI cognition

I'll note that you're a MIRI research associate, so I wouldn't have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.

There is ample discussion of distribution shifts ("seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set") by other people. Random examples: Christiano, Shah, DeepMind.

Maybe Eliezer is talking specifically about the context of transparency. Personally, I haven't worked much on transparency because IMO (i) even if we solve transparency perfectly but don'... (read more)

Morality is Scary

I'm imagining cooperative bargaining between all users, where the disagreement point is everyone dying[1][2] (this is a natural choice assuming that if we don't build aligned TAI we get paperclips). This guarantees that every user will receive an outcome that's at least not worse than death.

With Nash bargaining, we can still get issues for (in)famous people that millions of people want to do unpleasant things to. Their outcome will be better than death, but maybe worse than in my claimed "lower bound".

With Kalai-Smorodinsky bargaining things look better, s... (read more)
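The claimed lower bound — cooperative bargaining with "everyone dies" as the disagreement point gives every user an outcome at least as good as death — can be illustrated with a toy Nash bargaining computation over a finite outcome set. All outcome names and utility numbers below are made up for illustration:

```python
# Utilities normalized so that the disagreement point ("everyone dies")
# is 0 for every user. Outcomes and numbers are hypothetical.
outcomes = {
    "compromise": (2.0, 2.0, 2.0),
    "faction A wins": (5.0, 0.5, 0.5),
    "persecute user 2": (3.0, 3.0, -1.0),  # worse than death for user 2
    "paperclips": (0.0, 0.0, 0.0),
}
disagreement = (0.0, 0.0, 0.0)

def nash_solution(outcomes, d):
    """Pick the feasible outcome maximizing the Nash product of gains
    over the disagreement point, restricted to outcomes dominating d."""
    feasible = {
        name: u for name, u in outcomes.items()
        if all(ui >= di for ui, di in zip(u, d))
    }
    def gain_product(u):
        p = 1.0
        for ui, di in zip(u, d):
            p *= ui - di
        return p
    return max(feasible, key=lambda name: gain_product(feasible[name]))

# "persecute user 2" is excluded outright (worse than death for user 2),
# and the Nash product favors the broad compromise over a lopsided win.
print(nash_solution(outcomes, disagreement))  # compromise
```

The dominance filter is what delivers the guarantee: any outcome worse than death for some user never enters the feasible set, regardless of how many other users prefer it.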

Wei Dai (2mo): Assuming each lawyer has the same incentive to lie as its client, it has an incentive to misrepresent that some preferable-to-death outcomes are "worse than death" (in order to force those outcomes out of the set of "feasible agreements" in hope of getting a more preferred outcome as the actual outcome), and this at equilibrium is balanced by the marginal increase in the probability of getting "everyone dies" as the outcome (due to feasible agreements becoming a null set) caused by the lie. So the probability of "everyone dies" in this game has to be non-zero. (It's the same kind of problem as in the AI race or tragedy of the commons: people not taking into account the full social costs of their actions as they reach for private benefits.) Of course in actuality everyone dying may not be a realistic consequence of failure to reach agreement, but if the real consequence is better than that, and the AI lawyers know this, they would be more willing to lie since the perceived downside of lying would be smaller, so you end up with a higher chance of no agreement.
Joe_Collman (2mo): This seems near guaranteed to me: a non-zero number of people will be that crazy (in our terms), so filtering will be necessary. Then I'm curious about how we draw the line on outlier filtering. What filtering rule do we use? I don't yet see a good principled rule (e.g. if we want to throw out people who'd collapse agreement to the disagreement point, there's more than one way to do that).
Vanessa Kosoy's Shortform

The main issues with anti-goodharting that I see are the difficulty of defining the proxy utility and base distribution, the difficulty of making it corrigible, not locking in a fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.

The proxy utility in debate is perfectly well-defined: it is the ruling of the human judge. For the base distribution I also made some concrete proposals (which certainly might be improvable but are not obviously bad). As to corrigibility, I think it's an ill-posed... (read more)