# All of AlexMennen's Comments + Replies

Dutch-Booking CDT: Revised Argument

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract fo... (read more)

OK, here's my position.

As I said in the post, the real answer is that this argument simply does not apply if the agent knows its action. More generally: the argument applies precisely to those actions to which the agent ascribes positive probability (directly before deciding). So, it is possible for agents to maintain a difference between counterfactual and evidential expectations. However, I think it's rarely normatively correct for an agent to be in such a position.

Even though the decision procedure of CDT is deterministic, this does not mean that agents... (read more)

2Abram Demski8moI thought about these things in writing this, but I'll have to think about them again before making a full reply. Another similar scenario would be: we assume the probability of an action is small if it's sub-optimal, but smaller the worse it is.
Utility Maximization = Description Length Minimization

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

An overview of 11 proposals for building safe advanced AI

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

Relaxed adversarial training for inner alignment

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results the end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by... (read more)

1Evan Hubinger1yYep—that's one of the main concerns. The idea, though, is that all you have to deal with should be a standard overfitting problem, since you don't need the acceptability predicate to work once the model is deceptive, only beforehand. Thus, you should only have to worry about gradient descent overfitting to the acceptability signal, not the model actively trying to trick you—which I think is solvable overfitting problem. Currently, my hope is that you can do that via using the acceptability signal to enforce an easy-to-verify condition that rules out deception such as myopia [https://www.alignmentforum.org/posts/BKM8uQS6QdJPZLqCr/towards-a-mechanistic-understanding-of-corrigibility] .
An overview of 11 proposals for building safe advanced AI

Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it's just affecting the efficiency with which putting resources towards training leads to good perfor... (read more)

My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance.

I think this is incorrect. Most training setups eventually flatline, or close to it (e.g. see AlphaZero's ELO curve), and need algorithmic or other improvements to do better.

3Adam Shimi1yI believe the main difference is that training is a one-time cost. Thus lacking training competitiveness is less an issue than lacking performance competitiveness, as the latter is a recurrent cost.
An Orthodox Case Against Utility Functions
This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution compari
An Orthodox Case Against Utility Functions
In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems
An Orthodox Case Against Utility Functions

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

1Ben Pace2yAh, I see, that makes sense.
An Orthodox Case Against Utility Functions

It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in peoples' heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions ... (read more)

3Abram Demski2yPerhaps it goes without saying, but obviously, both frameworks are flexible enough to allow for most phenomena -- the question here is what is more natural in one framework or another. My main argument is that the procrastination paradox is not natural at all in a Savage framework, as it suggests an uncomputable utility function. I think this plausibly outweighs the issue you're pointing at. But with respect to the issue you are pointing at: In the Savage framework, an outcome already encodes everything you care about. So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios. Alternately, we might say (as as Savage said) that the Savage axioms apply to "small worlds" -- small scenarios which the agent abstracts from its experience, such as the decision of whether to break an egg for an omelette. These can be easily considered by the agent, if it can assign values "from outside the problem" in an appropriate way. But then, to account for the breadth of human reasoning, it seems to me we also want an account of things like extending a small world when we find that it isn't sufficient, and coherence between different small-world frames for related decisions. This gives a picture very much like the Jeffrey-Bolker picture, in that we don't really work with outcomes which completely specify everything we care about, but rather, work with a variety of simplified outcomes with coherence requirements between simpler and more complex views. So overall I think it is better to have some picture where you can break things up in a more tractable way, rather than having full outcomes which you need to pass through to get values. In the Jeffrey-Bolker framework, you can re-estimate the value of an event by breaking it up into pieces, estimating the val
3Ben Pace2yI don't understand JB yet, but when I introspected just now, my experience of decision-making doesn't have any separation between beliefs and values, so I think I disagree with the above. I'll try to explain why by describing my experience. (Note: Long comment below is just saying one very simple thing. Sorry for length. There's a one-line tl;dr at the end.) Right now I'm considering doing three different things. I can go and play a videogame that my friend suggested we play together, I can do some LW work with my colleague, or I can go play some guitar/piano. I feel like the videogame isn't very fun right now because I think the one my friend suggested not that interesting of a shared experience. I feel like the work is fun because I'm excited about publishing the results of the work, and the work itself involves a kind of cognition I enjoy. And playing piano is fun because I've been skilling up a lot lately and I'm going to do accompany some of my housemates in some hamilton songs. Now, I know some likely ways that what seems valuable to me might change. There are other videogames I've played lately that have been really fascinating and rewarding to play together, that involve problem solving where 2 people can be creative together. I can imagine the work turning out to not actuallybe the fun part but the boring parts. I can imagine that I've found no traction (skill-up) in playing piano, or that we're going to use a recorded soundtrack rather than my playing for the songs we're learning. All of these to me feel like updates in my understanding of what events are reachable to me; this doesn't feel like changing my utility evaluation of the events. The event of "play videogame while friend watches bored" could change to "play videogame while creatively problem-solving with friend". The event of "gain skill in piano and then later perform songs well with friends" could change to "struggle to do something difficult and sound bad and that's it". If I think about c
An Orthodox Case Against Utility Functions

I think we're going to have to back up a bit. Call the space of outcomes and the space of Turing machines . It sounds like you're talking about two functions, and . I was thinking of as the utility function we were talking about, but it seems you were thinking of .

You suggested should be computable but should not be. It seems to me that should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing... (read more)

1orthonormal2yI've been using computable to mean a total function (each instance is computable in finite time). I'm thinking of an agent outside a universe about to take an action, and each action will cause that universe to run a particular TM. (You could maybe frame this as "the agent chooses the tape for the TM to run on".) For me, this is analogous to acting in the world and causing the world to shift toward some outcomes over others. By asserting that U should be the computable one, I'm asserting that "how much do I like this outcome" is a more tractable question than "which actions result in this outcome". An intuition pump in a human setting: I can check whether given states of a Go board are victories for one player or the other, or if the game is not yet finished (this is analogous to U being a total computable function). But it's much more difficult to choose, for an unfinished game where I'm told I have a winning strategy, a move such that I still have a winning strategy. The best I can really do as a human is calculate a bit and then guess at how the leaves will probably resolve if we go down them (this is analogous to eval being an enumerable but not necessarily computable function). In general, individual humans are much better at figuring out what outcomes we want than we are at figuring out exactly how to achieve those outcomes. (It would be quite weird if the opposite were the case.) We're not good at either in an absolute sense, of course.
An Orthodox Case Against Utility Functions

It's not clear to me what this means in the context of a utility function.

1orthonormal2yLet's talk first about non-embedded agents. Say that I'm given the specification of a Turing machine, and I have a computable utility mapping from output states (including "does not halt") to [0,1]. We presumably agree that is possible. I agree that it's impossible to make a computable mapping from Turing machines to outcomes, so therefore I cannot have a computable utility function from TMs to the reals which assigns the same value to any two TMs with identical output. But I can have a logical inductor which, for each TM, produces a sequence of predictions about that TM's output's utility. Every TM that halts will eventually get the correct utility, and every TM that doesn't will converge to some utility in [0,1], with the usual properties for logical inductors guaranteeing that TMs easily proven to have the same output will converge to the same number, etc. That's a computable sequence of utility functions over TMs with asymptotic good properties. At any stage, I could stop and tell you that I choose some particular TM as the best one as it seems to me now. I haven't really thought in a long while about questions like "do logical inductors' good properties of self-prediction mean that they could avoid the procrastination paradox", so I could be talking nonsense there.
An Orthodox Case Against Utility Functions

I'm not sure what it would mean for a real-valued function to be enumerable. You could call a function enumerable if there's a program that takes as input and enumerates the rationals that are less than , but I don't think this is what you want, since presumably if a Turing machine halting can generate a positive amount of utility that doesn't depend on the number of steps taken before halting, then it could generate a negative amount of utility by halting as well.

I think accepting the type of reasoning you g... (read more)

1orthonormal2yI mean the sort of "eventually approximately consistent over computable patterns" thing exhibited by logical inductors, which is stronger than limit-computability.
An Orthodox Case Against Utility Functions
we need not assume there are "worlds" at all. ... In mathematics, it brings to mind pointless topology.

I don't think the motivation for this is quite the same as the motivation for pointless topology, which is designed to mimic classical topology in a way that Jeffrey-Bolker-style decision theory does not mimic VNM-style decision theory. In pointless topology, a continuous function of locales is a function from the lattice of open sets of to the lattice of open sets of . So a similar thing here would be to treat a utility funct... (read more)

2Abram Demski2yPart of the point of the JB axioms is that probability is constructed together with utility in the representation theorem, in contrast to VNM, which constructs utility via the representation theorem, but takes probability as basic. This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem. As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities. A similar remark can be made of Savage. Right, that's fair. Although: James Joyce, the big CDT advocate, is quite the Jeffrey-Bolker fan! See Why We Still Need the Logic of Decision [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.321.7829&rep=rep1&type=pdf] for his reasons. Doesn't pointless topology allow for some distinctions which aren't meaningful in pointful topology, though? (I'm not really very familiar, I'm just going off of something I've heard.) Isn't the approach you mention pretty close to JB? You're not modeling the VNM/Savage thing of arbitrary gambles; you're just assigning values (and probabilities) to events, like in JB. Setting aside VNM and Savage and JB, and considering the most common approach in practice -- use the Kolmogorov axioms of probability, and treat utility as a random variable -- it seems like the pointless analogue would be close to what you say. Yeah. The question remains, though: should we think of utility as a function of these minimal elements of the completion? Or not? The computability issue I raise is, to me, suggestive of the negative.
AlexMennen's Shortform

Theorem: Fuzzy beliefs (as in https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v ) form a continuous DCPO. (At least I'm pretty sure this is true. I've only given proof sketches so far)

The relevant definitions:

A fuzzy belief over a set is a concave function such that (where is the space of probability distributions on ). Fuzzy beliefs are partially ordered by ... (read more)

What are we assuming about utility functions?

Ok, I see what you mean about independence of irrelevant alternatives only being a real coherence condition when the probabilities are objective (or otherwise known to be equal because they come from the same source, even if there isn't an objective way of saying what their common probability is).

But I disagree that this makes VNM only applicable to settings in which all sources of uncertainty have objectively correct probabilities. As I said in my previous comment, you only need there to exist some source of objective probabilities, and you can then ... (read more)

1johnswentworth2yLet me repeat back your argument as I understand it. If we have a Bayesian utility maximizing agent, that's just a probabilistic inference layer with a VNM utility maximizer sitting on top of it. So our would-be arbitrageur comes along with a source of "objective" randomness, like a quantum random number generator. The arbitrageur wants to interact with the VNM layer, so it needs to design bets to which the inference layer assigns some specific probability. It does that by using the "objective" randomness source in the bet design: just incorporate that randomness in such a way that the inference layer assigns the probabilities the arbitrageur wants. This seems correct insofar as it applies. It is a useful perspective, and not one I had thought much about before this, so thanks for bringing it in. The main issue I still don't see resolved by this argument is the architecture question [https://www.lesswrong.com/posts/osxNg6yBCJ4ur9hpi/does-agent-like-behavior-imply-agent-like-architecture] . The coherence theorems only say that an agent must act as if they perform Bayesian inference and then choose the option with highest expected value based on those probabilities. In the agent's actual internal architecture, there need not be separate modules for inference and decision-making (a Kalman filter is one example). If we can't neatly separate the two pieces somehow, then we don't have a good way to construct lotteries with specified probabilities, so we don't have a way to treat the agent as a VNM-type agent. This directly follows from the original main issue: VNM utility theory is built on the idea that probabilities live in the environment, not in the agent. If there's a neat separation between the agent's inference and decision modules, then we can redefine the inference module to be part of the environment, but that neat separation need not always exist. EDIT: Also, I should point out explicitly that VNM alone doesn't tell us why we ever expect probabilities to b
What are we assuming about utility functions?

I think you're underestimating VNM here.

only two of those four are relevant to coherence. The main problem is that the axioms relevant to coherence (acyclicity and completeness) do not say anything at all about probability

It seems to me that the independence axiom is a coherence condition, unless I misunderstand what you mean by coherence?

correctly point out problems with VNM

I'm curious what problems you have in mind, since I don't think VNM has problems that don't apply to similar coherence theorems.

VNM utility stipulates that agents h
3johnswentworth2yI would argue that independence of irrelevant alternatives is not a real coherence criterion. It looks like one at first glance: if it's violated, then you get an Allais Paradox-type situation where someone pays to throw a switch and then pays to throw it back. The problem is, the "arbitrage" of throwing the switch back and forth hinges on the assumption that the stated probabilities are objectively correct. It's entirely possible for someone to come along who believes that throwing the switch changes the probabilities in a way that makes it a good deal. Then there's no real arbitrage, it just comes down to whose probabilities better match the outcomes. My intuition for this not being real arbitrage comes from finance. In finance, we'd call it "statistical arbitrage": it only works if the probabilities are correct. The major lesson of the collapse of Long Term Capital Management [https://en.wikipedia.org/wiki/Long-Term_Capital_Management] in the 90's is that statistical arbitrage is definitely not real arbitrage. The whole point of true arbitrage is that it does not depend on your statistical model being correct . This directly leads to the difference between VNM and Bayesian expected utility maximization. In VNM, agents have preferences over lotteries: the probabilities of each outcome are inputs to the preference function. In Bayesian expected utility maximization, the only inputs to the preference function are the choices available to the agent - figuring out the probabilities of each outcome under each choice is the agent's job. (I do agree that we can set up situations where objectively correct probabilities are a reasonable model, e.g. in a casino, but the point of coherence theorems is to be pretty generally applicable. A theorem only relevant to casinos isn't all that interesting.)
[AN #66]: Decomposing robustness into capability robustness and alignment robustness
I do, however, believe that the single step cooperate-defect game which they use to come up with their factors seems like a very simple model for what will be a very complex system of interactions. For example, AI development will take place over time, and it is likely that the same companies will continue to interact with one another. Iterated games have very different dynamics, and I hope that future work will explore how this would affect their current recommendations, and whether it would yield new approaches to incentivizing cooperation.

It may be diff... (read more)

A Critique of Functional Decision Theory

I object to the framing of the bomb scenario on the grounds that low probabilities of high stakes are a source of cognitive bias that trip people up for reasons having nothing to do with FDT. Consider the following decision problem: "There is a button. If you press the button, you will be given $100. Also, pressing the button has a very small (one in a trillion trillion) chance of causing you to burn to death." Most people would not touch that button. Using the same payoffs and probabilies in a scenario to challenge FDT thus exploits cognitive bi... (read more) A Critique of Functional Decision Theory I don't know if I'm a simulation or a real person. A possible response to this argument is that the predictor may be able to accurately predict the agent without explicitly simulating them. A possible counter-response to this is to posit that any sufficiently accurate model of a conscious agent is necessarily conscious itself, whether the model takes the form of an explicit simulation or not. Troll Bridge I think the counterfactuals used by the agent are the correct counterfactuals for someone else to use while reasoning about the agent from the outside, but not the correct counterfactuals for the agent to use while deciding what to do. After all, knowing the agent's source code, if you see it start to cross the bridge, it is correct to infer that it's reasoning is inconsistent, and you should expect to see the troll blow up the bridge. But while deciding what to do, the agent should be able to reason about purely causal effects of its counterfact... (read more) 1Gurkenglas5moSuppose the bridge is safe iff there's a proof that the bridge is safe. Then you would forbid the reasoning "Suppose I cross. I must have proven it's safe. Then it's safe, and I get 10. Let's cross.", which seems sane enough in the face of Löb. 3Abram Demski2yI agree with everything you say here, but I read you as thinking you disagree with me. Yeah, that's the problem I'm pointing at, right? I think we just agree on that? As I responded [https://www.lesswrong.com/posts/hpAbfXtqYC2BrpeiC/troll-bridge-5#YZMPzBkd7ZyEx9G6D] to another comment here: Safely and usefully spectating on AIs optimizing over toy worlds The agent could be programmed to have a certain hard-coded ontology rather than searching through all possible hypotheses weighted by description length. 1Vanessa Kosoy3yMy point is, I don't think it's possible to implement a strong computationally feasible agent which doesn't search through possible hypotheses, because solving the optimization problem for the hard-coded ontology is intractable. In other words, what gives intelligence its power is precisely the search through possible hypotheses. Safely and usefully spectating on AIs optimizing over toy worlds Are you worried about leaks from the abstract computational process into the real world, leaks from the real world into the abstract computational process, or both? (Or maybe neither and I'm misunderstanding your concern?) There will definitely be tons of leaks from the abstract computational process into the real world; just looking at the result is already such a leak. The point is that the AI should have no incentive to optimize such leaks, not that the leaks don't exist, so the existence of additional leaks that we didn't know about shoul... (read more) Probabilistic Tiling (Preliminary Attempt) What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all turing machines, we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on) Wouldn't the set of all action sequences have lower description length than some large finite set of policies? There's also the potential problem that all of the policies in the large finite set you're searching over could be quite far from optimal. Probabilistic Tiling (Preliminary Attempt) Ok, understood on the second assumption. is not a function to , but a function to the set of -valued random variables, and your assumption is that this random variable is uncorrelated with certain claims about the outputs of certain policies. The intuitive explanation of the third condition made sense; my complaint was that even with the intended interpretation at hand, the formal statement made no sense to me. I'm pretty sure you're assuming that is resolved on day , not that it is resolved eventually. Searching over the set of all ... (read more) 1Diffractor3yAh, the formal statement was something like "if the policy A isn't the argmax policy, the successor policy B must be in the policy space of the future argmax, and the action selected by policy A is computed so the relevant equality holds" Yeah, I am assuming fast feedback that it is resolved on day n . What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all turing machines, we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on) Also I'm less confident in conditional future-trust for all conditionals than I used to be, I'll try to crystallize where I think it goes wrong. Safely and usefully spectating on AIs optimizing over toy worlds This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.) Ok, here's another attempt to explain what I meant. Somewhere in the platonic realm of abstract mathematical structures, there is a small world with physics quite a lot like ours, containing an AI running on some idealized computational hardware, and trying to arrange the rest of the small world so that it has some desired property. Human... (read more) 1Rohin Shah3yAh, I see. That does make it seem clearer to me, though I'm not sure what beliefs actually changed. Probabilistic Tiling (Preliminary Attempt) I suggest stating the result you're proving before giving the proof. You have some unusual notation that I think makes some of this unnecessarily confusing. Instead of this underlined vs non-underlined thing, you should have different functions$ and , where the first maps action sequences to utilities, and the second maps a pair consisting of an action and a future policy to the utility of the action sequence beginning with , followed by , followed by the action sequence generated by . Your first assumption ... (read more)

1Diffractor3yFirst: That notation seems helpful. Fairness of the environment isn't present by default, it still needs to be assumed even if the environment is purely action-determined, as you can consider an agent in the environment that is using a hardwired predictor of what the argmax agent would do. It is just a piece of the environment, and feeding a different sequence of actions into the environment as input gets a different score, so the environment is purely action-determined, but it's still unfair in the sense that the expected utility of feeding action x into the function drops sharply if you condition on the argmax agent selecting action x. The third condition was necessary to carry out this step. En(U(a∗1:n−1,a?–––,a2:∞?))=En(U(a∗1:n−1,a1:∞?)) . The intuitive interpretation of the third condition is that, if you know that policy B selects action 4, then you can step from "action 4 is taken" to "policy B takes the actions it takes", and if you have a policy where you don't know what action it takes (third condition is violated), then "policy B does its thing" may have a higher expected utility than any particular action being taken, even in a fair environment that only cares about action sequences, as the hamster dance example shows. Second: I think you misunderstood what I was claiming. I wasn't claiming that logical inductors attain the conditional future-trust property, even in the limit, for all sentences or all true sentences. What I was claiming was: The fact that ϕ is provable or disprovable in the future (in this case, ϕ is ‘‘a∗n=x "), makes the conditional future-trust property hold (I'm fairly sure), and for statements where there isn't guaranteed feedback, the conditional future-trust property may fail. The double-expectation property that you state does not work to carry the proof through, because the proof (from the perspective of the first agent), takes ϕ as an assumption, so the "conditional on ϕ" part has to be outside of the future expectation, when
1Raymond Arnold3yNote: looks like you were trying to use markdown. To use markdown in our editor you need to press cmd-4. (Originally the "\$" notation worked, but people who weren't familiar with LaTeX were consistently confused about to actually type a dollar sign)
Safely and usefully spectating on AIs optimizing over toy worlds

The model I had in mind was that the AI and the toy world are both abstract computational processes with no causal influence from our world, and that we are merely simulating/spectating on both the AI itself and the toy world it optimizes. If the AI messes with people simulating it so that they end up simulating a similar AI with more compute, this can give it more influence over these peoples' simulation of the toy world the AI is optimizing, but it doesn't give the AI any more influence over the abstract computational process that it (another a... (read more)

1Rohin Shah3yThis model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.) I think there's a lot of common sense that humans apply that allows them to design solutions that meet many implicit constraints that they can't easily verbalize. "Thinking outside of the box" is when a human manages to design something that doesn't satisfy one of the constraints, because it turns out that constraint wasn't useful. But in most cases, those constraints are very useful, because they make the search space much smaller. By default, these constraints won't carry over into the virtual world. (Lots of examples of this in The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities [https://arxiv.org/abs/1803.03453])
Safely and usefully spectating on AIs optimizing over toy worlds

I agree. I didn't mean to imply that I thought this step would be easy, and I would also be interested in more concrete ways of doing it. It's possible that creating a hereditarily restricted optimizer along the lines I was suggesting could end up being approximately as difficult as creating an aligned general-purpose optimizer, but I intuitively don't expect this to be the case.

Meta: IAFF vs LessWrong

There should be a chat icon on the bottom-right of the screen on Alignment Forum that you can use to talk to the admins (unless only people who have already been approved can see this?). You can also comment on LW (Alignment Forum posts are automatically crossposted to LW), and ask the admins to make it show up on Alignment Forum afterwards.

Meta: IAFF vs LessWrong

There is a replacement for IIAF now: https://www.alignmentforum.org/

1Jessica Taylor3yApparently "You must be approved by an admin to comment on Alignment Forum", how do I do this? Also is this officially the successor to IAFF? If so it would be good to make that more clear on this website.
Clarifying Consequentialists in the Solomonoff Prior

I don't think that specifying the property of importance is simple and helps narrow down S. I think that in order for predicting S to be important, S must be generated by a simple process. Processes that take large numbers of bits to specify are correspondingly rarely occurring, and thus less useful to predict.

3Paul Christiano3yI don't buy it. A camera that some robot is using to make decisions is no simpler than any other place on Earth, just more important. (This already gives the importance-weighted predictor a benefit of ~log(quadrillion)) Clearly you need to e.g. make the anthropic update and do stuff like that before you have any chance of competing with the consequentialist. This might just be a quantitative difference about how simple is simple---like I said elsewhere, all the action is in the additive constants, I agree that the important things are "simple" in some sense.
Clarifying Consequentialists in the Solomonoff Prior
Suppose that I just specify a generic feature of a simulation that can support life + expansion (the complexity of specifying "a simulation that can support life" is also paid by the intended hypothesis, so we can factor it out). Over a long enough time such a simulation will produce life, that life will spread throughout the simulation, and eventually have some control over many features of that simulation.

Oh yes, I see. That does cut the complexity overhead down a lot.

Once you've specified the agent, it just samples randomly from the di
Clarifying Consequentialists in the Solomonoff Prior

I didn't mean that an agenty Turing machine would find S and then decide that it wants you to correctly predict S. I meant that to the extent that predicting S is commonly useful, there should be a simple underlying reason why it is commonly useful, and this reason should give you a natural way of computing S that does not have the overhead of any agency that decides whether or not it wants you to correctly predict S.

1Paul Christiano3yHow many bits do you think it takes to specify the property "people's predictions about S, using universal prior P, are very important"? (I think you'll need to specify the universal prior P by reference to the universal prior that is actually used in the world containing the string S, if you spell out the prior P explicitly you are already sunk just from the ambiguity in the choice of language.) It seems relatively unlikely to me that this will be cheaper than specifying some arbitrary degree of freedom in a computationally rich universe that life can control (+ the extra log(fraction of degrees of freedom the consequentialists actually choose to control)). Of course it might. I agree that the entire game is in the constants---what is the cheapest way to pick out important strings.
Clarifying Consequentialists in the Solomonoff Prior

This reasoning seems to rely on there being such strings S that are useful to predict far out of proportion to what you would expect from their complexity. But a description of the circumstance in which predicting S is so useful should itself give you a way of specifying S, so I doubt that this is possible.

1Vladimir Mikulik3yI agree. That’s what I meant when I wrote there will be TMs that artificially promote S itself. However, this would still mean that most of S’s mass in the prior would be due to these TMs, and not due to the natural generator of the string. Furthermore, it’s unclear how many TMs would promote S vs S’ or other alternatives. Because of this, I don’t now whether the prior would be higher for S or S’ from this reasoning alone. Whichever is the case, the prior no longer reflects meaningful information about the universe that generates S and whose inhabitants are using the prefix to choose what to do; it’s dominated by these TMs that search for prefixes they can attempt to influence.
A universal score for optimizers

I think decision problems with incomplete information are a better model in which to measure optimization power than deterministic decision problems with complete information are. If the agent knows exactly what payoffs it would get from each action, it is hard to explain why it might not choose the optimal one. In the example I gave, the first agent could have mistakenly concluded that the .9-utility action was better than the 1-utility action while making only small errors in estimating the consequences of each of its actions, while the second agent would need to make large errors in estimating the consequences of its actions in order to think that the .1-utility action was better than the 1-utility action.

Clarifying Consequentialists in the Solomonoff Prior

I'm not convinced that the probability of S' could be pushed up to anything near the probability of S. Specifying an agent that wants to trick you into predicting S' rather than S with high probability when you see their common prefix requires specifying the agency required to plan this type of deception (which should be quite complicated), and specifying the common prefix of S and S' as the particular target for the deception (which, insofar as it makes sense to say that S is the "correct" continuation of the prefix, should h... (read more)

3Paul Christiano3ySuppose that I just specify a generic feature of a simulation that can support life + expansion (the complexity of specifying "a simulation that can support life" is also paid by the intended hypothesis, so we can factor it out). Over a long enough time such a simulation will produce life, that life will spread throughout the simulation, and eventually have some control over many features of that simulation. Once you've specified the agent, it just samples randomly from the distribution of "strings I want to influence." That has a way lower probability than the "natural" complexity of a string I want to influence. For example, if 1/quadrillion strings are important to influence, then the attackers are able to save log(quadrillion) bits.
1Vladimir Mikulik3yI agree that this probably happens when you set out to mess with an arbitrary particular S, I.e. try to make some S’ that shares a prefix with S as likely as S. However, some S are special, in the sense that their prefixes are being used to make very important decisions. If you, as a malicious TM in the prior, perform an exhaustive search of universes, you can narrow down your options to only a few prefixes used to make pivotal decisions, selecting one of those to mess with is then very cheap to specify. I use S to refer to those strings that are the ‘natural’ continuation of those cheap-to-specify prefixes. There are, it seems to me, a bunch of other equally-complex TMs that want to make other strings that share that prefix more likely, including some that promote S itself. What the resulting balance looks like is unclear to me, but what’s clear is that the prior is malign with respect to that prefix - conditioning on that prefix gives you a distribution almost entirely controlled by these malign TMs. The ‘natural’ complexity of S, or of other strings that share the prefix, play almost no role in their priors. The above is of course conditional on this exhaustive search being possible, which also relies on there being anyone in any universe that actually uses the prior to make decisions. Otherwise, we can’t select the prefixes that can be messed with.
An environment for studying counterfactuals

The multi-armed bandit problem is a many-round problem in which actions in early rounds provide information that is useful for later rounds, so it makes sense to explore to gain this information. That's different from using exploration in one-shot problems to make the counterfactuals well-defined, which is a hack.

A universal score for optimizers

Some undesirable properties of C-score:

It depends on how the space of actions are represented. If a set of very similar actions that achieve the same utility for the agent are merged into one action, this will change the agent's C-score.

It does not depend on the magnitudes of the agent's preferences, only on their orderings. Compare 2 agents: the first has 3 available actions, which would give it utilities 0, .9, and 1, respectively, and it picks the action that would give it utility .9. The second has 3 available actions, which would give it uti... (read more)

2levin3yI agree with the first point, and I don't have solid solutions to this. There's also the fact that some games are easier to optimize than others (name a number game I described at the end vs. chess), and this complexity is impossible to capture while staying computation-agnostic. Maybe one can use the length of the shortest proof that taking action a leads to utility u(a) to account for these issues.. The second point is more controversial, my intuition is that first agent is an equally good optimizer, even if it did better in terms of payoffs. Also, at least in the setting of deterministic games, utility functions are arbitrary up to encoding the same preference orderings (once randomness is introduced this stops being true)
The Learning-Theoretic AI Alignment Research Agenda

A related question is, whether it is possible to design an algorithm for strong AI based on simple mathematical principles, or whether any strong AI will inevitably be an enormous kludge of heuristics designed by trial and error. I think that we have some empirical support for the former, given that humans evolved to survive in a certain environment but succeeded to use their intelligence to solve problems in very different environments.

I don't understand this claim. It seems to me that human brains appear to be "an enormous kludge of heuristics designed by trial and error". Shouldn't the success of humans be evidence for the latter?

0Vanessa Kosoy3yThe fact that the human brain was designed by trial and error is a given. However, we don't really know how the brain works. It is possible that the brain contains a simple mathematical core, possibly implemented inefficiently and with bugs and surrounded by tonnes of legacy code, but nevertheless responsible for the broad applicability of human intelligence. Consider the following two views (which might also admit some intermediates): View A: There exists a simple mathematical algorithm M that corresponds to what we call "intelligence" and that allows solving any problem in some very broad natural domain D. View B: What we call intelligence is a collection of a large number of unrelated algorithms tailored to individual problems, and there is no "meta-algorithm" that produces them aside from relatively unsophisticated trial and error. If View B is correct, then we expect that doing trial and error on a collection X of problems will produce an algorithm that solves problems in X and almost only in X. The probability that you were optimizing for X but solved a much larger domain Y is vanishingly small: it is about the same as the probability of a completely random algorithm to solve all problems in Y∖X. If View A is correct, then we expect that doing trial and error on X has a non-negligible chance of producing M (since M is simple and therefore sampled with a relatively large probability), which would be able to solve all of D. So, the fact that homo sapiens evolved in a some prehistoric environment but was able to e.g. land on the moon should be surprising to everyone with View B but not surprising to those with View A.
Some Criticisms of the Logical Induction paper

I think the asymptotic nature of LI is much more worrying than the things you compare it to. In order for an asymptotic result to be encouraging for practical applications, we need it to be an asymptotic result such that in most cases of interest when it holds, you tend to also have good finite results. If you give a algorithm for some problem, this is an encouraging sign that maybe you can develop a practical algorithm. But if a problem is known not to be in , and then you give an exponential-time algorithm for it, and argue that it can probably b

1Vanessa Kosoy4yAlex, the difference between your point of view and my own, is that I don't care that much about the applications of LI to formal logic. I'm much more interested in its applications to forecasting [https://arxiv.org/abs/1705.04630]. If Fletcher is worried about the same issues as you, IMO it is not very clear from 'eir essay. On the other hand, the original paper doesn't talk much about applications outside of formal logic, and my countercriticism might be slightly unfair since it's grounded in a perspective that Fletcher doesn't know about.
A cheating approach to the tiling agents problem

I think the general principle you're taking advantage of is:

Do a small amount of very predictable reasoning in PA+Sound(PA). Then use PA for the rest of the reasoning you have to do. When reasoning about other instances of agents similar to you, simulate their use of Sound(PA), and trust the reasoning that they did while confining themselves to PA.

In your example, PA+Con(PA) sufficed, but PA+Sound(PA) is more flexible in general, in ways that might be important. This also seems to solve the closely related problem of how to trust statements if you know

0Vladimir Slepnev4yI think the right direction is to think about what we want from self-modification informally, and refine that into a toy problem that the agent in the post can't solve. It's confusing in a philosophical kind of way. I don't have the right skill for it, but I've been trying to mine Eliezer's Arbital writings.
Cooperative Oracles: Stratified Pareto Optima and Almost Stratified Pareto Optima

This notion of dependency seems too binary to me. Concretely, let's modify your example from the beginning so that must grant an extra utility to either or , and gets to decide which. Now, everyone's utility depends on everyone's actions, and the game is still zero-sum, so again, so any strategy profile with will be a stratified Pareto optimum. But it seems like and should ignore still ignore .

0Scott Garrabrant4yI agree with this. I think that the most interesting direction of future work is to figure out how to have better notions of dependency. I plan on writing some on this in the future, but basically we have not successfully figured out how to deal with this.
Futarchy Fix

When modeling the incentives to change probabilities of events, it probably makes sense to model the payoff of changing probabilities of events and the cost of changing probabilities of events separately. You'd expect someone to alter the probabilities if they gain more in expectation from the bets than the cost to them of altering the probabilities. If someone bets on an event and changes the probability that it occurs from to , then their expected payoff is times their investment, so if, in a prediction market in which there are possible outcom

Formal Open Problem in Decision Theory

This comment is to explain partial results obtained by David Simmons in this thread, since the results and their proofs are difficult to follow as a result of being distributed across multiple comments, with many errors and corrections. The proofs given here are due to David Simmons.

#Statements

Theorem 1: There is no metric space and uniformly continuous function such that every uniformly continuous function is a fiber of .

Theorem 2: There is no metric space and function that is uniformly continuous on bounded sets, such

Formal Open Problem in Decision Theory

Hm, perhaps I should figure out what the significance of uniform continuity on bounded sets is in constructive analysis before dismissing it, even though I don't see the appeal myself, since constructive analysis is not a field I know much about, but could potentially be relevant here.

is the reciprocal of what it was before, but yes, this looks good. I am happy with this proof.

Intertheoretic utility comparison: simple theory

For strategies: This ties back in to the situation where there's an observable event that you can condition your strategy on, and the strategy space has a product structure . This product structure seems important, since you should generally expect utility functions to factor in the sense that for some functions and , where is the probability of (I think for the relevance section, you want to assume that whenever there is such a product structure, is supported on utility functions that factor, and you can d

0Stuart Armstrong5yOK, got a better formalism: https://agentfoundations.org/item?id=1449
0Stuart Armstrong5yI think I've got something that works; I'll post it tomorrow.
Formal Open Problem in Decision Theory

It appears that comments from new users are collapsed by default, and cannot be replied to without a "Like". These seem like bad features.

Your proof that there's no uniformly continuous on bounded sets function admitting all uniformly continuous on bounded sets functions as fibers looks correct now. It also looks like it can be easily adapted to show that there is no uniformly continuous admitting all uniformly continuous functions as fibers. Come to think of it, your proof works for arbitrary metric spaces , not ju

1David Simmons5yI will have to think more about the issue of continuity vs uniform continuity. I suppose my last remaining argument would be the fact that Bishop--Bridges' classic book on constructive analysis uses uniform continuity on bounded sets rather than continuity, which suggests that it is probably better for constructive analysis at least. But maybe they did not analyze the issue carefully enough, or maybe the relevant issues here are for some reason different. To fix the argument that every locally compact Polish space admits a proper metric, let f be as before and let F(x,y)=∞ if d(x,y)≥f(x) and F(x,y)=f(x)/[f(x) −d(x,y)] if d(x,y)<f(x). Next, let g(y)=minn[n+F(xn,y)], where (xn) is a countable dense sequence. Then g is continuous and everywhere finite. Moreover, if S=g−1([0,N]), then S⊆⋃n≤NB(xn,(1−1/N)f(xn)) and thus S is compact. It follows that the metric d′(x,y)=d(x,y)+|g(y)−g(x)| is proper. Anyway I have fixed the typo in my previous post.
Formal Open Problem in Decision Theory

This is intended as a reply to David Simmons's comment, which for some reason I can't reply to directly.

In the new version of your proof, how do we know isn't too close to for some ? And how do we know that is uniformly continuous on bounded subsets?

About continuity versus uniform continuity on bounded sets:

It seems to me that your point 1 is just a pithier version of your point 4, and that these points support paying attention to uniform continuity, rather than uniform continuity restricted to bounded sets. This version of the problem seem