# All of AlexMennen's Comments + Replies

Oh, derp. You're right.

I think the way I would rule out my counterexample is by strengthening A3 to if  and  then there is ...

5Scott Garrabrant8mo
That does not rule out your counterexample. The condition is never met in your counterexample.

Q2: No. Counterexample: Suppose there's one outcome  such that all lotteries are equally good, except for the lottery that puts probability 1 on , which is worse than the others.

4Scott Garrabrant8mo
I meant the conclusions to all be adding to the previous one, so this actually also answers the main question I stated, by violating continuity, but not the main question I care about. I will edit the post to say that I actually care about concavity, even without continuity.
4Scott Garrabrant8mo
Nice! This, of course, seems like something we should salvage, by e.g. adding an axiom that if A is strictly preferred to B, there should be a lottery strictly between them.
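
To make the counterexample in this thread concrete, here is a minimal Python sketch. The encoding of lotteries as dictionaries and the names `BAD`, `value`, and `mix` are illustrative assumptions, not anything from the original post. It checks that every proper mixture of the bad sure thing with another lottery is exactly as good as the other lottery, so nothing sits strictly between the two.

```python
from fractions import Fraction

# Hypothetical encoding: a lottery is a dict mapping outcome names to probabilities.
BAD = "x"  # stand-in name for the distinguished outcome in the counterexample


def value(lottery):
    """Rank of a lottery: 0 if it puts probability 1 on BAD, otherwise 1."""
    return 0 if lottery.get(BAD, 0) == 1 else 1


def mix(p, a, b):
    """Return the mixed lottery p*a + (1-p)*b."""
    outcomes = set(a) | set(b)
    return {o: p * a.get(o, 0) + (1 - p) * b.get(o, 0) for o in outcomes}


sure_bad = {BAD: Fraction(1)}
other = {"y": Fraction(1)}

# Proper mixtures with weight k/10 on `other`, for k = 1..9.
mixtures = [mix(Fraction(k, 10), other, sure_bad) for k in range(1, 10)]

# Lotteries ranked strictly between `sure_bad` and `other` (there are none).
strictly_between = [
    m for m in mixtures if value(sure_bad) < value(m) < value(other)
]
print(strictly_between)
```

Since `value` only ever returns 0 or 1, no lottery at all can be ranked strictly between the two, which is the gap the axiom proposed in the reply above is meant to close.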

It would kind of use assumption 3 inside step 1, but inside the syntax, rather than in the metalanguage. That is, step 1 involves checking that the number encoding "this proof" does in fact encode a proof of C. This can't be done if you never end up proving C.

One thing that might help make clear what's going on is that you can follow the same proof strategy, but replace "this proof" with "the usual proof of Lob's theorem", and get another valid proof of Lob's theorem, that goes like this: Suppose you can prove that []C->C, and let n be the number encodi...

6Gurkenglas1y
Similarly:
löb = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb -> löb: löb premises □ (□ A → A). By internal necessitation, □ (□ (□ A → A)). By □löb, □ (□ A). By löb's premise, □ A.

If that's how it works, it doesn't lead to a simplified cartoon guide for readers who'll notice missing steps or circular premises; they'd have to first walk through Lob's Theorem in order to follow this "simplified" proof of Lob's Theorem.

It sounds to me like, in the claim "deep learning is uninterpretable", the key word in "deep learning" that makes this claim true is "learning", and you're substituting the similar-sounding but less true claim "deep neural networks are uninterpretable" as something to argue against. You're right that deep neural networks can be interpretable if you hand-pick the semantic meanings of each neuron in advance and carefully design the weights of the network such that these intended semantic meanings are correct, but that's not what deep learning is. The other t...

This seems related in spirit to the fact that time is only partially ordered in physics as well. You could even use special relativity to make a model for concurrency ambiguity in parallel computing: each processor is a parallel worldline, detecting and sending signals at points in spacetime that are spacelike-separated from when the other processors are doing these things. The database follows some unknown worldline, continuously broadcasts its contents, and updates its contents when it receives instructions to do so. The set of possible ways that the pro...

If you mark something like causally inescapable subsets of spacetime (not sure how this should be called), which are something like all unions of future lightcones, as open sets, then specialization preorder on spacetime points will agree with time. This topology on spacetime is non-Frechet (has nontrivial specialization preorder), while the relative topologies it gives on space-like subspaces (loci of states of the world "at a given time" in a loose sense) are Hausdorff, the standard way of giving a topology for such spaces. This seems like the most straightforward setting for treating physical time as logical time.

Wikipedia claims that every sequence is Turing reducible to a random one, giving a positive answer to the non-resource-bounded version of any question of this form. There might be a resource-bounded version of this result as well, but I'm not sure.

This post claims that having the necessary technical skills probably means grad-level education, and also that you should have a broad technical background. While I suppose these claims are probably both true, it's worth pointing out that there's a tension between them, in that PhD programs typically aim to develop narrow skillsets, rather than broad ones. Often the first year of a PhD program will focus on acquiring a moderately broad technical background, and then rapidly get progressively more specialized, until you're writing a thesis, at which point w...

4johnswentworth2y
Strong agree. A lot of the technical material which I think is relevant is typically not taught until the grad level, but that does not mean that actually finishing a PhD program is useful. Indeed, I sometimes joke that dropping out of a PhD program is one of the most widely-recognized credentials by people currently in the field - you get the general technical background skills, and also send a very strong signal of personal agency.

the still-confusing revised slogan that all computable functions are continuous

For anyone who still finds this confusing, I think I can give a pretty quick explanation of this.

The reason I'd imagine it might sound confusing is that you can think of what seem like simple counterexamples. E.g. you can write a short function in your favorite programming language that takes a floating-point real as input, and returns 1 if the input is 0, and returns 0 otherwise. This appears to be a computation of the indicator function for {0}, which is discontinuous. But it ...
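
The apparent counterexample described above can be written in a few lines of Python. This is only a sketch of the function the comment describes; the name `indicator_of_zero` is an assumption made here for illustration.

```python
def indicator_of_zero(x: float) -> int:
    """Return 1 if the floating-point input equals zero, and 0 otherwise."""
    return 1 if x == 0.0 else 0


# The output jumps between 0 and 1 for inputs that are arbitrarily close to zero.
print(indicator_of_zero(0.0))
print(indicator_of_zero(1e-300))
```

On floating-point inputs this behaves like the indicator function of {0}, which is why the slogan can look wrong at first glance.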

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract fo...

OK, here's my position.

As I said in the post, the real answer is that this argument simply does not apply if the agent knows its action. More generally: the argument applies precisely to those actions to which the agent ascribes positive probability (directly before deciding). So, it is possible for agents to maintain a difference between counterfactual and evidential expectations. However, I think it's rarely normatively correct for an agent to be in such a position.

Even though the decision procedure of CDT is deterministic, this does not mean that agents...

2Abram Demski3y
I thought about these things in writing this, but I'll have to think about them again before making a full reply. Another similar scenario would be: we assume the probability of an action is small if it's sub-optimal, but smaller the worse it is.

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results in an end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by...

1Evan Hubinger3y
Yep, that's one of the main concerns. The idea, though, is that all you have to deal with should be a standard overfitting problem, since you don't need the acceptability predicate to work once the model is deceptive, only beforehand. Thus, you should only have to worry about gradient descent overfitting to the acceptability signal, not the model actively trying to trick you, which I think is a solvable overfitting problem. Currently, my hope is that you can do that via using the acceptability signal to enforce an easy-to-verify condition that rules out deception such as myopia.

Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it's just affecting the efficiency with which putting resources towards training leads to good perfor...

My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance.

I think this is incorrect. Most training setups eventually flatline, or close to it (e.g. see AlphaZero's ELO curve), and need algorithmic or other improvements to do better.

I believe the main difference is that training is a one-time cost. Thus lacking training competitiveness is less an issue than lacking performance competitiveness, as the latter is a recurrent cost.

This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution compari
...
In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems
...

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

1Ben Pace4y
Ah, I see, that makes sense.

It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in peoples' heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions ...

3Abram Demski4y
Perhaps it goes without saying, but obviously, both frameworks are flexible enough to allow for most phenomena -- the question here is what is more natural in one framework or another. My main argument is that the procrastination paradox is not natural at all in a Savage framework, as it suggests an uncomputable utility function. I think this plausibly outweighs the issue you're pointing at. But with respect to the issue you are pointing at: In the Savage framework, an outcome already encodes everything you care about. So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios. Alternately, we might say (as Savage said) that the Savage axioms apply to "small worlds" -- small scenarios which the agent abstracts from its experience, such as the decision of whether to break an egg for an omelette. These can be easily considered by the agent, if it can assign values "from outside the problem" in an appropriate way. But then, to account for the breadth of human reasoning, it seems to me we also want an account of things like extending a small world when we find that it isn't sufficient, and coherence between different small-world frames for related decisions. This gives a picture very much like the Jeffrey-Bolker picture, in that we don't really work with outcomes which completely specify everything we care about, but rather, work with a variety of simplified outcomes with coherence requirements between simpler and more complex views. So overall I think it is better to have some picture where you can break things up in a more tractable way, rather than having full outcomes which you need to pass through to get values. In the Jeffrey-Bolker framework, you can re-estimate the value of an event by breaking it up into pieces, estimating the val
3Ben Pace4y
I don't understand JB yet, but when I introspected just now, my experience of decision-making doesn't have any separation between beliefs and values, so I think I disagree with the above. I'll try to explain why by describing my experience. (Note: Long comment below is just saying one very simple thing. Sorry for length. There's a one-line tl;dr at the end.) Right now I'm considering doing three different things. I can go and play a videogame that my friend suggested we play together, I can do some LW work with my colleague, or I can go play some guitar/piano. I feel like the videogame isn't very fun right now because I think the one my friend suggested is not that interesting of a shared experience. I feel like the work is fun because I'm excited about publishing the results of the work, and the work itself involves a kind of cognition I enjoy. And playing piano is fun because I've been skilling up a lot lately and I'm going to accompany some of my housemates in some Hamilton songs. Now, I know some likely ways that what seems valuable to me might change. There are other videogames I've played lately that have been really fascinating and rewarding to play together, that involve problem solving where 2 people can be creative together. I can imagine the work turning out to not actually be the fun part but the boring parts. I can imagine that I've found no traction (skill-up) in playing piano, or that we're going to use a recorded soundtrack rather than my playing for the songs we're learning. All of these to me feel like updates in my understanding of what events are reachable to me; this doesn't feel like changing my utility evaluation of the events. The event of "play videogame while friend watches bored" could change to "play videogame while creatively problem-solving with friend". The event of "gain skill in piano and then later perform songs well with friends" could change to "struggle to do something difficult and sound bad and that's it". If I think about c

I think we're going to have to back up a bit. Call the space of outcomes and the space of Turing machines . It sounds like you're talking about two functions, and . I was thinking of as the utility function we were talking about, but it seems you were thinking of .

You suggested should be computable but should not be. It seems to me that should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing...

1orthonormal4y
I've been using computable to mean a total function (each instance is computable in finite time). I'm thinking of an agent outside a universe about to take an action, and each action will cause that universe to run a particular TM. (You could maybe frame this as "the agent chooses the tape for the TM to run on".) For me, this is analogous to acting in the world and causing the world to shift toward some outcomes over others. By asserting that U should be the computable one, I'm asserting that "how much do I like this outcome" is a more tractable question than "which actions result in this outcome". An intuition pump in a human setting: I can check whether given states of a Go board are victories for one player or the other, or if the game is not yet finished (this is analogous to U being a total computable function). But it's much more difficult to choose, for an unfinished game where I'm told I have a winning strategy, a move such that I still have a winning strategy. The best I can really do as a human is calculate a bit and then guess at how the leaves will probably resolve if we go down them (this is analogous to eval being an enumerable but not necessarily computable function). In general, individual humans are much better at figuring out what outcomes we want than we are at figuring out exactly how to achieve those outcomes. (It would be quite weird if the opposite were the case.) We're not good at either in an absolute sense, of course.

It's not clear to me what this means in the context of a utility function.

1orthonormal4y
Let's talk first about non-embedded agents. Say that I'm given the specification of a Turing machine, and I have a computable utility mapping from output states (including "does not halt") to [0,1]. We presumably agree that is possible. I agree that it's impossible to make a computable mapping from Turing machines to outcomes, so therefore I cannot have a computable utility function from TMs to the reals which assigns the same value to any two TMs with identical output. But I can have a logical inductor which, for each TM, produces a sequence of predictions about that TM's output's utility. Every TM that halts will eventually get the correct utility, and every TM that doesn't will converge to some utility in [0,1], with the usual properties for logical inductors guaranteeing that TMs easily proven to have the same output will converge to the same number, etc. That's a computable sequence of utility functions over TMs with asymptotic good properties. At any stage, I could stop and tell you that I choose some particular TM as the best one as it seems to me now. I haven't really thought in a long while about questions like "do logical inductors' good properties of self-prediction mean that they could avoid the procrastination paradox", so I could be talking nonsense there.

I'm not sure what it would mean for a real-valued function to be enumerable. You could call a function enumerable if there's a program that takes as input and enumerates the rationals that are less than , but I don't think this is what you want, since presumably if a Turing machine halting can generate a positive amount of utility that doesn't depend on the number of steps taken before halting, then it could generate a negative amount of utility by halting as well.
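
As a rough illustration of the notion of "enumerable" considered above, here is a Python sketch for the function that is 1 if a machine halts and 0 otherwise. Everything named here is an assumption made for illustration: a "machine" is modelled as a Python generator that yields once per step and returns when it halts, and `rationals`, `enumerate_below`, `halts_after`, and `runs_forever` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import count, islice


def rationals():
    """Enumerate every rational number; each one recurs infinitely often."""
    yield Fraction(0)
    for n in count(1):
        for num in range(-n * n, n * n + 1):
            yield Fraction(num, n)


def enumerate_below(machine):
    """Enumerate (with repeats) the rationals below f(machine).

    Here f(machine) is 1 if the machine halts and 0 if it runs forever.
    One step of the machine is simulated per candidate rational.
    """
    halted = False
    for q in rationals():
        if not halted:
            try:
                next(machine)
            except StopIteration:
                halted = True
        # Negative rationals are always safe to emit; rationals in [0, 1)
        # are only emitted once the machine has been seen to halt.
        if q < 0 or (halted and q < 1):
            yield q


def halts_after(n):
    """Toy machine that halts after n steps."""
    for _ in range(n):
        yield


def runs_forever():
    """Toy machine that never halts."""
    while True:
        yield


print(list(islice(enumerate_below(halts_after(3)), 10)))
print(list(islice(enumerate_below(runs_forever()), 10)))
```

The same scheme does not work for a function that drops below its default value when the machine halts: rationals that have already been emitted cannot be withdrawn, which is the asymmetry the comment points to.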

I think accepting the type of reasoning you g...

1orthonormal4y
I mean the sort of "eventually approximately consistent over computable patterns" thing exhibited by logical inductors, which is stronger than limit-computability.

we need not assume there are "worlds" at all. ... In mathematics, it brings to mind pointless topology.

I don't think the motivation for this is quite the same as the motivation for pointless topology, which is designed to mimic classical topology in a way that Jeffrey-Bolker-style decision theory does not mimic VNM-style decision theory. In pointless topology, a continuous function of locales is a function from the lattice of open sets of to the lattice of open sets of . So a similar thing here would be to treat a utility funct...

2Abram Demski4y
Part of the point of the JB axioms is that probability is constructed together with utility in the representation theorem, in contrast to VNM, which constructs utility via the representation theorem, but takes probability as basic. This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem. As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities. A similar remark can be made of Savage. Right, that's fair. Although: James Joyce, the big CDT advocate, is quite the Jeffrey-Bolker fan! See Why We Still Need the Logic of Decision for his reasons. Doesn't pointless topology allow for some distinctions which aren't meaningful in pointful topology, though? (I'm not really very familiar, I'm just going off of something I've heard.) Isn't the approach you mention pretty close to JB? You're not modeling the VNM/Savage thing of arbitrary gambles; you're just assigning values (and probabilities) to events, like in JB. Setting aside VNM and Savage and JB, and considering the most common approach in practice -- use the Kolmogorov axioms of probability, and treat utility as a random variable -- it seems like the pointless analogue would be close to what you say. Yeah. The question remains, though: should we think of utility as a function of these minimal elements of the completion? Or not? The computability issue I raise is, to me, suggestive of the negative.

Theorem: Fuzzy beliefs (as in https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v ) form a continuous DCPO. (At least I'm pretty sure this is true. I've only given proof sketches so far)

The relevant definitions:

A fuzzy belief over a set is a concave function such that (where is the space of probability distributions on ). Fuzzy beliefs are partially ordered by ...

Ok, I see what you mean about independence of irrelevant alternatives only being a real coherence condition when the probabilities are objective (or otherwise known to be equal because they come from the same source, even if there isn't an objective way of saying what their common probability is).

But I disagree that this makes VNM only applicable to settings in which all sources of uncertainty have objectively correct probabilities. As I said in my previous comment, you only need there to exist some source of objective probabilities, and you can then ...

1johnswentworth4y
Let me repeat back your argument as I understand it. If we have a Bayesian utility maximizing agent, that's just a probabilistic inference layer with a VNM utility maximizer sitting on top of it. So our would-be arbitrageur comes along with a source of "objective" randomness, like a quantum random number generator. The arbitrageur wants to interact with the VNM layer, so it needs to design bets to which the inference layer assigns some specific probability. It does that by using the "objective" randomness source in the bet design: just incorporate that randomness in such a way that the inference layer assigns the probabilities the arbitrageur wants. This seems correct insofar as it applies. It is a useful perspective, and not one I had thought much about before this, so thanks for bringing it in. The main issue I still don't see resolved by this argument is the architecture question. The coherence theorems only say that an agent must act as if they perform Bayesian inference and then choose the option with highest expected value based on those probabilities. In the agent's actual internal architecture, there need not be separate modules for inference and decision-making (a Kalman filter is one example). If we can't neatly separate the two pieces somehow, then we don't have a good way to construct lotteries with specified probabilities, so we don't have a way to treat the agent as a VNM-type agent. This directly follows from the original main issue: VNM utility theory is built on the idea that probabilities live in the environment, not in the agent. If there's a neat separation between the agent's inference and decision modules, then we can redefine the inference module to be part of the environment, but that neat separation need not always exist. EDIT: Also, I should point out explicitly that VNM alone doesn't tell us why we ever expect probabilities to be relevant to anything in the first place. If we already have a Bayesian expected utility maximizer with sep

I think you're underestimating VNM here.

only two of those four are relevant to coherence. The main problem is that the axioms relevant to coherence (acyclicity and completeness) do not say anything at all about probability

It seems to me that the independence axiom is a coherence condition, unless I misunderstand what you mean by coherence?

correctly point out problems with VNM

I'm curious what problems you have in mind, since I don't think VNM has problems that don't apply to similar coherence theorems.

VNM utility stipulates that agents h
...
3johnswentworth4y
I would argue that independence of irrelevant alternatives is not a real coherence criterion. It looks like one at first glance: if it's violated, then you get an Allais Paradox-type situation where someone pays to throw a switch and then pays to throw it back. The problem is, the "arbitrage" of throwing the switch back and forth hinges on the assumption that the stated probabilities are objectively correct. It's entirely possible for someone to come along who believes that throwing the switch changes the probabilities in a way that makes it a good deal. Then there's no real arbitrage, it just comes down to whose probabilities better match the outcomes. My intuition for this not being real arbitrage comes from finance. In finance, we'd call it "statistical arbitrage": it only works if the probabilities are correct. The major lesson of the collapse of Long Term Capital Management in the 90's is that statistical arbitrage is definitely not real arbitrage. The whole point of true arbitrage is that it does not depend on your statistical model being correct. This directly leads to the difference between VNM and Bayesian expected utility maximization. In VNM, agents have preferences over lotteries: the probabilities of each outcome are inputs to the preference function. In Bayesian expected utility maximization, the only inputs to the preference function are the choices available to the agent - figuring out the probabilities of each outcome under each choice is the agent's job. (I do agree that we can set up situations where objectively correct probabilities are a reasonable model, e.g. in a casino, but the point of coherence theorems is to be pretty generally applicable. A theorem only relevant to casinos isn't all that interesting.)

I do, however, believe that the single step cooperate-defect game which they use to come up with their factors seems like a very simple model for what will be a very complex system of interactions. For example, AI development will take place over time, and it is likely that the same companies will continue to interact with one another. Iterated games have very different dynamics, and I hope that future work will explore how this would affect their current recommendations, and whether it would yield new approaches to incentivizing cooperation.

It may be diff...

I object to the framing of the bomb scenario on the grounds that low probabilities of high stakes are a source of cognitive bias that trip people up for reasons having nothing to do with FDT. Consider the following decision problem: "There is a button. If you press the button, you will be given $100. Also, pressing the button has a very small (one in a trillion trillion) chance of causing you to burn to death." Most people would not touch that button. Using the same payoffs and probabilies in a scenario to challenge FDT thus exploits cognitive bi... I don't know if I'm a simulation or a real person. A possible response to this argument is that the predictor may be able to accurately predict the agent without explicitly simulating them. A possible counter-response to this is to posit that any sufficiently accurate model of a conscious agent is necessarily conscious itself, whether the model takes the form of an explicit simulation or not. I think the counterfactuals used by the agent are the correct counterfactuals for someone else to use while reasoning about the agent from the outside, but not the correct counterfactuals for the agent to use while deciding what to do. After all, knowing the agent's source code, if you see it start to cross the bridge, it is correct to infer that it's reasoning is inconsistent, and you should expect to see the troll blow up the bridge. But while deciding what to do, the agent should be able to reason about purely causal effects of its counterfact... 1Gurkenglas2y Suppose the bridge is safe iff there's a proof that the bridge is safe. Then you would forbid the reasoning "Suppose I cross. I must have proven it's safe. Then it's safe, and I get 10. Let's cross.", which seems sane enough in the face of Löb. 3Abram Demski4y I agree with everything you say here, but I read you as thinking you disagree with me. Yeah, that's the problem I'm pointing at, right? I think we just agree on that? 
As I responded to another comment here: The agent could be programmed to have a certain hard-coded ontology rather than searching through all possible hypotheses weighted by description length. 1Vanessa Kosoy5y My point is, I don't think it's possible to implement a strong computationally feasible agent which doesn't search through possible hypotheses, because solving the optimization problem for the hard-coded ontology is intractable. In other words, what gives intelligence its power is precisely the search through possible hypotheses. Are you worried about leaks from the abstract computational process into the real world, leaks from the real world into the abstract computational process, or both? (Or maybe neither and I'm misunderstanding your concern?) There will definitely be tons of leaks from the abstract computational process into the real world; just looking at the result is already such a leak. The point is that the AI should have no incentive to optimize such leaks, not that the leaks don't exist, so the existence of additional leaks that we didn't know about shoul... What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all turing machines, we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on) Wouldn't the set of all action sequences have lower description length than some large finite set of policies? There's also the potential problem that all of the policies in the large finite set you're searching over could be quite far from optimal. Ok, understood on the second assumption. is not a function to , but a function to the set of -valued random variables, and your assumption is that this random variable is uncorrelated with certain claims about the outputs of certain policies. 
The intuitive explanation of the third condition made sense; my complaint was that even with the intended interpretation at hand, the formal statement made no sense to me. I'm pretty sure you're assuming that is resolved on day , not that it is resolved eventually. Searching over the set of all ...

1Diffractor5y
Ah, the formal statement was something like "if the policy A isn't the argmax policy, the successor policy B must be in the policy space of the future argmax, and the action selected by policy A is computed so the relevant equality holds". Yeah, I am assuming fast feedback that it is resolved on day n. What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all turing machines, we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on). Also I'm less confident in conditional future-trust for all conditionals than I used to be, I'll try to crystallize where I think it goes wrong.

This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.)

Ok, here's another attempt to explain what I meant. Somewhere in the platonic realm of abstract mathematical structures, there is a small world with physics quite a lot like ours, containing an AI running on some idealized computational hardware, and trying to arrange the rest of the small world so that it has some desired property. Human...

1Rohin Shah5y
Ah, I see. That does make it seem clearer to me, though I'm not sure what beliefs actually changed.

I suggest stating the result you're proving before giving the proof. You have some unusual notation that I think makes some of this unnecessarily confusing.
Instead of this underlined vs non-underlined thing, you should have different functions and , where the first maps action sequences to utilities, and the second maps a pair consisting of an action and a future policy to the utility of the action sequence beginning with , followed by , followed by the action sequence generated by . Your first assumption ...

1Diffractor5y
First: That notation seems helpful. Fairness of the environment isn't present by default; it still needs to be assumed even if the environment is purely action-determined, as you can consider an agent in the environment that is using a hardwired predictor of what the argmax agent would do. It is just a piece of the environment, and feeding a different sequence of actions into the environment as input gets a different score, so the environment is purely action-determined, but it's still unfair in the sense that the expected utility of feeding action x into the function drops sharply if you condition on the argmax agent selecting action x. The third condition was necessary to carry out this step: $\mathbb{E}_n(U(a^*_{1:n-1}, \underline{a^?}, a^?_{2:\infty})) = \mathbb{E}_n(U(a^*_{1:n-1}, a^?_{1:\infty}))$. The intuitive interpretation of the third condition is that, if you know that policy B selects action 4, then you can step from "action 4 is taken" to "policy B takes the actions it takes", and if you have a policy where you don't know what action it takes (third condition is violated), then "policy B does its thing" may have a higher expected utility than any particular action being taken, even in a fair environment that only cares about action sequences, as the hamster dance example shows.

Second: I think you misunderstood what I was claiming. I wasn't claiming that logical inductors attain the conditional future-trust property, even in the limit, for all sentences or all true sentences. What I was claiming was: The fact that ϕ is provable or disprovable in the future (in this case, ϕ is "$a^*_n = x$") makes the conditional future-trust property hold (I'm fairly sure), and for statements where there isn't guaranteed feedback, the conditional future-trust property may fail. The double-expectation property that you state does not work to carry the proof through, because the proof (from the perspective of the first agent) takes ϕ as an assumption, so the "conditional on ϕ" part has to be outside of the future expectation, when
1Raymond Arnold5y
Note: looks like you were trying to use markdown. To use markdown in our editor you need to press cmd-4. (Originally the "\$" notation worked, but people who weren't familiar with LaTeX were consistently confused about how to actually type a dollar sign)

The model I had in mind was that the AI and the toy world are both abstract computational processes with no causal influence from our world, and that we are merely simulating/spectating on both the AI itself and the toy world it optimizes. If the AI messes with people simulating it so that they end up simulating a similar AI with more compute, this can give it more influence over these people's simulation of the toy world the AI is optimizing, but it doesn't give the AI any more influence over the abstract computational process that it (another a...

1Rohin Shah5y
This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.) I think there's a lot of common sense that humans apply that allows them to design solutions that meet many implicit constraints that they can't easily verbalize. "Thinking outside of the box" is when a human manages to design something that doesn't satisfy one of the constraints, because it turns out that constraint wasn't useful. But in most cases, those constraints are very useful, because they make the search space much smaller. By default, these constraints won't carry over into the virtual world. (Lots of examples of this in The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities)

I agree. I didn't mean to imply that I thought this step would be easy, and I would also be interested in more concrete ways of doing it. It's possible that creating a hereditarily restricted optimizer along the lines I was suggesting could end up being approximately as difficult as creating an aligned general-purpose optimizer, but I intuitively don't expect this to be the case.

There should be a chat icon on the bottom-right of the screen on Alignment Forum that you can use to talk to the admins (unless only people who have already been approved can see this?). You can also comment on LW (Alignment Forum posts are automatically crossposted to LW), and ask the admins to make it show up on Alignment Forum afterwards.

There is a replacement for IAFF now: https://www.alignmentforum.org/

1Jessica Taylor5y
Apparently "You must be approved by an admin to comment on Alignment Forum". How do I do this? Also, is this officially the successor to IAFF? If so it would be good to make that more clear on this website.

I don't think that specifying the property of importance is simple and helps narrow down S. I think that in order for predicting S to be important, S must be generated by a simple process. Processes that take large numbers of bits to specify are correspondingly rarely occurring, and thus less useful to predict.

3Paul Christiano5y
I don't buy it. A camera that some robot is using to make decisions is no simpler than any other place on Earth, just more important. (This already gives the importance-weighted predictor a benefit of ~log(quadrillion)) Clearly you need to e.g. make the anthropic update and do stuff like that before you have any chance of competing with the consequentialist. This might just be a quantitative difference about how simple is simple---like I said elsewhere, all the action is in the additive constants, I agree that the important things are "simple" in some sense.
Suppose that I just specify a generic feature of a simulation that can support life + expansion (the complexity of specifying "a simulation that can support life" is also paid by the intended hypothesis, so we can factor it out). Over a long enough time such a simulation will produce life, that life will spread throughout the simulation, and eventually have some control over many features of that simulation.

Oh yes, I see. That does cut the complexity overhead down a lot.

Once you've specified the agent, it just samples randomly from the di
...

I didn't mean that an agenty Turing machine would find S and then decide that it wants you to correctly predict S. I meant that to the extent that predicting S is commonly useful, there should be a simple underlying reason why it is commonly useful, and this reason should give you a natural way of computing S that does not have the overhead of any agency that decides whether or not it wants you to correctly predict S.

1Paul Christiano5y
How many bits do you think it takes to specify the property "people's predictions about S, using universal prior P, are very important"? (I think you'll need to specify the universal prior P by reference to the universal prior that is actually used in the world containing the string S, if you spell out the prior P explicitly you are already sunk just from the ambiguity in the choice of language.) It seems relatively unlikely to me that this will be cheaper than specifying some arbitrary degree of freedom in a computationally rich universe that life can control (+ the extra log(fraction of degrees of freedom the consequentialists actually choose to control)). Of course it might. I agree that the entire game is in the constants---what is the cheapest way to pick out important strings.

This reasoning seems to rely on there being such strings S that are useful to predict far out of proportion to what you would expect from their complexity. But a description of the circumstance in which predicting S is so useful should itself give you a way of specifying S, so I doubt that this is possible.

I agree. That’s what I meant when I wrote there will be TMs that artificially promote S itself. However, this would still mean that most of S’s mass in the prior would be due to these TMs, and not due to the natural generator of the string. Furthermore, it’s unclear how many TMs would promote S vs S’ or other alternatives. Because of this, I don’t know whether the prior would be higher for S or S’ from this reasoning alone. Whichever is the case, the prior no longer reflects meaningful information about the universe that generates S and whose inhabitants are using the prefix to choose what to do; it’s dominated by these TMs that search for prefixes they can attempt to influence.

I think decision problems with incomplete information are a better model in which to measure optimization power than deterministic decision problems with complete information are. If the agent knows exactly what payoffs it would get from each action, it is hard to explain why it might not choose the optimal one. In the example I gave, the first agent could have mistakenly concluded that the .9-utility action was better than the 1-utility action while making only small errors in estimating the consequences of each of its actions, while the second agent would need to make large errors in estimating the consequences of its actions in order to think that the .1-utility action was better than the 1-utility action.
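
To make the "small errors vs. large errors" comparison concrete, here is a minimal sketch. It assumes a simple error model that is not in the original comment: each utility estimate is off by at most eps in either direction. Under that assumption, an agent can only rank a worse action above the best one if eps is at least half the gap between them. The function name `min_estimation_error` is made up for illustration.

```python
def min_estimation_error(chosen_utility: float, best_utility: float) -> float:
    """Smallest per-action error bound eps that could explain the choice.

    Illustrative assumption: every estimate lies within eps of the true
    utility. The chosen action can then look at least as good as the best
    one only if chosen + eps >= best - eps, i.e. eps >= gap / 2.
    """
    gap = best_utility - chosen_utility
    return max(gap, 0.0) / 2.0


# First agent: picks the .9-utility action when a 1-utility action exists.
first_agent_error = min_estimation_error(0.9, 1.0)
# Second agent: picks the .1-utility action when a 1-utility action exists.
second_agent_error = min_estimation_error(0.1, 1.0)

print(f"first agent needs errors of at least {first_agent_error:.2f}")
print(f"second agent needs errors of at least {second_agent_error:.2f}")
```

Under this error model the second agent's mistake requires estimation errors roughly nine times larger than the first agent's, which is the asymmetry the comment points at.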

I'm not convinced that the probability of S' could be pushed up to anything near the probability of S. Specifying an agent that wants to trick you into predicting S' rather than S with high probability when you see their common prefix requires specifying the agency required to plan this type of deception (which should be quite complicated), and specifying the common prefix of S and S' as the particular target for the deception (which, insofar as it makes sense to say that S is the "correct" continuation of the prefix, should h...

3Paul Christiano5y
Suppose that I just specify a generic feature of a simulation that can support life + expansion (the complexity of specifying "a simulation that can support life" is also paid by the intended hypothesis, so we can factor it out). Over a long enough time such a simulation will produce life, that life will spread throughout the simulation, and eventually have some control over many features of that simulation. Once you've specified the agent, it just samples randomly from the distribution of "strings I want to influence." That has a way lower probability than the "natural" complexity of a string I want to influence. For example, if 1/quadrillion strings are important to influence, then the attackers are able to save log(quadrillion) bits.
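
For a sense of scale, the log(quadrillion) saving can be worked out directly. This is only a sketch under two assumptions not stated in the comment: the logarithm is base 2 (so the answer is in bits), and "quadrillion" means 10^15. The helper name `bits_saved` is hypothetical.

```python
import math

QUADRILLION = 10**15  # assumption: short-scale quadrillion


def bits_saved(fraction_important: float) -> float:
    """Bits saved by sampling uniformly from the important strings
    instead of paying to single one out among all strings."""
    return -math.log2(fraction_important)


saved = bits_saved(1 / QUADRILLION)
# A saving of k bits corresponds to a 2**k multiplier on prior weight.
prior_boost = 2 ** saved

print(f"bits saved: {saved:.2f}")
print(f"prior weight multiplier: {prior_boost:.3e}")
```

So a 1-in-a-quadrillion importance filter is worth a bit under 50 bits of description length, which is why the additive constants carry the whole argument.
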
I agree that this probably happens when you set out to mess with an arbitrary particular S, i.e. try to make some S’ that shares a prefix with S as likely as S. However, some S are special, in the sense that their prefixes are being used to make very important decisions. If you, as a malicious TM in the prior, perform an exhaustive search of universes, you can narrow down your options to only a few prefixes used to make pivotal decisions; selecting one of those to mess with is then very cheap to specify. I use S to refer to those strings that are the ‘natural’ continuation of those cheap-to-specify prefixes. There are, it seems to me, a bunch of other equally-complex TMs that want to make other strings that share that prefix more likely, including some that promote S itself. What the resulting balance looks like is unclear to me, but what’s clear is that the prior is malign with respect to that prefix - conditioning on that prefix gives you a distribution almost entirely controlled by these malign TMs. The ‘natural’ complexity of S, or of other strings that share the prefix, plays almost no role in their priors. The above is of course conditional on this exhaustive search being possible, which also relies on there being anyone in any universe that actually uses the prior to make decisions. Otherwise, we can’t select the prefixes that can be messed with.

The multi-armed bandit problem is a many-round problem in which actions in early rounds provide information that is useful for later rounds, so it makes sense to explore to gain this information. That's different from using exploration in one-shot problems to make the counterfactuals well-defined, which is a hack.
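
The legitimate, information-gathering use of exploration can be sketched with an epsilon-greedy bandit. Everything here is an illustrative assumption rather than something from the comment: the arms pay fixed amounts (0.2, 0.5, 0.8), estimates start at zero, and `run_bandit` is a made-up helper.

```python
import random


def run_bandit(payoffs, rounds, epsilon, seed=0):
    """Epsilon-greedy play on arms with fixed (deterministic) payoffs.

    With probability epsilon a uniformly random arm is pulled (exploration);
    otherwise the arm with the highest current estimate is pulled.
    Returns the per-arm pull counts and the total reward collected.
    """
    rng = random.Random(seed)
    counts = [0] * len(payoffs)
    estimates = [0.0] * len(payoffs)
    total = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(len(payoffs))
        else:
            # Ties go to the lowest-numbered arm.
            arm = max(range(len(payoffs)), key=lambda i: estimates[i])
        reward = payoffs[arm]
        counts[arm] += 1
        # Incremental running mean of the rewards seen on this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return counts, total


payoffs = [0.2, 0.5, 0.8]
greedy_counts, greedy_total = run_bandit(payoffs, rounds=1000, epsilon=0.0)
explore_counts, explore_total = run_bandit(payoffs, rounds=1000, epsilon=0.1)

print("no exploration:  ", greedy_counts, round(greedy_total, 1))
print("with exploration:", explore_counts, round(explore_total, 1))
```

Without exploration the agent never learns that the other arms pay more and stays on the first arm it tried. Here exploration pays for itself because later rounds reuse what early rounds revealed, which is exactly what a one-shot problem lacks.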

Some undesirable properties of C-score:

It depends on how the space of actions is represented. If a set of very similar actions that achieve the same utility for the agent are merged into one action, this will change the agent's C-score.

It does not depend on the magnitudes of the agent's preferences, only on their orderings. Compare 2 agents: the first has 3 available actions, which would give it utilities 0, .9, and 1, respectively, and it picks the action that would give it utility .9. The second has 3 available actions, which would give it uti...
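
Both properties can be illustrated with a stand-in score. To be clear about the assumptions: the definition of C-score is not reproduced in this thread, so `rank_score` below is a hypothetical ordering-only measure (the fraction of available actions that are no better than the chosen one), not the post's actual formula. The second agent's utilities of 0, .1, and 1 are also assumed, since the comment is cut off before stating them.

```python
def rank_score(utilities, chosen_index):
    """Hypothetical ordering-only score, NOT the post's C-score:
    the fraction of available actions whose utility is no higher
    than that of the chosen action."""
    chosen = utilities[chosen_index]
    return sum(1 for u in utilities if u <= chosen) / len(utilities)


# Magnitudes are ignored: both agents pick their middle-ranked action.
first_agent = rank_score([0.0, 0.9, 1.0], chosen_index=1)
second_agent = rank_score([0.0, 0.1, 1.0], chosen_index=1)  # assumed utilities

# Representation matters: list the 0-utility action as three
# near-identical actions and the first agent's score changes.
first_agent_split = rank_score([0.0, 0.0, 0.0, 0.9, 1.0], chosen_index=3)

print(first_agent, second_agent, first_agent_split)
```

Any score built only from the ordering of actions will behave like this, so the two objections apply to the whole family and not just to this stand-in.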

2levin5y
I agree with the first point, and I don't have solid solutions to this. There's also the fact that some games are easier to optimize than others (the name-a-number game I described at the end vs. chess), and this complexity is impossible to capture while staying computation-agnostic. Maybe one can use the length of the shortest proof that taking action a leads to utility u(a) to account for these issues. The second point is more controversial; my intuition is that the first agent is an equally good optimizer, even if it did better in terms of payoffs. Also, at least in the setting of deterministic games, utility functions are arbitrary up to encoding the same preference orderings (once randomness is introduced this stops being true)

A related question is whether it is possible to design an algorithm for strong AI based on simple mathematical principles, or whether any strong AI will inevitably be an enormous kludge of heuristics designed by trial and error. I think that we have some empirical support for the former, given that humans evolved to survive in a certain environment but succeeded in using their intelligence to solve problems in very different environments.

I don't understand this claim. It seems to me that human brains appear to be "an enormous kludge of heuristics designed by trial and error". Shouldn't the success of humans be evidence for the latter?

0Vanessa Kosoy5y
The fact that the human brain was designed by trial and error is a given. However, we don't really know how the brain works. It is possible that the brain contains a simple mathematical core, possibly implemented inefficiently and with bugs and surrounded by tonnes of legacy code, but nevertheless responsible for the broad applicability of human intelligence. Consider the following two views (which might also admit some intermediates): View A: There exists a simple mathematical algorithm M that corresponds to what we call "intelligence" and that allows solving any problem in some very broad natural domain D. View B: What we call intelligence is a collection of a large number of unrelated algorithms tailored to individual problems, and there is no "meta-algorithm" that produces them aside from relatively unsophisticated trial and error. If View B is correct, then we expect that doing trial and error on a collection X of problems will produce an algorithm that solves problems in X and almost only in X. The probability that you were optimizing for X but solved a much larger domain Y is vanishingly small: it is about the same as the probability that a completely random algorithm solves all problems in Y∖X. If View A is correct, then we expect that doing trial and error on X has a non-negligible chance of producing M (since M is simple and therefore sampled with a relatively large probability), which would be able to solve all of D. So, the fact that homo sapiens evolved in some prehistoric environment but was able to e.g. land on the moon should be surprising to everyone with View B but not surprising to those with View A.