I think the way I would rule out my counterexample is by strengthening A3 to if and then there is ...

52mo

That does not rule out your counterexample. The condition is never met in your
counterexample.

Apr 15, 2023

Q2: No. Counterexample: Suppose there's one outcome such that all lotteries are equally good, except for the lottery that puts probability 1 on that outcome, which is worse than the others.

42mo

I meant the conclusions to all be adding to the previous one, so this actually
also answers the main question I stated, by violating continuity, but not the
main question I care about. I will edit the post to say that I actually care
about concavity, even without continuity.

42mo

Nice! This, of course, seems like something we should salvage, by e.g. adding an
axiom that if A is strictly preferred to B, there should be a lottery strictly
between them.

It would kind of use assumption 3 inside step 1, but inside the syntax, rather than in the metalanguage. That is, step 1 involves checking that the number encoding "this proof" does in fact encode a proof of C. This can't be done if you never end up proving C.

One thing that might help make clear what's going on is that you can follow the same proof strategy, but replace "this proof" with "the usual proof of Löb's theorem", and get another valid proof of Löb's theorem, that goes like this: Suppose you can prove that []C->C, and let n be the number encodi...

66mo

Similarly:
löb = □ (□ A → A) → □ A
□löb = □ (□ (□ A → A) → □ A)
□löb -> löb:
löb's premise is □ (□ A → A).
By internal necessitation, □ (□ (□ A → A)).
By □löb, □ (□ A).
By löb's premise, □ A.
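For readers tripping on the bookkeeping, the same derivation can be laid out step by step (a sketch; the numbering and axiom labels are ours, and only the rules of the modal logic K4 — necessitation, distribution, and axiom 4 — are used):

```latex
% Derivation of löb from □löb in K4.
\begin{align*}
&1.\ \Box(\Box A \to A) && \text{löb's premise}\\
&2.\ \Box\Box(\Box A \to A) && \text{from 1 by internal necessitation (axiom 4)}\\
&3.\ \Box\bigl(\Box(\Box A \to A) \to \Box A\bigr) && \Box\text{löb}\\
&4.\ \Box\Box A && \text{from 2 and 3 by distribution (axiom K)}\\
&5.\ \Box A && \text{from 1 and 4 by distribution (axiom K)}
\end{align*}
```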

6mo

If that's how it works, it doesn't lead to a simplified cartoon guide for readers who'll notice missing steps or circular premises; they'd have to first walk through Löb's Theorem in order to follow this "simplified" proof of Löb's Theorem.

1y

It sounds to me like, in the claim "deep learning is uninterpretable", the key word in "deep learning" that makes this claim true is "learning", and you're substituting the similar-sounding but less true claim "deep neural networks are uninterpretable" as something to argue against. You're right that deep neural networks can be interpretable if you hand-pick the semantic meanings of each neuron in advance and carefully design the weights of the network such that these intended semantic meanings are correct, but that's not what deep learning is. The other t...

1y

This seems related in spirit to the fact that time is only partially ordered in physics as well. You could even use special relativity to make a model for concurrency ambiguity in parallel computing: each processor is a parallel worldline, detecting and sending signals at points in spacetime that are spacelike-separated from when the other processors are doing these things. The database follows some unknown worldline, continuously broadcasts its contents, and updates its contents when it receives instructions to do so. The set of possible ways that the pro...
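The spacelike-separation criterion doing the work in this model is easy to state concretely. A minimal sketch (function names and units are ours; c = 1, events are (t, x, y, z)):

```python
# A minimal sketch of the special-relativity model of concurrency described
# above (function names and units are ours; c = 1, events are (t, x, y, z)).

def interval_squared(e1, e2):
    """Minkowski interval s^2 = dt^2 - |dx|^2, signature (+, -, -, -)."""
    dt = e2[0] - e1[0]
    dspace = sum((a - b) ** 2 for a, b in zip(e1[1:], e2[1:]))
    return dt * dt - dspace

def separation(e1, e2):
    """Classify two events as timelike-, spacelike-, or lightlike-separated."""
    s2 = interval_squared(e1, e2)
    if s2 > 0:
        return "timelike"   # one event can causally influence the other
    if s2 < 0:
        return "spacelike"  # no frame-independent ordering: "concurrent"
    return "lightlike"

# Two processors writing at equal coordinate time but different places are
# spacelike-separated, so there is no fact of the matter about which wrote
# first -- the analogue of ambiguous ordering in parallel computing.
assert separation((0, 0, 0, 0), (0, 1, 0, 0)) == "spacelike"
assert separation((2, 1, 0, 0), (0, 0, 0, 0)) == "timelike"
```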

1y

If you mark something like causally inescapable subsets of spacetime (not sure
how this should be called), which are something like all unions of future
lightcones, as open sets, then specialization preorder
[https://www.lesswrong.com/posts/HCibBn3ZCZRwMwNEE/against-time-in-agent-models?commentId=J6XuJtZbpjJTAP8CX]
on spacetime points will agree with time. This topology on spacetime is
non-Frechet (has nontrivial specialization preorder), while the relative
topologies it gives on space-like subspaces (loci of states of the world "at a
given time" in a loose sense) are Hausdorff, the standard way of giving a
topology for such spaces. This seems like the most straightforward setting for
treating physical time as logical time.
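A quick finite sanity check of the claim (the discrete encoding below is our own): on a 1+1-dimensional grid with open sets taken to be unions of future lightcones, the smallest open set containing a point is its own future cone, so the specialization preorder reduces to causal precedence.

```python
# Tiny discrete 1+1-dimensional "spacetime": points are (t, x), with c = 1.
# This is our own toy encoding of the construction described above.

def in_future(p, q):
    """q lies in the (inclusive) future lightcone of p."""
    return q[0] - p[0] >= abs(q[1] - p[1])

# With open sets = unions of future lightcones, the smallest open set
# containing p is future(p) itself, so the specialization preorder
# (our convention: p <= q iff every open set containing p also contains q)
# reduces to causal precedence.
def precedes(p, q):
    return in_future(p, q)

assert precedes((0, 0), (1, 0))       # timelike-related points are ordered
assert not precedes((0, 0), (0, 1))   # spacelike pair: incomparable both ways,
assert not precedes((0, 1), (0, 0))   # which is the non-Frechet behavior
assert precedes((0, 0), (0, 0))       # and the preorder is reflexive
```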

1y

This post claims that having the necessary technical skills probably means grad-level education, and also that you should have a broad technical background. While I suppose these claims are probably both true, it's worth pointing out that there's a tension between them, in that PhD programs typically aim to develop narrow skillsets, rather than broad ones. Often the first year of a PhD program will focus on acquiring a moderately broad technical background, and then rapidly get progressively more specialized, until you're writing a thesis, at which point w...

1y

Strong agree. A lot of the technical material which I think is relevant is
typically not taught until the grad level, but that does not mean that actually
finishing a PhD program is useful. Indeed, I sometimes joke that dropping out of
a PhD program is one of the most widely-recognized credentials by people
currently in the field - you get the general technical background skills, and
also send a very strong signal of personal agency.

1y

the still-confusing revised slogan that all computable functions are continuous

For anyone who still finds this confusing, I think I can give a pretty quick explanation of this.

The reason I'd imagine it might sound confusing is that you can think of what seem like simple counterexamples. E.g. you can write a short function in your favorite programming language that takes a floating-point real as input, and returns 1 if the input is 0, and returns 0 otherwise. This appears to be a computation of the indicator function for {0}, which is discontinuous. But it ...
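The "finite precision" point can be made concrete with a toy sketch (not a full construction of computable reals; names are ours): a program only ever sees rational approximations to its real input, so any answer returned after finitely many queries must be correct for every real consistent with those approximations — i.e., the output is locally constant, hence continuous.

```python
# Toy sketch: a program can query an unknown real x only through
# finite-precision approximations, so its answers are locally constant.

def sign_with_precision(approx, eps):
    """Decide the sign of an unknown real x given only |approx - x| < eps.
    Returns +1 or -1 when the approximation settles it, else None."""
    if approx - eps > 0:
        return 1
    if approx + eps < 0:
        return -1
    return None  # x could be positive, negative, or exactly 0

# No finite precision ever certifies x == 0, which is why the discontinuous
# indicator function of {0} is not computable on genuine reals:
assert sign_with_precision(0.0, 1e-9) is None
assert sign_with_precision(5.0, 1e-9) == 1
assert sign_with_precision(-5.0, 1e-9) == -1
```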

2y

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract fo...
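Concretely, the epsilon-exploration setup in question looks like this (a minimal sketch; the names and the epsilon value are ours): every action is forced to probability at least epsilon, and the agent only chooses where the remaining mass goes.

```python
# Minimal sketch of epsilon exploration: every action gets probability eps,
# and the agent's choice only determines which action gets the remainder.

def exploration_distribution(chosen, actions, eps=0.05):
    n = len(actions)
    assert eps * n < 1, "epsilon too large for this many actions"
    return {a: (1 - eps * (n - 1)) if a == chosen else eps for a in actions}

dist = exploration_distribution("A", ["A", "B", "C"])
assert abs(sum(dist.values()) - 1) < 1e-12   # a valid distribution
assert abs(dist["B"] - 0.05) < 1e-12          # non-chosen actions get eps
assert abs(dist["A"] - 0.90) < 1e-12          # chosen action gets the rest
```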

2y

OK, here's my position.

As I said in the post, the real answer is that this argument simply does not apply if the agent knows its action. More generally: the argument applies precisely to those actions to which the agent ascribes positive probability (directly before deciding). So, it is possible for agents to maintain a difference between counterfactual and evidential expectations. However, I think it's rarely normatively correct for an agent to be in such a position.

Even though the decision procedure of CDT is deterministic, this does not mean that agents...

2y

I thought about these things in writing this, but I'll have to think about them
again before making a full reply.
Another similar scenario would be: we assume the probability of an action is
small if it's sub-optimal, but smaller the worse it is.

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

3y

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results in an end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by...

3y

Yep—that's one of the main concerns. The idea, though, is that all you have to
deal with should be a standard overfitting problem, since you don't need the
acceptability predicate to work once the model is deceptive, only beforehand.
Thus, you should only have to worry about gradient descent overfitting to the
acceptability signal, not the model actively trying to trick you—which I think
is a solvable overfitting problem. Currently, my hope is that you can do that
by using the acceptability signal to enforce an easy-to-verify condition that
rules out deception, such as myopia
[https://www.alignmentforum.org/posts/BKM8uQS6QdJPZLqCr/towards-a-mechanistic-understanding-of-corrigibility].

3y

Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it's just affecting the efficiency with which putting resources towards training leads to good perfor...

3y

My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance.

I think this is incorrect. Most training setups eventually flatline, or close to it (e.g. see AlphaZero's ELO curve), and need algorithmic or other improvements to do better.

3y

I believe the main difference is that training is a one-time cost. Thus lacking
training competitiveness is less of an issue than lacking performance
competitiveness, as the latter is a recurring cost.

This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution compari...

In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems...

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

3y

Ah, I see, that makes sense.

3y

It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in peoples' heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions ...

3y

Perhaps it goes without saying, but obviously, both frameworks are flexible
enough to allow for most phenomena -- the question here is what is more natural
in one framework or another.
My main argument is that the procrastination paradox is not natural at all in a
Savage framework, as it suggests an uncomputable utility function. I think this
plausibly outweighs the issue you're pointing at.
But with respect to the issue you are pointing at:
In the Savage framework, an outcome already encodes everything you care about.
So the computation which seems to be suggested by Savage is to think of these
maximally-specified outcomes, assigning them probability and utility, and then
combining those to get expected utility. This seems to be very demanding: it
requires imagining these very detailed scenarios.
Alternately, we might say (as Savage said) that the Savage axioms apply to
"small worlds" -- small scenarios which the agent abstracts from its experience,
such as the decision of whether to break an egg for an omelette. These can be
easily considered by the agent, if it can assign values "from outside the
problem" in an appropriate way.
But then, to account for the breadth of human reasoning, it seems to me we also
want an account of things like extending a small world when we find that it
isn't sufficient, and coherence between different small-world frames for related
decisions.
This gives a picture very much like the Jeffrey-Bolker picture, in that we don't
really work with outcomes which completely specify everything we care about, but
rather, work with a variety of simplified outcomes with coherence requirements
between simpler and more complex views.
So overall I think it is better to have some picture where you can break things
up in a more tractable way, rather than having full outcomes which you need to
pass through to get values.
In the Jeffrey-Bolker framework, you can re-estimate the value of an event by
breaking it up into pieces, estimating the val

3y

I don't understand JB yet, but when I introspected just now, my experience of
decision-making doesn't have any separation between beliefs and values, so I
think I disagree with the above. I'll try to explain why by describing my
experience. (Note: Long comment below is just saying one very simple thing.
Sorry for length. There's a one-line tl;dr at the end.)
Right now I'm considering doing three different things. I can go and play a
videogame that my friend suggested we play together, I can do some LW work with
my colleague, or I can go play some guitar/piano. I feel like the videogame
isn't very fun right now because I think the one my friend suggested isn't that
interesting of a shared experience. I feel like the work is fun because I'm
excited about publishing the results of the work, and the work itself involves a
kind of cognition I enjoy. And playing piano is fun because I've been skilling
up a lot lately and I'm going to accompany some of my housemates in some
Hamilton songs.
Now, I know some likely ways that what seems valuable to me might change. There
are other videogames I've played lately that have been really fascinating and
rewarding to play together, that involve problem solving where 2 people can be
creative together. I can imagine the work turning out to not actually be the fun
part but the boring parts. I can imagine that I've found no traction (skill-up)
in playing piano, or that we're going to use a recorded soundtrack rather than
my playing for the songs we're learning.
All of these to me feel like updates in my understanding of what events are
reachable to me; this doesn't feel like changing my utility evaluation of the
events. The event of "play videogame while friend watches bored" could change to
"play videogame while creatively problem-solving with friend". The event of
"gain skill in piano and then later perform songs well with friends" could
change to "struggle to do something difficult and sound bad and that's it".
If I think about c

I think we're going to have to back up a bit. Call the space of outcomes O and the space of Turing machines T. It sounds like you're talking about two functions, U : O → [0,1] and eval : T → O. I was thinking of U ∘ eval as the utility function we were talking about, but it seems you were thinking of U.

You suggested U should be computable but eval should not be. It seems to me that eval should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing...

3y

I've been using computable to mean a total function (each instance is computable
in finite time).
I'm thinking of an agent outside a universe about to take an action, and each
action will cause that universe to run a particular TM. (You could maybe frame
this as "the agent chooses the tape for the TM to run on".) For me, this is
analogous to acting in the world and causing the world to shift toward some
outcomes over others.
By asserting that U should be the computable one, I'm asserting that "how much
do I like this outcome" is a more tractable question than "which actions result
in this outcome".
An intuition pump in a human setting:
I can check whether given states of a Go board are victories for one player or
the other, or if the game is not yet finished (this is analogous to U being a
total computable function). But it's much more difficult to choose, for an
unfinished game where I'm told I have a winning strategy, a move such that I
still have a winning strategy. The best I can really do as a human is calculate
a bit and then guess at how the leaves will probably resolve if we go down them
(this is analogous to eval being an enumerable but not necessarily computable
function).
In general, individual humans are much better at figuring out what outcomes we
want than we are at figuring out exactly how to achieve those outcomes. (It
would be quite weird if the opposite were the case.) We're not good at either in
an absolute sense, of course.
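The division of labor being argued for can be sketched like this (a toy stand-in, with a fuel-bounded runner in place of actually running a Turing machine; the names follow the thread's U and eval):

```python
from itertools import count

def U(outcome):
    """Total computable utility over outcomes -- analogous to scoring a
    finished Go position."""
    return {"win": 1.0, "loss": 0.0, "draw": 0.5}[outcome]

def eval_program(program, fuel):
    """Run `program` (a generator that may eventually yield an outcome) for
    at most `fuel` steps.  Without the fuel bound, eval is only a partial
    function: it diverges exactly when the program does, which is why the
    composite U(eval(program)) is not a total computable function."""
    for step, item in enumerate(program):
        if item is not None:
            return item
        if step >= fuel:
            return None  # undecided: more fuel might still produce an answer

halts = (("win" if i == 3 else None) for i in count())
loops = (None for _ in count())

assert U(eval_program(halts, fuel=100)) == 1.0
assert eval_program(loops, fuel=100) is None
```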

3y

Let's talk first about non-embedded agents.
Say that I'm given the specification of a Turing machine, and I have a
computable utility mapping from output states (including "does not halt") to
[0,1]. We presumably agree that this is possible.
I agree that it's impossible to make a computable mapping from Turing machines
to outcomes, so therefore I cannot have a computable utility function from TMs
to the reals which assigns the same value to any two TMs with identical output.
But I can have a logical inductor which, for each TM, produces a sequence of
predictions about that TM's output's utility. Every TM that halts will
eventually get the correct utility, and every TM that doesn't will converge to
some utility in [0,1], with the usual properties for logical inductors
guaranteeing that TMs easily proven to have the same output will converge to the
same number, etc.
That's a computable sequence of utility functions over TMs with asymptotic good
properties. At any stage, I could stop and tell you that I choose some
particular TM as the best one as it seems to me now.
I haven't really thought in a long while about questions like "do logical
inductors' good properties of self-prediction mean that they could avoid the
procrastination paradox", so I could be talking nonsense there.

I'm not sure what it would mean for a real-valued function to be enumerable. You could call a function f enumerable if there's a program that takes x as input and enumerates the rationals that are less than f(x), but I don't think this is what you want, since presumably if a Turing machine halting can generate a positive amount of utility that doesn't depend on the number of steps taken before halting, then it could generate a negative amount of utility by halting as well.
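For concreteness, "enumerable" in the sense just described ("lower semicomputable" is the standard term, if I have it right) means an interface like the following sketch, illustrated with a computable stand-in:

```python
from fractions import Fraction

# Sketch of "enumerable from below": a program emits an increasing sequence
# of rational lower bounds converging to the value, without ever certifying
# how close it has gotten.

def lower_approximations(terms):
    """Yield increasing partial sums: rational lower bounds on the limit."""
    total = Fraction(0)
    for t in terms:
        total += t
        yield total

approx = list(lower_approximations(Fraction(1, 2 ** i) for i in range(1, 5)))
assert approx[-1] == Fraction(15, 16)                   # approaches 1 from below
assert all(a < b for a, b in zip(approx, approx[1:]))   # strictly increasing
```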

I think accepting the type of reasoning you g...

3y

I mean the sort of "eventually approximately consistent over computable
patterns" thing exhibited by logical inductors, which is stronger than
limit-computability.

3y

we need not assume there are "worlds" at all. ...In mathematics, it brings to mind pointless topology.

I don't think the motivation for this is quite the same as the motivation for pointless topology, which is designed to mimic classical topology in a way that Jeffrey-Bolker-style decision theory does not mimic VNM-style decision theory. In pointless topology, a continuous function from a locale X to a locale Y is a map from the lattice of open sets of Y to the lattice of open sets of X. So a similar thing here would be to treat a utility funct...

3y

Part of the point of the JB axioms is that probability is constructed together
with utility in the representation theorem, in contrast to VNM, which constructs
utility via the representation theorem, but takes probability as basic.
This makes Savage a better comparison point, since the Savage axioms are more
similar to the VNM framework while also trying to construct probability and
utility together with one representation theorem.
As a representation theorem, this makes VNM weaker and JB stronger: VNM requires
stronger assumptions (it requires that the preference structure include
information about all these probability-distribution comparisons), where JB only
requires preference comparison of events which the agent sees as real
possibilities. A similar remark can be made of Savage.
Right, that's fair. Although: James Joyce, the big CDT advocate, is quite the
Jeffrey-Bolker fan! See Why We Still Need the Logic of Decision
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.321.7829&rep=rep1&type=pdf]
for his reasons.
Doesn't pointless topology allow for some distinctions which aren't meaningful
in pointful topology, though? (I'm not really very familiar, I'm just going off
of something I've heard.)
Isn't the approach you mention pretty close to JB? You're not modeling the
VNM/Savage thing of arbitrary gambles; you're just assigning values (and
probabilities) to events, like in JB.
Setting aside VNM and Savage and JB, and considering the most common approach in
practice -- use the Kolmogorov axioms of probability, and treat utility as a
random variable -- it seems like the pointless analogue would be close to what
you say.
Yeah. The question remains, though: should we think of utility as a function of
these minimal elements of the completion? Or not? The computability issue I
raise is, to me, suggestive of the negative.
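That most-common-in-practice approach, in miniature (toy numbers, and a finite outcome space so the integral is just a sum): a Kolmogorov probability measure over outcomes, utility as a random variable, and expected utility as the integral of one against the other.

```python
# Kolmogorov-style probability measure plus utility-as-a-random-variable,
# with expected utility as their (finite) integral.

outcomes = {"rain": 0.3, "sun": 0.7}   # probability measure over outcomes
utility = {"rain": -1.0, "sun": 2.0}   # utility as a random variable

expected_utility = sum(p * utility[w] for w, p in outcomes.items())
assert abs(expected_utility - 1.1) < 1e-12   # 0.3 * (-1) + 0.7 * 2
```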

Theorem: Fuzzy beliefs (as in https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v ) form a continuous DCPO. (At least I'm pretty sure this is true. I've only given proof sketches so far)

The relevant definitions:

A fuzzy belief over a set is a concave function such that (where is the space of probability distributions on ). Fuzzy beliefs are partially ordered by ...

Ok, I see what you mean about independence of irrelevant alternatives only being a real coherence condition when the probabilities are objective (or otherwise known to be equal because they come from the same source, even if there isn't an objective way of saying what their common probability is).

But I disagree that this makes VNM only applicable to settings in which all sources of uncertainty have objectively correct probabilities. As I said in my previous comment, you only need there to exist some source of objective probabilities, and you can then ...

4y

Let me repeat back your argument as I understand it.
If we have a Bayesian utility maximizing agent, that's just a probabilistic
inference layer with a VNM utility maximizer sitting on top of it. So our
would-be arbitrageur comes along with a source of "objective" randomness, like a
quantum random number generator. The arbitrageur wants to interact with the VNM
layer, so it needs to design bets to which the inference layer assigns some
specific probability. It does that by using the "objective" randomness source in
the bet design: just incorporate that randomness in such a way that the
inference layer assigns the probabilities the arbitrageur wants.
This seems correct insofar as it applies. It is a useful perspective, and not
one I had thought much about before this, so thanks for bringing it in.
The main issue I still don't see resolved by this argument is the architecture
question
[https://www.lesswrong.com/posts/osxNg6yBCJ4ur9hpi/does-agent-like-behavior-imply-agent-like-architecture].
The coherence theorems only say that an agent must act as if they perform
Bayesian inference and then choose the option with highest expected value based
on those probabilities. In the agent's actual internal architecture, there need
not be separate modules for inference and decision-making (a Kalman filter is
one example). If we can't neatly separate the two pieces somehow, then we don't
have a good way to construct lotteries with specified probabilities, so we don't
have a way to treat the agent as a VNM-type agent.
This directly follows from the original main issue: VNM utility theory is built
on the idea that probabilities live in the environment, not in the agent. If
there's a neat separation between the agent's inference and decision modules,
then we can redefine the inference module to be part of the environment, but
that neat separation need not always exist.
EDIT: Also, I should point out explicitly that VNM alone doesn't tell us why we
ever expect probabilities to be

I think you're underestimating VNM here.

only two of those four are relevant to coherence. The main problem is that the axioms relevant to coherence (acyclicity and completeness) do not say anything at all about probability

It seems to me that the independence axiom is a coherence condition, unless I misunderstand what you mean by coherence?

correctly point out problems with VNM

I'm curious what problems you have in mind, since I don't think VNM has problems that don't apply to similar coherence theorems.

VNM utility stipulates that agents h...

4y

I would argue that independence of irrelevant alternatives is not a real
coherence criterion. It looks like one at first glance: if it's violated, then
you get an Allais Paradox-type situation where someone pays to throw a switch
and then pays to throw it back. The problem is, the "arbitrage" of throwing the
switch back and forth hinges on the assumption that the stated probabilities are
objectively correct. It's entirely possible for someone to come along who
believes that throwing the switch changes the probabilities in a way that makes
it a good deal. Then there's no real arbitrage, it just comes down to whose
probabilities better match the outcomes.
My intuition for this not being real arbitrage comes from finance. In finance,
we'd call it "statistical arbitrage": it only works if the probabilities are
correct. The major lesson of the collapse of Long Term Capital Management
[https://en.wikipedia.org/wiki/Long-Term_Capital_Management] in the 90's is that
statistical arbitrage is definitely not real arbitrage. The whole point of true
arbitrage is that it does not depend on your statistical model being correct.
This directly leads to the difference between VNM and Bayesian expected utility
maximization. In VNM, agents have preferences over lotteries: the probabilities
of each outcome are inputs to the preference function. In Bayesian expected
utility maximization, the only inputs to the preference function are the choices
available to the agent - figuring out the probabilities of each outcome under
each choice is the agent's job.
(I do agree that we can set up situations where objectively correct
probabilities are a reasonable model, e.g. in a casino, but the point of
coherence theorems is to be pretty generally applicable. A theorem only relevant
to casinos isn't all that interesting.)

4y

I do, however, believe that the single-step cooperate-defect game which they use to come up with their factors seems like a very simple model for what will be a very complex system of interactions. For example, AI development will take place over time, and it is likely that the same companies will continue to interact with one another. Iterated games have very different dynamics, and I hope that future work will explore how this would affect their current recommendations, and whether it would yield new approaches to incentivizing cooperation.
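As a toy illustration of how much the iterated setting changes the one-shot game (strategy names are standard; the payoff numbers are the usual prisoner's-dilemma ones, not anything from the paper under discussion):

```python
# Iterated prisoner's dilemma in miniature. Payoffs (row, col) per round.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strat_a, strat_b, rounds):
    """Each strategy maps the opponent's move history to a move."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        pa, pb = PAYOFF[(a, b)]
        hist_a.append(a); hist_b.append(b)
        score_a += pa; score_b += pb
    return score_a, score_b

tit_for_tat = lambda opp: "C" if not opp else opp[-1]
always_defect = lambda opp: "D"

# Defection dominates each single round, yet conditional cooperators outscore
# mutual defectors over repeated play -- the shift in dynamics noted above:
assert play(tit_for_tat, tit_for_tat, 10) == (30, 30)
assert play(always_defect, always_defect, 10) == (10, 10)
```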

It may be diff...

4y

I object to the framing of the bomb scenario on the grounds that low probabilities of high stakes are a source of cognitive bias that trips people up for reasons having nothing to do with FDT. Consider the following decision problem: "There is a button. If you press the button, you will be given $100. Also, pressing the button has a very small (one in a trillion trillion) chance of causing you to burn to death." Most people would not touch that button. Using the same payoffs and probabilities in a scenario to challenge FDT thus exploits cognitive bi...

I don't know if I'm a simulation or a real person.

A possible response to this argument is that the predictor may be able to accurately predict the agent without explicitly simulating them. A possible counter-response to this is to posit that any sufficiently accurate model of a conscious agent is necessarily conscious itself, whether the model takes the form of an explicit simulation or not.

I think the counterfactuals used by the agent are the correct counterfactuals for someone else to use while reasoning about the agent from the outside, but not the correct counterfactuals for the agent to use while deciding what to do. After all, knowing the agent's source code, if you see it start to cross the bridge, it is correct to infer that its reasoning is inconsistent, and you should expect to see the troll blow up the bridge. But while deciding what to do, the agent should be able to reason about purely causal effects of its counterfact...

2y

Suppose the bridge is safe iff there's a proof that the bridge is safe. Then you
would forbid the reasoning "Suppose I cross. I must have proven it's safe. Then
it's safe, and I get 10. Let's cross.", which seems sane enough in the face of
Löb.

4y

I agree with everything you say here, but I read you as thinking you disagree
with me.
Yeah, that's the problem I'm pointing at, right?
I think we just agree on that? As I responded
[https://www.lesswrong.com/posts/hpAbfXtqYC2BrpeiC/troll-bridge-5#YZMPzBkd7ZyEx9G6D]
to another comment here:

The agent could be programmed to have a certain hard-coded ontology rather than searching through all possible hypotheses weighted by description length.

4y

My point is, I don't think it's possible to implement a strong computationally
feasible agent which doesn't search through possible hypotheses, because solving
the optimization problem for the hard-coded ontology is intractable. In other
words, what gives intelligence its power is precisely the search through
possible hypotheses.

Are you worried about leaks from the abstract computational process into the real world, leaks from the real world into the abstract computational process, or both? (Or maybe neither and I'm misunderstanding your concern?)

There will definitely be tons of leaks from the abstract computational process into the real world; just looking at the result is already such a leak. The point is that the AI should have no incentive to optimize such leaks, not that the leaks don't exist, so the existence of additional leaks that we didn't know about shoul...

What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all turing machines, we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on)

Wouldn't the set of all action sequences have lower description length than some large finite set of policies? There's also the potential problem that all of the policies in the large finite set you're searching over could be quite far from optimal.

Ok, understood on the second assumption. is not a function to , but a function to the set of -valued random variables, and your assumption is that this random variable is uncorrelated with certain claims about the outputs of certain policies. The intuitive explanation of the third condition made sense; my complaint was that even with the intended interpretation at hand, the formal statement made no sense to me.

I'm pretty sure you're assuming that ϕ is resolved on day n, not that it is resolved eventually.

Searching over the set of all ...


Ah, the formal statement was something like "if the policy A isn't the argmax
policy, the successor policy B must be in the policy space of the future argmax,
and the action selected by policy A is computed so the relevant equality holds".
Yeah, I am assuming fast feedback that it is resolved on day n.
Also I'm less confident in conditional future-trust for all conditionals than I
used to be, I'll try to crystallize where I think it goes wrong.

This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you're suggesting we build an AI without preferences, but it doesn't sound like that.)

Ok, here's another attempt to explain what I meant. Somewhere in the platonic realm of abstract mathematical structures, there is a small world with physics quite a lot like ours, containing an AI running on some idealized computational hardware, and trying to arrange the rest of the small world so that it has some desired property. Human...


Ah, I see. That does make it seem clearer to me, though I'm not sure what
beliefs actually changed.


I suggest stating the result you're proving before giving the proof.

You have some unusual notation that I think makes some of this unnecessarily confusing. Instead of this underlined vs non-underlined thing, you should have different functions $ and , where the first maps action sequences to utilities, and the second maps a pair consisting of an action and a future policy to the utility of the action sequence beginning with , followed by , followed by the action sequence generated by . Your first assumption ...


First: That notation seems helpful. Fairness of the environment isn't present by
default, it still needs to be assumed even if the environment is purely
action-determined, as you can consider an agent in the environment that is using
a hardwired predictor of what the argmax agent would do. It is just a piece of
the environment, and feeding a different sequence of actions into the
environment as input gets a different score, so the environment is purely
action-determined, but it's still unfair in the sense that the expected utility
of feeding action x into the function drops sharply if you condition on the
argmax agent selecting action x. The third condition was necessary to carry out
this step: $\mathbb{E}_n(U(a^*_{1:n-1}, \underline{a^?}, a^?_{2:\infty})) = \mathbb{E}_n(U(a^*_{1:n-1}, a^?_{1:\infty}))$. The intuitive
interpretation of the third condition is that, if you know that policy B selects
action 4, then you can step from "action 4 is taken" to "policy B takes the
actions it takes", and if you have a policy where you don't know what action it
takes (third condition is violated), then "policy B does its thing" may have a
higher expected utility than any particular action being taken, even in a fair
environment that only cares about action sequences, as the hamster dance example
shows.
Second: I think you misunderstood what I was claiming. I wasn't claiming that
logical inductors attain the conditional future-trust property, even in the
limit, for all sentences or all true sentences. What I was claiming was: The
fact that ϕ is provable or disprovable in the future (in this case, ϕ is
‘‘a∗n=x"), makes the conditional future-trust property hold (I'm fairly sure),
and for statements where there isn't guaranteed feedback, the conditional
future-trust property may fail. The double-expectation property that you state
does not work to carry the proof through, because the proof (from the
perspective of the first agent), takes ϕ as an assumption, so the "conditional
on ϕ" part has to be outside of the future expectation, when


Note: looks like you were trying to use markdown. To use markdown in our editor
you need to press cmd-4. (Originally the "$" notation worked, but people who
weren't familiar with LaTeX were consistently confused about how to actually
type a dollar sign.)

The model I had in mind was that the AI and the toy world are both abstract computational processes with no causal influence from our world, and that we are merely simulating/spectating on both the AI itself and the toy world it optimizes. If the AI messes with people simulating it so that they end up simulating a similar AI with more compute, this can give it more influence over these peoples' simulation of the toy world the AI is optimizing, but it doesn't give the AI any more influence over the abstract computational process that it (another a...


I think there's a lot of common sense that humans apply that allows them to
design solutions that meet many implicit constraints that they can't easily
verbalize. "Thinking outside of the box" is when a human manages to design
something that doesn't satisfy one of the constraints, because it turns out that
constraint wasn't useful. But in most cases, those constraints are very useful,
because they make the search space much smaller. By default, these constraints
won't carry over into the virtual world.
(Lots of examples of this in The Surprising Creativity of Digital Evolution: A
Collection of Anecdotes from the Evolutionary Computation and Artificial Life
Research Communities [https://arxiv.org/abs/1803.03453])

I agree. I didn't mean to imply that I thought this step would be easy, and I would also be interested in more concrete ways of doing it. It's possible that creating a hereditarily restricted optimizer along the lines I was suggesting could end up being approximately as difficult as creating an aligned general-purpose optimizer, but I intuitively don't expect this to be the case.

There should be a chat icon on the bottom-right of the screen on Alignment Forum that you can use to talk to the admins (unless only people who have already been approved can see this?). You can also comment on LW (Alignment Forum posts are automatically crossposted to LW), and ask the admins to make it show up on Alignment Forum afterwards.


Apparently "You must be approved by an admin to comment on Alignment Forum", how
do I do this?
Also, is this officially the successor to IAFF? If so, it would be good to make
that clearer on this website.

I don't think that specifying the property of importance is simple and helps narrow down S. I think that in order for predicting S to be important, S must be generated by a simple process. Processes that take large numbers of bits to specify occur correspondingly rarely, and are thus less useful to predict.


I don't buy it. A camera that some robot is using to make decisions is no
simpler than any other place on Earth, just more important.
(This already gives the importance-weighted predictor a benefit of
~log(quadrillion))
Clearly you need to e.g. make the anthropic update and do stuff like that before
you have any chance of competing with the consequentialist. This might just be a
quantitative difference about how simple is simple---like I said elsewhere, all
the action is in the additive constants, I agree that the important things are
"simple" in some sense.

Suppose that I just specify a generic feature of a simulation that can support life + expansion (the complexity of specifying "a simulation that can support life" is also paid by the intended hypothesis, so we can factor it out). Over a long enough time such a simulation will produce life, that life will spread throughout the simulation, and eventually have some control over many features of that simulation.

Oh yes, I see. That does cut the complexity overhead down a lot.

Once you've specified the agent, it just samples randomly from the di...

I didn't mean that an agenty Turing machine would find S and then decide that it wants you to correctly predict S. I meant that to the extent that predicting S is commonly useful, there should be a simple underlying reason why it is commonly useful, and this reason should give you a natural way of computing S that does not have the overhead of any agency that decides whether or not it wants you to correctly predict S.


How many bits do you think it takes to specify the property "people's
predictions about S, using universal prior P, are very important"?
(I think you'll need to specify the universal prior P by reference to the
universal prior that is actually used in the world containing the string S; if
you spell out the prior P explicitly you are already sunk just from the
ambiguity in the choice of language.)
It seems relatively unlikely to me that this will be cheaper than specifying
some arbitrary degree of freedom in a computationally rich universe that life
can control (+ the extra log(fraction of degrees of freedom the
consequentialists actually choose to control)). Of course it might.
I agree that the entire game is in the constants---what is the cheapest way to
pick out important strings.

This reasoning seems to rely on there being such strings S that are useful to predict far out of proportion to what you would expect from their complexity. But a description of the circumstance in which predicting S is so useful should itself give you a way of specifying S, so I doubt that this is possible.


I agree. That’s what I meant when I wrote there will be TMs that artificially
promote S itself. However, this would still mean that most of S’s mass in the
prior would be due to these TMs, and not due to the natural generator of the
string.
Furthermore, it’s unclear how many TMs would promote S vs S’ or other
alternatives. Because of this, I don't know whether the prior would be higher for
S or S’ from this reasoning alone. Whichever is the case, the prior no longer
reflects meaningful information about the universe that generates S and whose
inhabitants are using the prefix to choose what to do; it’s dominated by these
TMs that search for prefixes they can attempt to influence.

I think decision problems with incomplete information are a better model in which to measure optimization power than deterministic decision problems with complete information are. If the agent knows exactly what payoffs it would get from each action, it is hard to explain why it might not choose the optimal one. In the example I gave, the first agent could have mistakenly concluded that the .9-utility action was better than the 1-utility action while making only small errors in estimating the consequences of each of its actions, while the second agent would need to make large errors in estimating the consequences of its actions in order to think that the .1-utility action was better than the 1-utility action.
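To make this concrete, here's a toy simulation (the noise model and all numbers are my own illustrative assumptions, not from the comment): each agent's estimate of an action's utility is the true utility plus Gaussian noise. Small estimation errors frequently flip the .9-vs-1 comparison, but essentially never the .1-vs-1 comparison.

```python
import random

def flip_probability(u_best, u_picked, noise, trials=100_000):
    """Estimate how often noisy utility estimates make u_picked
    look better than u_best (i.e. how often the agent errs)."""
    random.seed(0)
    flips = 0
    for _ in range(trials):
        est_best = u_best + random.gauss(0, noise)
        est_picked = u_picked + random.gauss(0, noise)
        if est_picked > est_best:
            flips += 1
    return flips / trials

# Small errors (sd 0.1) often suffice to prefer .9 over 1 ...
p_first = flip_probability(1.0, 0.9, noise=0.1)
# ... but almost never suffice to prefer .1 over 1.
p_second = flip_probability(1.0, 0.1, noise=0.1)
print(p_first, p_second)
```

With these assumed numbers, the first flip happens roughly a quarter of the time, while the second is vanishingly rare, which is the asymmetry the comment points at.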


I'm not convinced that the probability of S' could be pushed up to anything near the probability of S. Specifying an agent that wants to trick you into predicting S' rather than S with high probability when you see their common prefix requires specifying the agency required to plan this type of deception (which should be quite complicated), and specifying the common prefix of S and S' as the particular target for the deception (which, insofar as it makes sense to say that S is the "correct" continuation of the prefix, should h...


Once you've specified the agent, it just samples randomly from the distribution
of "strings I want to influence." That has a way lower probability than the
"natural" complexity of a string I want to influence. For example, if
1/quadrillion strings are important to influence, then the attackers are able to
save log(quadrillion) bits.
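For concreteness, the arithmetic behind that last sentence (taking a quadrillion as 10^15; this is just a worked number, not anything from the comment):

```python
import math

# Bits saved by sampling from the 1-in-a-quadrillion set of
# "important" strings rather than locating one explicitly.
bits_saved = math.log2(1e15)
print(round(bits_saved, 1))  # ~49.8 bits
```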


I agree that this probably happens when you set out to mess with an arbitrary
particular S, i.e. try to make some S' that shares a prefix with S as likely as
S.
However, some S are special, in the sense that their prefixes are being used to
make very important decisions. If you, as a malicious TM in the prior, perform
an exhaustive search of universes, you can narrow down your options to only a
few prefixes used to make pivotal decisions; selecting one of those to mess with
is then very cheap to specify. I use S to refer to those strings that are the
'natural' continuation of those cheap-to-specify prefixes.
There are, it seems to me, a bunch of other equally-complex TMs that want to
make other strings that share that prefix more likely, including some that
promote S itself. What the resulting balance looks like is unclear to me, but
what’s clear is that the prior is malign with respect to that prefix -
conditioning on that prefix gives you a distribution almost entirely controlled
by these malign TMs. The ‘natural’ complexity of S, or of other strings that
share the prefix, play almost no role in their priors.
The above is of course conditional on this exhaustive search being possible,
which also relies on there being anyone in any universe that actually uses the
prior to make decisions. Otherwise, we can’t select the prefixes that can be
messed with.

The multi-armed bandit problem is a many-round problem in which actions in early rounds provide information that is useful for later rounds, so it makes sense to explore to gain this information. That's different from using exploration in one-shot problems to make the counterfactuals well-defined, which is a hack.
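A minimal sketch of the first half of that point, with assumed arm means and exploration rate (none of these numbers are from the comment): an epsilon-greedy agent's exploratory pulls in early rounds buy information that its greedy pulls exploit in later rounds, which only pays off because the problem is many-round.

```python
import random

def epsilon_greedy(arm_means, epsilon=0.1, rounds=10_000, seed=0):
    """Run epsilon-greedy on a Gaussian bandit; exploration in
    early rounds yields estimates that later greedy pulls exploit."""
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    totals = [0.0] * len(arm_means)
    reward = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(len(arm_means))   # explore
        else:
            est = [totals[i] / counts[i] for i in range(len(arm_means))]
            arm = est.index(max(est))             # exploit current estimates
        r = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward / rounds

avg = epsilon_greedy([0.0, 0.5, 1.0])
print(avg)
```

In a one-shot problem there are no later rounds to exploit the information, so the same exploration step would be pure loss, aside from making counterfactuals well-defined.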


Some undesirable properties of C-score:

It depends on how the space of actions are represented. If a set of very similar actions that achieve the same utility for the agent are merged into one action, this will change the agent's C-score.

It does not depend on the magnitudes of the agent's preferences, only on their orderings. Compare 2 agents: the first has 3 available actions, which would give it utilities 0, .9, and 1, respectively, and it picks the action that would give it utility .9. The second has 3 available actions, which would give it uti...


I agree with the first point, and I don't have solid solutions to this. There's
also the fact that some games are easier to optimize than others (the
name-a-number game I described at the end vs. chess), and this complexity is
impossible to capture while staying computation-agnostic. Maybe one can use the
length of the shortest proof that taking action a leads to utility u(a) to
account for these issues.
The second point is more controversial; my intuition is that the first agent is
an equally good optimizer, even if it did better in terms of payoffs. Also, at
least in the setting of deterministic games, utility functions are arbitrary up
to encoding the same preference orderings (once randomness is introduced this
stops being true).


A related question is, whether it is possible to design an algorithm for strong AI based on simple mathematical principles, or whether any strong AI will inevitably be an enormous kludge of heuristics designed by trial and error. I think that we have some empirical support for the former, given that humans evolved to survive in a certain environment but succeeded to use their intelligence to solve problems in very different environments.

I don't understand this claim. It seems to me that human brains appear to be "an enormous kludge of heuristics designed by trial and error". Shouldn't the success of humans be evidence for the latter?


The fact that the human brain was designed by trial and error is a given.
However, we don't really know how the brain works. It is possible that the brain
contains a simple mathematical core, possibly implemented inefficiently and with
bugs and surrounded by tonnes of legacy code, but nevertheless responsible for
the broad applicability of human intelligence.
Consider the following two views (which might also admit some intermediates):
View A: There exists a simple mathematical algorithm M that corresponds to what
we call "intelligence" and that allows solving any problem in some very broad
natural domain D.
View B: What we call intelligence is a collection of a large number of unrelated
algorithms tailored to individual problems, and there is no "meta-algorithm"
that produces them aside from relatively unsophisticated trial and error.
If View B is correct, then we expect that doing trial and error on a collection
X of problems will produce an algorithm that solves problems in X and almost
only in X. The probability that you were optimizing for X but solved a much
larger domain Y is vanishingly small: it is about the same as the probability
that a completely random algorithm solves all problems in Y∖X.
If View A is correct, then we expect that doing trial and error on X has a
non-negligible chance of producing M (since M is simple and therefore sampled
with a relatively large probability), which would be able to solve all of D.
So, the fact that homo sapiens evolved in some prehistoric environment but was
able to e.g. land on the moon should be surprising to everyone with View B but
not surprising to those with View A.

Oh, derp. You're right.