# All of Michele Campolo's Comments + Replies

Naturalism and AI alignment

I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.

In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way, analogous to creating a false belief...

Naturalism and AI alignment

If there is a superintelligent AI that ends up being aligned as I've written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough.

From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.

3 · Daniel Kokotajlo · 15d

I think this is an interesting point -- but I don't conclude optimism from it as you do. Humans engage in explicit reasoning about what they should do, and they theorize and systematize, and some of them really enjoy doing this and become philosophers so they can do it a lot, and some of them conclude things like "The thing to do is maximize total happiness" or "You can do whatever you want, subject to the constraint that you obey the categorical imperative" or as you say "everyone should care about conscious experiences."

The problem is that every single one of those theories developed so far has either been (1) catastrophically wrong, (2) too vague, or (3) relative to the speaker's intuitions somehow (e.g. intuitionism). By "catastrophically wrong" I mean that if an AI with control of the whole world actually followed through on the theory, they would kill everyone or do something similarly bad. (See e.g. classical utilitarianism as the classic example of this.)

Basically... I think you are totally right that some of our early AI systems will do philosophy and come to all sorts of interesting conclusions, but I don't expect them to be the correct conclusions. (My metaethical views may be lurking in the background here, driving my intuitions about this... see Eliezer's comment.)

Do you have an account of how philosophical reasoning in general, or about morality in particular, is truth-tracking? Can we ensure that the AIs we build reason in a truth-tracking way? If truth isn't the right concept for thinking about morality, and instead we need to think about e.g. "human values" or "my values," then this is basically a version of the alignment problem.

Naturalism and AI alignment

1 From Arbital:

The Orthogonality Thesis states "there exists at least one possible agent such that..."

Also my claim is an existential claim, and I find it valuable because it could be an opportunity to design aligned AI.

2 Arbital claims that orthogonality doesn't require moral relativism, so it doesn't seem incompatible with what I am calling naturalism in the post.

3 I am ok with rejecting positions similar to what Arbital calls universalist moral internalism. Statements like "All agents do X" cannot be exact.

Naturalism and AI alignment

I am aware of interpretability issues. This is why, for AI alignment, I am more interested in the agent described at the beginning of Part II than Scientist AI.

Thanks for the link to the sequence on concepts, I found it interesting!

1 · Charlie Steiner · 23d

Wow, I'm really sorry for my bad reading comprehension. Anyhow, I'm skeptical that scientist AI part 2 would end up doing the right thing (regardless of our ability to interpret it). I'm curious if you think this could be settled without building a superintelligent AI of uncertain goals, or if you'd really want to see the "full scale" test.

Decision Theory is multifaceted

Ok, if you want to clarify—I'd like to—we can have a call, or discuss in other ways. I'll contact you somewhere else.

Decision Theory is multifaceted

Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails and you were told it was tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads and you were told it was heads.

Here there is no question, so I assume it is something like: "What do you do?" or "What is your policy?"

That formulation is analogous to standard counterfactual mugging, stated in th...

1 · Chris_Leong · 8mo

To be honest, this thread has gone on long enough that I think we should end it here. It seems to me that you are quite confused about this whole issue, though I guess from your perspective it seems like I am the one who is confused. I considered asking a third person to try looking at this thread, but I decided it wasn't worth calling in a favour. I made a slight edit to my description of Counterfactual Prisoner's Dilemma, but I don't think this will really help you understand:

Decision Theory is multifaceted

It seems you are arguing for the position that I called "the first intuition" in my post. Before knowing the outcome, the best you can do is (pay, pay), because that leads to 9900.

On the other hand, as in standard counterfactual mugging, you could be asked: "You know that, this time, the coin came up tails. What do you do?". And here the second intuition applies: the DM can decide to not pay (in this case) and to pay when heads. Omega recognises the intent of the DM, and gives 10000.

Maybe you are not even considering the second intuitio...

1 · Chris_Leong · 8mo

I am considering the second intuition. Acting according to it results in you receiving $0 in Counterfactual Prisoner's Dilemma, instead of losing $100. This is because if you act updatefully when it comes up heads, you have to also act updatefully when it comes up tails. If this still doesn't make sense, I'd encourage you to reread the post.

Decision Theory is multifaceted

If the DM knows the outcome is heads, why can't he not pay in that case and decide to pay in the other case? In other words: why can't he adopt the policy (not pay when heads; pay when tails), which leads to 10000?

1 · Chris_Leong · 8mo

If you pre-commit to that strategy (heads don't pay, tails pay) it provides 10000, but it only works half the time. If you decide, after you see the coin, not to pay in that case, then this will lead to the strategy (not pay, not pay), which provides 0.
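
The payoffs being argued over in this exchange can be tabulated with a short sketch. It assumes the problem exactly as quoted earlier in the thread ($100 asked, $10,000 paid if Omega predicts you would have paid on the other outcome) and treats a policy as a pair (pay on heads, pay on tails). The names `payoff` and `payoff_table` are illustrative, not taken from either post.

```python
from itertools import product

PAYMENT = 100      # amount Omega asks for
REWARD = 10_000    # amount Omega pays if you would have paid on the other outcome


def payoff(policy, coin):
    """Net payoff of a policy when the coin lands on `coin`.

    `policy` maps "heads"/"tails" to True (pay) or False (refuse).
    Omega is assumed to predict the policy perfectly.
    """
    other = "tails" if coin == "heads" else "heads"
    cost = PAYMENT if policy[coin] else 0
    prize = REWARD if policy[other] else 0
    return prize - cost


def payoff_table():
    """Map (pay_on_heads, pay_on_tails) to (payoff if heads, payoff if tails)."""
    table = {}
    for pay_heads, pay_tails in product([True, False], repeat=2):
        policy = {"heads": pay_heads, "tails": pay_tails}
        table[(pay_heads, pay_tails)] = (
            payoff(policy, "heads"),
            payoff(policy, "tails"),
        )
    return table


if __name__ == "__main__":
    for (pay_heads, pay_tails), (if_heads, if_tails) in payoff_table().items():
        print(f"pay on heads={pay_heads}, pay on tails={pay_tails}: "
              f"heads -> {if_heads}, tails -> {if_tails}")
```

Under these assumptions (pay, pay) yields 9900 on either outcome, (not pay, not pay) yields 0, and the mixed policies yield 10000 on one outcome and -100 on the other, which is the "only works half the time" case.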
Decision Theory is multifaceted

The fact that it is "guaranteed" utility doesn't make a significant difference: my analysis still applies. After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition).

1 · Chris_Leong · 8mo

"After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition)" - No you can't. The only way to get 10,000 is to pay in the case where the coin comes up the opposite way from how it actually comes up. And that's only a 50/50 chance.

Decision Theory is multifaceted

Hi Chris!

Suppose the predictor knows that if it writes M on the paper you'll choose N and if it writes N on the paper you'll choose M. Further, if it writes nothing you'll choose M. That isn't a problem since regardless of what it writes it would have predicted your choice correctly. It just can't write down the choice without making you choose the opposite.

My point in the post is that the paradoxical situation occurs when the prediction outcome is communicated to the decision maker. We have a seemingly correct prediction, the ...

1 · Chris_Leong · 8mo

Well, you can only predict conditional on what you write, you can't predict unconditionally. However, once you've fixed what you'll write in order to make a prediction, you can't then change what you'll write in response to that prediction.

Actually, it isn't about utility in expectation. If you are the kind of person who pays you gain $9900, if you aren't you gain $100. This is guaranteed utility, not expected utility.
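
The point about conditional prediction can be made concrete with a small sketch. It models the decision maker from the quoted example as a function from what is written on the paper to a choice; the function names are hypothetical and the three possible messages ("M", "N", or nothing) are the only ones considered.

```python
def contrarian(message):
    """Decision maker from the quoted example (a hypothetical model of it).

    Chooses the opposite of whatever is written, and M if nothing is written.
    """
    if message == "M":
        return "N"
    if message == "N":
        return "M"
    return "M"  # nothing written on the paper


def conditional_predictions(agent, messages):
    """What the predictor can know: the choice conditional on each message."""
    return {message: agent(message) for message in messages}


def self_fulfilling_messages(agent, messages):
    """Messages that could be written down and still match the resulting choice."""
    return [m for m in messages if m is not None and agent(m) == m]


MESSAGES = ["M", "N", None]  # None stands for writing nothing

if __name__ == "__main__":
    print(conditional_predictions(contrarian, MESSAGES))
    print(self_fulfilling_messages(contrarian, MESSAGES))
```

For this agent every conditional prediction is well defined, but the list of self-fulfilling messages is empty: the predictor knows the choice for each message, yet no written choice survives being communicated.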
Goals and short descriptions

I wouldn't say goals as short descriptions are necessarily "part of the world".

Anyway, locality definitely seems useful to make a distinction in this case.

Goals and short descriptions

No worries, I think your comment still provides good food for thought!

Goals and short descriptions

I'm not sure I understand the search vs discriminative distinction. If my hand touches fire and thus immediately moves backwards by reflex, would this be an example of a discriminative policy, because an input signal directly causes an action without being processed in the brain?

About the goal of winning at chess: in the case of minimax search, the search generates the complete tree of the game using p and then selects the winning policy; as you said, this is probably the simplest agent (in terms of Kolmogorov complexity, given p) that wins at chess, an...

2 · Steve Byrnes · 10mo

Hmm, maybe we're talking past each other. Let's say I have something like AlphaZero, where 50,000 bytes of machine code trains an AlphaZero-type chess-playing agent, whose core is a 1-billion-parameter ConvNet. The ConvNet takes 1 billion bytes to specify. Meanwhile, the reward-calculator p, which calculates whether checkmate has occurred, is 100 bytes of machine code. Would you say that the complexity of the trained chess-playing agent is 100 bytes or 50,000 bytes or 1 billion bytes?

I guess you're going to say 50,000, because you're imagining a Turing machine that spends a year doing the self-play to calculate the billion-parameter ConvNet and then immediately the same Turing machine starts running that ConvNet it just calculated. From the perspective of Kolmogorov complexity, it doesn't matter that it spends a year calculating the ConvNet, as long as it does so eventually.

By the same token, you can always turn a search-y agent into an equivalent discriminative-y agent, given infinite processing time and storage, by training the latter on a googol queries of the former. If you're thinking about Kolmogorov complexity, then you don't care about a googol queries, as long as it works eventually.

Therefore, my first comment is not really relevant to what you're thinking about. Sorry. I was not thinking about algorithms-that-write-arbitrary-code-and-then-immediately-run-it, I was thinking about the complexity of the algorithms that are actually in operation as the agent acts in the world.

Yes. But the lack-of-processing-in-the-brain is not the important part. A typical ConvNet image classifier does involve many steps of processing, but is still discriminative, not search-y, because it does not work by trying out different generative models and picking the one that best explains the data. You can build a search-y image classifier that does exactly that [https://arxiv.org/abs/1805.09190], but most people these days don't.
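
As a rough illustration of the search-y end of this distinction, here is a minimal exhaustive game-tree search. Chess is far too large to run this way, so the sketch swaps in a tiny stand-in game (a single heap of stones, each player removes 1 to 3, whoever takes the last stone wins); the game, the function names, and the use of memoisation are all assumptions made for the example. The point is only that the program text stays short because it is built around a short terminal test, however expensive the search it triggers.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # stones a player may remove in one turn (stand-in game)


def game_over(stones):
    """Short terminal test playing the role of the checkmate detector."""
    return stones == 0


@lru_cache(maxsize=None)
def can_win(stones):
    """True if the player to move can force a win from this position."""
    if game_over(stones):
        # The previous player took the last stone, so the player to move lost.
        return False
    # A position is winning if some move leads to a losing position for the opponent.
    return any(not can_win(stones - m) for m in MOVES if m <= stones)


def best_move(stones):
    """Return a winning move found by search, or None if every move loses."""
    for m in MOVES:
        if m <= stones and not can_win(stones - m):
            return m
    return None


if __name__ == "__main__":
    for heap in range(1, 9):
        print(heap, can_win(heap), best_move(heap))
```

A discriminative counterpart would be a lookup table or trained classifier mapping `stones` directly to a move, with no tree expansion at decision time.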
Goals and short descriptions

The others in the AISC group and I discussed the example that you mentioned more than once. I agree with you that such an agent is not goal-directed, mainly because it doesn't do anything to ensure that it will be able to perform action A even if adverse events happen.

It is still true that action A is a short description of the behaviour of that agent and one could interpret action A as its goal, although the agent is not good at pursuing it ("robustness" could be an appropriate term to indicate what the agent is lacking).

1 · Adam Shimi · 10mo

Maybe the criterion that removes this specific policy is locality [https://www.alignmentforum.org/s/DTnoFhDm7ZT2ecJMw/p/HkWB5KCJQ2aLsMzjt]? What I mean is that this policy has a goal only on its output (which action it chooses), and thus a very local goal. Since the intuition of goals as short descriptions assumes that goals are "part of the world", maybe this only applies to non-local goals.

Dutch-Booking CDT: Revised Argument

The part that I don't get is why the agent betting ahead of time implies evaluation according to edt, while the agent reasoning during its action implies evaluation according to cdt. Sorry if I'm missing something trivial, but I'd like an explanation because this seems a fundamental part of the argument.

3 · Abram Demski · 1y

Oh right, OK. That's because of the general assumption that rational agents bet according to their beliefs. If a CDT agent doesn't think of a bet as intervening on a situation, then when betting ahead of time, it'll just bet according to its probabilities. But during the decision, it is using the modified (interventional) probabilities. That's how CDT makes decisions. So any bets which have to be made simultaneously, as part of the decision, will be evaluated according to those modified beliefs.
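
A toy calculation may help show how the two evaluations can come apart. Everything in it is an assumption made for illustration and none of it comes from the post: a hidden binary state S that is correlated with the action, a utility U = 10·S + A, and made-up probabilities. `edt(a)` computes the conditional expectation E(U|Act=a), while `cdt(a)` computes the interventional expectation, which keeps the prior over S fixed when the action is set.

```python
# Toy model (all numbers are illustrative assumptions, not from the post).
P_S = {1: 0.5, 0: 0.5}             # prior over the hidden state S
P_A1_GIVEN_S = {1: 0.9, 0: 0.1}    # P(Act = 1 | S): the action is correlated with S


def utility(s, a):
    """U depends mostly on the state and slightly on the action."""
    return 10 * s + a


def p_joint(s, a):
    """Joint probability P(S = s, Act = a) under the agent's ordinary beliefs."""
    p_a_given_s = P_A1_GIVEN_S[s] if a == 1 else 1 - P_A1_GIVEN_S[s]
    return P_S[s] * p_a_given_s


def edt(a):
    """Conditional expectation E(U | Act = a): how a bet placed ahead of time is valued."""
    p_act = sum(p_joint(s, a) for s in P_S)
    return sum(p_joint(s, a) * utility(s, a) for s in P_S) / p_act


def cdt(a):
    """Interventional expectation: setting the action leaves the prior over S unchanged."""
    return sum(P_S[s] * utility(s, a) for s in P_S)


if __name__ == "__main__":
    for action in (0, 1):
        print(f"a={action}: edt={edt(action):.2f}, cdt={cdt(action):.2f}")
```

In this toy model the two valuations of the same action differ (for a = 1, about 10 versus 6), and a difference of this kind between the price accepted ahead of time and the price accepted during the decision is what a bookmaker would need in order to profit.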
Dutch-Booking CDT: Revised Argument

I've noticed that one could read the argument and say: "Ok, an agent evaluates a parameter U differently at different times. Thus, a bookmaker exploits the agent with a bet/certificate whose value depends on U. What's special about this?"

Of course the answer lies in the difference between cdt(a) and edt(a), specifically you wrote:

The key point here is that because the agent is betting ahead of time, it will evaluate the value of this bet according to the conditional expectation E(U|Act=a).

and

Now, since the agent is reasoning during its