Stuart Armstrong's Comments

ACDT: a hack-y acausal decision theory

If the predictor is near-perfect, but the agent models itself as having access to unpredictable randomness, then the agent will continually try to randomize (which it calculates has expected utility 1), and will continually lose.

It's actually worse than that for CDT. The agent isn't really trying to randomise: it is compelled to model the predictor as a process that is completely disconnected from its own actions, so it can freely pick whichever action the predictor is least likely to predict - according to the CDT agent's own model of the predictor - or pick zero in the case of a tie. So the CDT agent is actually deterministic, and even if you gave it a source of randomness, it wouldn't see any need to use it.
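To make that concrete, here's a minimal Python sketch (the payoffs are my own assumption: the agent scores 1 whenever its action differs from the prediction, 0 otherwise). The CDT-style agent best-responds to the empirical frequencies of past predictions, treated as causally disconnected from its current choice, so its choice is deterministic - and a predictor that simply runs the same procedure beats it almost every round:

```python
import random
from collections import Counter

ACTIONS = [0, 1]

def cdt_choice(past_predictions):
    """Best-respond to the empirical frequencies of past predictions,
    treating the predictor as causally disconnected from the current choice.
    Ties are broken by picking action 0."""
    counts = Counter(past_predictions)
    # Pick the action the predictor has named least often so far.
    return min(ACTIONS, key=lambda a: (counts[a], a))

def near_perfect_predictor(past_predictions, error=0.01):
    """Predict by running the agent's (deterministic) decision procedure."""
    guess = cdt_choice(past_predictions)
    return guess if random.random() > error else 1 - guess

def play(rounds=10_000):
    past_predictions, wins = [], 0
    for _ in range(rounds):
        prediction = near_perfect_predictor(past_predictions)
        action = cdt_choice(past_predictions)
        wins += int(action != prediction)  # assumed payoff: 1 for a mismatch
        past_predictions.append(prediction)
    return wins / rounds

print(play())  # ~0.01: the "freely chosen" action is predicted almost every time
```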

The problem with the previous agent is that it never learns that it has the wrong causal model. If the agent is able to learn a better causal model from experience, then it can learn that it is not actually able to use unpredictable randomness, and so it will no longer expect a 50% chance of winning, and it will stop playing the game.

[...] then it can learn that the predictor can actually predict the agent successfully, and so will no longer expect a 50% [...]
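Continuing the toy sketch above (the fee and payoffs are again my assumptions, not the post's): one crude stand-in for learning a better causal model is simply to track how often the prediction matches the agent's own action. Once that learned accuracy is high enough, the game has negative expected value and the agent declines to play:

```python
def learned_match_rate(history, prior=(1, 1)):
    """Beta-style estimate of P(prediction == my action) from past rounds.
    `history` is a list of booleans: did the prediction match my action?"""
    pseudo_matches, pseudo_misses = prior
    matches = sum(history)
    return (matches + pseudo_matches) / (len(history) + pseudo_matches + pseudo_misses)

def should_play(history, win_payoff=1.0, fee=0.2):
    """Assumed payoffs: win_payoff if the prediction misses, minus a fee
    to enter the game; declining to play is worth 0."""
    p_match = learned_match_rate(history)
    expected_value = (1 - p_match) * win_payoff - fee
    return expected_value > 0

# Early on the agent expects roughly a coin flip and plays; after enough rounds
# of being predicted, p_match -> 1, expected value goes negative, and it stops.
print(should_play([]))           # True  (prior expectation ~50% misses)
print(should_play([True] * 20))  # False (learned that it keeps being predicted)
```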

Preference synthesis illustrated: Star Wars

"I want to be a more generous person": what would you classify that as? Or "I want to want to write"?

When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors

Flo's summary for the Alignment Newsletter:

I like that summary!

I enjoyed this article and the proposed factors match my intuitions. Predicting variable diminishing returns seems especially hard to me. I also worry that the interactions between rewards will be negative-sum, due to resource constraints.

Resource-constrained situations can be positive-sum (consider most of the economy). The real problem is between antagonistic preferences, e.g. maximising flourishing lives vs negative utilitarianism, where a win for one is a loss for the other.
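A toy illustration of the contrast, with made-up numbers:

```python
# Toy numbers: two agents share a fixed budget of 10 units of effort, split
# between producing good A (valued only by agent 1) and good B (valued only
# by agent 2).
def u1(a_units, b_units): return a_units   # agent 1 only values A
def u2(a_units, b_units): return b_units   # agent 2 only values B

naive_split = (5, 5)   # each agent works half-heartedly on both goods
specialised = (8, 8)   # assumed gains from specialisation and trade

print(u1(*naive_split), u2(*naive_split))   # 5 5  -> total 10
print(u1(*specialised), u2(*specialised))   # 8 8  -> total 16, positive-sum

# Antagonistic preferences: one utility is the negative of the other, so any
# change that helps one side hurts the other by exactly the same amount.
def v1(outcome): return outcome
def v2(outcome): return -outcome
print(v1(3) + v2(3), v1(7) + v2(7))         # 0 0  -> strictly zero-sum
```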

Note that this post considers the setting where we have uncertainty over the true reward function, but we can't learn about the true reward function. If you can gather information about the true reward function, which <@seems necessary to me@>(@Human-AI Interaction@), then it is almost always worse to take the most likely reward or expected reward as a proxy reward to optimize.

Yes, if you're in a learning process and treat it as if you weren't in a learning process, things will go wrong ^_^
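Here's a toy worked example of that point (my numbers, not the post's): with two equally likely candidate reward functions, optimizing the expected reward as a fixed proxy locks in a mediocre "safe" action, while gathering the information first and then optimizing does strictly better:

```python
# Two candidate reward functions, equally likely a priori (toy numbers).
rewards = {
    "R1": {"a1": 10, "a2": 0, "safe": 6},
    "R2": {"a1": 0, "a2": 10, "safe": 6},
}
prior = {"R1": 0.5, "R2": 0.5}
actions = ["a1", "a2", "safe"]

# Optimize the expected reward as a fixed proxy: commit to one action now.
def expected_value(action):
    return sum(prior[r] * rewards[r][action] for r in prior)

proxy_action = max(actions, key=expected_value)
print(proxy_action, expected_value(proxy_action))   # 'safe', 6.0

# Gather information first (learn which reward is true), then optimize it.
value_if_learning = sum(
    prior[r] * max(rewards[r][a] for a in actions) for r in prior
)
print(value_if_learning)                            # 10.0 > 6.0
```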

Breaking Oracles: superrationality and acausal trade

Hmm - my approach here seems to have some similarity to your idea.

A Critique of Functional Decision Theory

I have to say, I find these criticisms a bit weak. Going through them:

III. FDT sometimes makes bizarre recommendations

I'd note that successfully navigating Parfit's hitchhiker also involves violating "Guaranteed Payoffs": you pay the driver at a time when there is no uncertainty left, and when you would get higher utility from not paying. So I don't think Guaranteed Payoffs is that sound a principle.

Your bomb example is a bit underdefined, since the predictor is both predicting your actions AND giving you the prediction. If the predictor is simulating you and asking "would you go left after reading a prediction that you are going right?", then you should go left, because, by the probabilities in the setup, you are almost certainly the simulation (this is kind of a "counterfactual Parfit's hitchhiker" situation).

If the predictor doesn't simulate you, and you KNOW they said to go right, you are in a slightly different situation, and you should go right. This is akin to waking up in the middle of the Parfit's hitchhiker scenario, after the driver has already decided to save you, and deciding whether to pay them.

IV. FDT fails to get the answer Y&S want in most instances of the core example that’s supposed to motivate it

This section is incorrect, I think. In this variant, the contents of the boxes are determined not by your decision algorithm, but by your nationality. And of course two-boxing is the right decision in that situation!

the case for one-boxing in Newcomb’s problem didn’t seem to stem from whether the Predictor was running a simulation of me, or just using some other way to predict what I’d do.

But it does depend on things like this. There's no point in one-boxing unless your one-boxing is connected with the predictor believing that you'd one-box. In a simulation, that's the case; in some other situations where the predictor looks at your algorithm, that's also the case. But if the predictor is predicting based on nationality, then you can freely two-box without changing the predictor's prediction.
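A quick worked comparison, using the standard Newcomb payoffs ($1M in the opaque box, $1k in the transparent one); the framing of the nationality case as a fixed, choice-independent probability of the box being full is my own:

```python
MILLION, THOUSAND = 1_000_000, 1_000

def expected_payoff(one_box, p_full):
    """p_full = probability the opaque box is full; in the nationality variant
    it is fixed by nationality and therefore unaffected by your choice."""
    opaque = p_full * MILLION
    return opaque if one_box else opaque + THOUSAND

# Whatever the base rate is, two-boxing gains exactly $1k over one-boxing.
for p_full in (0.1, 0.5, 0.9):
    gain = expected_payoff(False, p_full) - expected_payoff(True, p_full)
    print(p_full, gain)   # always 1000

# Contrast: if the prediction tracks your decision algorithm (e.g. via a
# simulation), p_full is ~1 when you one-box and ~0 when you two-box.
print(expected_payoff(True, 0.99))    # ~990,000
print(expected_payoff(False, 0.01))   # ~11,000
```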

V. Implausible discontinuities

There's nothing implausible about discontinuity in the optimal policy, even if the underlying data is continuous. If p is the probability that we're in a smoking lesion problem rather than a Newcomb problem, then as p changes from 0 to 1, the expected utility of one-boxing falls and the expected utility of two-boxing rises. At some point, the optimal action will jump discontinuously from one to the other.
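A small sketch of that crossover (the payoffs and the base rate of a full box in the lesion-like case are assumptions of mine): both expected utilities vary continuously in p, but the optimal action jumps at a single threshold:

```python
MILLION, THOUSAND = 1_000_000, 1_000
BASE_RATE_FULL = 0.5   # assumed chance the box is full in the lesion-like case

def eu_one_box(p_lesion):
    # With prob (1 - p_lesion) it's Newcomb with a near-perfect predictor, so
    # one-boxing gets ~$1M; with prob p_lesion the box contents are independent
    # of the choice and full with probability BASE_RATE_FULL.
    return (1 - p_lesion) * MILLION + p_lesion * BASE_RATE_FULL * MILLION

def eu_two_box(p_lesion):
    return (1 - p_lesion) * THOUSAND + p_lesion * (BASE_RATE_FULL * MILLION + THOUSAND)

best_so_far = None
for i in range(1001):
    p = i / 1000
    best = "one-box" if eu_one_box(p) > eu_two_box(p) else "two-box"
    if best != best_so_far:
        print(f"p = {p:.3f}: optimal action switches to {best}")
        best_so_far = best
# Both expected utilities change smoothly in p, but the argmax jumps exactly
# once (at p = 0.999 with these assumed payoffs).
```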

VI. FDT is deeply indeterminate

I agree FDT is indeterminate, but I don't agree with your example. Your two calculators are clearly isomorphic, just as if we used a different numbering system for one versus the other. Talking about isomorphic algorithms avoids worrying about whether they're the "same" algorithm.

And in general, it seems to me, there’s no fact of the matter about which algorithm a physical process is implementing in the absence of a particular interpretation of the inputs and outputs of that physical process.

Indeed. But since you and your simulation are isomorphic, you can look at what the consequences are of you outputting "two-box" while your simulation outputs "deux boites" (or "one-box" and "une boite"). And {one-box, une boite} is better than {two-box, deux boites}.

But why did I use those particular interpretations of my and my simulation's physical processes? Because those interpretations are the ones relevant to the problem at hand. My simulation and I have different weights, consume different amounts of power, are run at different times, and probably at different speeds. If those factors were relevant to the Newcomb problem, then the fact that we are different would become relevant. But since they aren't, we can focus on the core of the matter. (You can also consider the example of playing the prisoner's dilemma against an almost-but-not-quite-identical copy of yourself.)
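As a minimal illustration of the isomorphism point (the labels and payoffs here are my own toy choices): two implementations whose outputs are spelled differently can be matched up by an explicit relabelling, and the Newcomb payoffs only care about the relabelled, decision-relevant output:

```python
MILLION, THOUSAND = 1_000_000, 1_000

# Two "physically different" implementations of the same decision procedure:
# they differ in output labels (and could differ in weight, power, speed...).
def me(disposition):
    return "one-box" if disposition else "two-box"

def my_simulation(disposition):
    return "une boite" if disposition else "deux boites"

# The interpretation relevant to the problem: a relabelling of outputs.
translate = {"une boite": "one-box", "deux boites": "two-box"}

def newcomb_payoff(my_choice, simulated_choice):
    """The predictor fills the opaque box iff the simulation one-boxes."""
    opaque = MILLION if translate[simulated_choice] == "one-box" else 0
    return opaque + (THOUSAND if my_choice == "two-box" else 0)

# Because the two processes are isomorphic, my choice and the simulation's
# choice are the same point in the relabelled space:
print(newcomb_payoff(me(True), my_simulation(True)))    # 1,000,000: {one-box, une boite}
print(newcomb_payoff(me(False), my_simulation(False)))  # 1,000: {two-box, deux boites}
```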

2018 AI Alignment Literature Review and Charity Comparison

Very thorough, and it's very worthwhile that posts like this are made.
