gjm

Hi. I'm Gareth McCaughan. I've been a consistent reader and occasional commenter since the Overcoming Bias days. My LW username is "gjm" (not "Gjm" despite the wiki software's preference for that capitalization). Elsewhere I generally go by one of "g", "gjm", or "gjm11". The URL listed here is for my website and blog, neither of which has been substantially updated in about the last four years. I live near Cambridge (UK) and work for a small technology company in Cambridge. My business cards say "mathematician" but in practice my work is a mixture of simulation, data analysis, algorithm design, software development, problem-solving, and whatever random engineering no one else is doing. I am married and have a daughter born in mid-2006. The best way to contact me is by email: firstname dot lastname at pobox dot com. I am happy to be emailed out of the blue by interesting people. If you are an LW regular you are probably an interesting person in the relevant sense even if you think you aren't.

If you're wondering why some of my old posts and comments are at surprisingly negative scores, it's because for some time I was the favourite target of old-LW's resident neoreactionary troll, sockpuppeteer and mass-downvoter.

[Book Review] "The Alignment Problem" by Brian Christian

Thanks! (I would not have guessed correctly.)

[Book Review] "The Alignment Problem" by Brian Christian

It would add some possibly-useful context to this review if you explained *why* you came to it with an axe to grind. (Just as race is both possibly-useful information and a possible source of prejudice to correct for, so also with your prior prejudices about this book.)

My Current Take on Counterfactuals

OK, I get it. (Or at least I think I do.) And, duh, indeed it turns out (as you were too polite to say in so many words) that I was distinctly confused.

So: Using ordinary conditionals in planning your actions commits you to reasoning like "If (here in the actual world it turns out that) I choose to smoke this cigarette, then that makes it more likely that I have the weird genetic anomaly that causes both desire-to-smoke and lung cancer, so I'm more likely to die prematurely and horribly of lung cancer, so I shouldn't smoke it", which makes wrong decisions. So you want to use some sort of conditional that doesn't work that way and rather says something more like "suppose everything about the world up to now is exactly as it is in the actual world, but magically-but-without-the-existence-of-magic-having-consequences I decide to do X; what then?". And *this* is what you're calling decision-theoretic counterfactuals, and the question is exactly what they should be; EDT says no, just use ordinary conditionals, CDT says pretty much what I just said, etc. The "smoking lesion" shows that EDT can give implausible results; "Death in Damascus" shows that CDT can give implausible results; etc.
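A toy calculation may make the contrast concrete. The sketch below is a minimal smoking-lesion model in which every number (gene frequency, smoking rates, cancer rates, utilities) is invented for illustration; the function names are hypothetical too. The "EDT" value conditions on the action in the ordinary way, so choosing to smoke counts as evidence for the gene; the "CDT" value leaves the probability of the gene at its prior, as if the decision were made by intervention.

```python
# Toy smoking-lesion model. All numbers are made-up assumptions.
P_GENE = 0.2                                  # prior probability of the gene
P_SMOKE_GIVEN = {True: 0.9, False: 0.1}       # P(smoke | gene?) in the population
P_CANCER_GIVEN = {True: 0.8, False: 0.05}     # cancer depends on the gene only
U_SMOKE, U_CANCER = 1.0, -100.0               # small pleasure, large harm


def p_gene_given_action(smoke):
    """Ordinary conditional P(gene | action), by Bayes' rule."""
    def likelihood(gene):
        p = P_SMOKE_GIVEN[gene]
        return p if smoke else 1.0 - p
    numerator = likelihood(True) * P_GENE
    return numerator / (numerator + likelihood(False) * (1.0 - P_GENE))


def expected_utility(smoke, p_gene):
    """Expected utility of an action given a probability for the gene."""
    p_cancer = (p_gene * P_CANCER_GIVEN[True]
                + (1.0 - p_gene) * P_CANCER_GIVEN[False])
    return (U_SMOKE if smoke else 0.0) + U_CANCER * p_cancer


def edt_value(smoke):
    # Evidential: the action is treated as evidence about the gene.
    return expected_utility(smoke, p_gene_given_action(smoke))


def cdt_value(smoke):
    # Causal: the action is an intervention, so the gene keeps its prior.
    return expected_utility(smoke, P_GENE)


if __name__ == "__main__":
    for name, value in (("EDT", edt_value), ("CDT", cdt_value)):
        print(name, "smoke:", round(value(True), 2),
              "abstain:", round(value(False), 2))
```

With these particular numbers the evidential calculation prefers abstaining and the causal one prefers smoking, which is the disagreement described above. Nothing here addresses cases like "Death in Damascus", where the causal calculation is the one that looks bad.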

All of which I really should have remembered, since it's all stuff I have known in the past, but I am a doofus. My apologies.

(But my error wasn't being too mired in EDT, or at least I don't think it was; I think EDT is wrong. My error was having the term "counterfactual" too strongly tied in my head to what you call linguistic counterfactuals. Plus not thinking clearly about any of the actual decision theory.)

It still *feels* to me as if your proof-based agents are unrealistically narrow. Sure, they can incorporate whatever beliefs they have about the real world as axioms for their proofs -- but only if those axioms end up being consistent, which means having perfectly consistent beliefs. The beliefs may of course be probabilistic, but then that means that all those beliefs have to have perfectly consistent probabilities assigned to them. Do you really think it's plausible that an agent capable of doing real things in the real world can have perfectly consistent beliefs in this fashion? (I am pretty sure, for instance, that no human being has perfectly consistent beliefs; if any of us tried to do what your proof-based agents are doing, we would arrive at a contradiction -- or fail to do so only because we weren't trying hard enough.) I think "agents that use logic at all *on the basis of beliefs about the world that are perfectly internally consistent*" is a much narrower class than "agents that use logic at all".

(That probably sounds like a criticism, but once again I am extremely aware that it may be that this feels implausible to me only because I am lacking important context, or confused about important things. After all, that was the case last time around. So my question is more "help me resolve my confusion" than "let me point out to you how the stuff you've been studying for ages is wrongheaded", and I appreciate that you may have other more valuable things to do with your time than help to resolve my confusion :-).)

My Current Take on Counterfactuals

I agree that much of what's problematic about the example I gave is that the "inner" counterfactuals are themselves unclear. I was thinking that this makes the nested counterfactual harder to make sense of (exactly because it's unclear what connection there might be between them) but on reflection I think you're right that this isn't really about counterfactual nesting and that if we picked other poorly-defined (non-counterfactual) propositions we'd get a similar effect: "If it were morally wrong to eat shellfish, would humans Really Truly Have Free Will?" or whatever.

I'd not given any thought before to your distinction between linguistic and decision-theoretic counterfactuals. I'm actually not sure I understand the distinction. It's obvious how ordinary conditionals are important for planning and acting (you design a bridge so that it won't fall down *if* someone drives a heavy lorry across it; you don't cross a bridge because you think the troll underneath will eat you *if* you cross), but counterfactuals? I mean, obviously you can *put them in* to a particular problem: you're crossing a bridge and there's a troll who'll blow up the bridge if you would have crossed it if there'd been a warning sign saying "do not cross", or whatever. But that's not counterfactuals being useful for decision theory, it's some agent arbitrarily caring about counterfactuals -- and agents can arbitrarily care about *anything*. (I am not entirely sure I've understood the "Troll Bridge" example you're actually using, but to whatever extent it's about counterfactuals it seems to be of this "agent arbitrarily caring about counterfactuals" type.) The thing you call "proof-based decision theory" involves trying to prove things of the form "if I do X, I will get at least Y utility" but those look like ordinary conditionals rather than counterfactuals to me too. (And in any case the whole idea of doing what you can *rigorously prove from a given set of mathematical axioms* gives you the most guaranteed utility seems bonkers to me anyway as anything other than a toy example, though this is pure prejudice and maybe there are better reasons for it than I can currently imagine: we want agents that can act in the actual world, about which one can generally prove *precisely nothing* of interest.) Could you give a couple of examples where counterfactuals are relevant to planning and acting without having been artificially inserted?

It may just be that none of this should be expected to make sense to someone not already immersed in the particular proof-based-decision-theory framework I think you're working in, and that what I need in order to appreciate where you're coming from is to spend a few hours (days? weeks?) getting familiar with that. At any rate, right now "passing Troll Bridge" looks to me like a problem applicable only to a *very specific kind* of decision-making agent, one I don't see any particular reason to think has any prospect of ever being relevant to decision-making in the actual world -- but I am extremely aware that this may be purely a reflection of my own ignorance.

My Current Take on Counterfactuals

I never found Stalnaker's thesis at all plausible, not because I'd thought of the ingenious little calculation you give but because it just seems *obviously wrong* intuitively. But I suppose if you don't have any presuppositions about what sort of notion an implication is allowed to be, you don't get to reject it on those grounds. So I wasn't really entitled to say "Pr(A|B) is not the same thing as Pr(B=>A) for any particular notion of implication", since I hadn't thought of that calculation.

Anyway, I have just the same sense of *obvious wrongness* about this counterfactual version of Stalnaker. I suspect it's harder to come up with an outright refutation, not least because there isn't anything like general agreement about what C(A|B) means, whereas there's something much nearer to that for Pr(A|B).

At least some "nestings" of counterfactuals feel problematic to me. "Suppose it were true that if Bach had lived to be 90 then Mozart would have died at age 10; then if Dirichlet had lived to be 80, would Jacobi have died at 20?" The antecedent doesn't do much to make clear just what is actually being supposed, and it's not clear that this is made much better if we say instead "Suppose you believe, with credence 0.9, that if Bach had lived to be 90 then Mozart would have died at age 10; then how strongly do you believe that if Dirichlet had lived to be 80 then Jacobi would have died at 20?". But I do think that a good analysis of counterfactuals should allow for questions of this form. (But, just as some conditional probabilities are 0/0 and some others are small/small and we shouldn't trust our estimates much, some counterfactual probabilities are undefined or ill-conditioned. Whether or not they are actually literal ratios.)

My Current Take on Counterfactuals

How confident are you that the "right" counterfactual primitive is something like your C(A|B) meaning (I take it) "if B were the case then A would be the case"?

The alternative I have in mind assimilates counterfactual conditionals to *conditional probabilities* rather than to *logical implications*, so in addition to your existing Pr(A|B)=... meaning "if B is the case, then here's how strongly I expect A to be the case" there's Prc(A|B)=... meaning "if B were the case -- even though that might require the world to be different from how it actually is -- then here's how strongly I expect that A would be the case"?

In some ways this feels more natural to me, and like a better fit for your general observation that we shouldn't expect there to be One True Set Of Counterfactuals, and like a better fit for your suggestion that counterfactual conditions involve something like updating on evidence.

Typical philosophical accounts of counterfactuals say things like: "if B were the case then A would be the case" means that you look at the *nearest possible world* where B is the case, and see whether A holds there; this seems like it involves making a very specific choice too early, and it would be better to look at *nearby possible worlds* where B is the case and see how much of the time A holds. (I am not claiming that possible worlds are The Right Way to approach counterfactuals, just saying that *if* we approach them that way *then* we should probably not be jumping to a single possible world as soon as we consider a counterfactual. Not least because that makes combining different counterfactuals worse than it seems like it needs to be; if "c-if A then B" and "c-if C then D", the "nearest possible world" approach doesn't let us say *anything* about what happens c-if A and C, because the nearest world where A, the nearest world where C, and the nearest world where A&C can all be entirely different. Whereas we might hope that when A and C are sufficiently compatible there'll at least be substantial overlap between the worlds where A, the worlds where C, and the worlds where A&C.)
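As a rough illustration of "nearest world" versus "nearby worlds", here is a minimal sketch under heavy assumptions: worlds are tuples of binary facts, closeness is Hamming distance from the actual world, and nearby worlds are weighted uniformly. None of that comes from the discussion above; the function name `counterfactual_credence` and the `slack` parameter are hypothetical. Following the Prc(A|B) convention, B is the antecedent and A the consequent.

```python
from fractions import Fraction
from itertools import product


def hamming(u, v):
    """Number of facts on which two worlds differ."""
    return sum(a != b for a, b in zip(u, v))


def counterfactual_credence(A, B, actual, slack=0):
    """Fraction of near B-worlds in which A holds.

    slack=0 keeps only the B-worlds at minimal distance from `actual`
    (the "nearest world" idea, with ties averaged); a larger slack also
    admits B-worlds up to that much further away.
    Returns None if no world satisfies B.
    """
    b_worlds = [w for w in product((0, 1), repeat=len(actual)) if B(w)]
    if not b_worlds:
        return None
    nearest = min(hamming(w, actual) for w in b_worlds)
    nearby = [w for w in b_worlds if hamming(w, actual) <= nearest + slack]
    return Fraction(sum(1 for w in nearby if A(w)), len(nearby))


if __name__ == "__main__":
    actual = (0, 0, 0, 0)
    B = lambda w: w[0] == 1      # antecedent: fact 0 is flipped
    A = lambda w: w[1] == 0      # consequent: fact 1 is as in the actual world
    print(counterfactual_credence(A, B, actual, slack=0))
    print(counterfactual_credence(A, B, actual, slack=1))
```

With `slack=0` the answer is all-or-nothing whenever the nearest world is unique; widening the neighbourhood gives a graded answer instead, which is closer in spirit to a conditional probability.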

(I don't think it's enough to think of this in terms of applying already-existing probabilities to propositions like "c-if B then A", just as Pr(A|B) is not the same thing as Pr(B => A) for any particular notion of implication.)
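For the most familiar notion of implication, the material conditional, the gap between Pr(A|B) and Pr(B => A) is easy to exhibit. The sketch below uses a fair six-sided die, with A = "the roll is 6" and B = "the roll is even"; the example is chosen purely for illustration and says nothing about other notions of implication.

```python
from fractions import Fraction

OUTCOMES = range(1, 7)  # a fair six-sided die


def prob(event):
    """Probability of an event (a predicate on outcomes) under the fair die."""
    return Fraction(sum(1 for w in OUTCOMES if event(w)), len(OUTCOMES))


def A(w):
    return w == 6          # "the roll is 6"


def B(w):
    return w % 2 == 0      # "the roll is even"


# Conditional probability Pr(A|B) = Pr(A and B) / Pr(B).
conditional = prob(lambda w: A(w) and B(w)) / prob(B)

# Probability of the material conditional "B => A", i.e. (not B) or A.
material = prob(lambda w: (not B(w)) or A(w))

if __name__ == "__main__":
    print("Pr(A|B)    =", conditional)
    print("Pr(B => A) =", material)
```

The material conditional is true on every odd roll, so its probability is inflated by all the cases where B fails; the conditional probability ignores those cases entirely.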

Open problem: how can we quantify player alignment in 2x2 normal-form games?

I'm not 100% sure I am understanding your terminology. What does it mean to "play stag against (stag,stag)" or to "defect against cooperate/cooperate"?

If your opponent is not in any sense a utility-maximizer then I don't think it makes sense to talk about your opponent's utilities, which means that it doesn't make sense to have a payout matrix denominated in utility, which means that we are not in the situation of my second paragraph above ("The meaning generally assumed in game theory...").

We might be in the situation of my last-but-two paragraph ("Or maybe we're playing a game in which..."): the payouts might be something other than utilities. Dollars, perhaps, or just numbers written on a piece of paper. In that case, all the things I said about that situation apply here. In particular, I agree that it's then reasonable to ask "how aligned is B with A's interests?", but I think this question is largely decoupled from the specific game and is more about the mapping from (A's payout, B's payout) to (A's utility, B's utility).

I guess there are cases where that isn't enough, where A's and/or B's utility is not a function of the payouts alone. Maybe A just likes saying the word "defect". Maybe B likes to be seen as the sort of person who cooperates. Etc. But at this point it feels to me as if we've left behind most of the simplicity and elegance that we might have hoped to bring by adopting the "two-player game in normal form" formalism in the first place, and if you're prepared to consider scenarios where A just likes choosing the top-left cell in a 2x2 array then you also need to consider ones like the ones I described earlier in this paragraph -- where in fact it's *not* just the 2x2 payout matrix that matters but potentially any arbitrary details about what words are used when playing the game, or who is watching, or anything else. So if you're trying to get to the essence of alignment by considering simple 2x2 games, I think it would be best to leave that sort of thing out of it, and in that case my feeling is that your options are (a) to treat the payouts as actual utilities (in which case, once again, I agree with Dagon and think all the alignment information is in the payout matrix), or (b) to treat them as mere utility-function-fodder, but to assume that they're *all* the fodder the utility functions get (in which case, as above, I think *none* of the alignment information is in the payout matrix and it's all in the payouts-to-utilities mapping), or (c) to consider some sort of iterated-game setup (in which case, I think you need to nail down *what* sort of iterated-game setup before asking how to get a measure of alignment out of it).

Open problem: how can we quantify player alignment in 2x2 normal-form games?

I think "X and Y are playing a game of stag hunt" has multiple meanings.

The meaning generally assumed in game theory when considering just a single game is that the outcomes in the game matrix are utilities. In that case, I completely agree with Dagon: if on some occasion you prefer to pick "hare" even though you know I will pick "stag", then *we are not actually playing the stag hunt game*. (Because part of what it *means* to be playing stag hunt rather than some other game is that we both consider (stag,stag) the best outcome.)

But there are some other situations that might be described by saying that X and Y are playing stag hunt.

Maybe we are playing an iterated stag hunt. Then (by definition) what I care about is still some sort of aggregation of per-round outcomes, and (by definition) each round's outcome still has (stag,stag) best for me, etc. -- but now I need to strategize over the whole course of the game, and e.g. maybe I think that on a particular occasion choosing "hare" when you chose "stag" will make you understand that you're being punished for a previous choice of "hare" and make you more likely to choose "stag" in future.

Or maybe we're playing an *iterated* iterated stag hunt. Now maybe I choose "hare" when you chose "stag", knowing that it will make things worse for me over subsequent rounds, but hoping that *other people* looking at our interactions will learn the rule Don't Fuck With Gareth and never, ever choose anything other than "stag" when playing with me.

Or maybe we're playing a game in which the stag hunt matrix describes *some sort of payouts that are not exactly utilities*. E.g., we're in a psychology experiment and the experimenter has shown us a 2x2 table telling us how many *dollars* we will get in various cases -- but maybe I'm a billionaire and literally don't care whether I get $1 or $10 and figure I might as well try to maximize *your* payout, or maybe you're a perfect altruist and (in the absence of any knowledge about our financial situations) you just want to maximize the total take, or maybe I'm actually evil and want you to do as badly as possible.

In the iterated cases, it seems to me that the payout matrix still determines alignment *given the iteration context* -- how many games, with what opponents, with what aggregation of per-round utilities to yield overall utility (in prospect or in retrospect; the former may involve temporal discounting too). If I don't consider a long string of (stag,stag) games optimal then, again, we are not really playing (iterated) stag hunt.

In the payouts-aren't-really-utilities case, I think it *does* make sense to ask about the players' alignment, in terms of how they translate payouts into utilities. But ... it feels to me as if this is now basically separate from the actual game itself: the thing we might want to map to a measure of alignedness is something like the function from (both players' payouts) to (both players' utilities). The choice of game may then affect how far *unaligned players imply unaligned actions*, though. (In a game with (cooperate,defect) options where "cooperate" is always much better for the player choosing it than "defect", the payouts->utilities function would need to be badly anti-aligned, with players actively preferring to harm one another, in order to get uncooperative actions; in a prisoners' dilemma, it suffices that it not be strongly *aligned*; each player can slightly prefer the other to do better but still choose defection.)
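The parenthetical claim can be checked on a toy payouts-to-utilities mapping. The sketch below assumes a one-parameter family, utility = own payout + alpha × other's payout, which is only one of many possible mappings; the two payout tables and the name `dominant_action` are likewise invented for illustration. Positive alpha means a player wants the other to do well, negative alpha means active malice.

```python
def utility(own, other, alpha):
    """Hypothetical mapping from payouts to utility; alpha is 'alignment'."""
    return own + alpha * other


def dominant_action(payouts, alpha):
    """Return 'C' or 'D' if that action is strictly better against both
    opposing actions, else None. payouts[(mine, theirs)] = (my payout, theirs)."""
    def u(mine, theirs):
        own, other = payouts[(mine, theirs)]
        return utility(own, other, alpha)
    if all(u("D", t) > u("C", t) for t in "CD"):
        return "D"
    if all(u("C", t) > u("D", t) for t in "CD"):
        return "C"
    return None


# A standard prisoners' dilemma in payouts (symmetric, so one table suffices).
PRISONERS_DILEMMA = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

# A game where cooperating is always much better for the chooser:
# 10 for cooperating yourself, plus 1 if the other player cooperates.
EASY_COOPERATION = {
    ("C", "C"): (11, 11), ("C", "D"): (10, 1),
    ("D", "C"): (1, 10), ("D", "D"): (0, 0),
}

if __name__ == "__main__":
    for alpha in (0.1, 1.0):
        print("PD, alpha =", alpha, "->", dominant_action(PRISONERS_DILEMMA, alpha))
    for alpha in (-1.0, -11.0):
        print("easy game, alpha =", alpha, "->",
              dominant_action(EASY_COOPERATION, alpha))
```

In the prisoners' dilemma table a mildly positive alpha still leaves defection dominant, whereas in the second table defection only becomes dominant once alpha is strongly negative. The payout matrix sets how much misalignment is needed before it shows up in actions, while the misalignment itself lives in alpha.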

Topological Fixed Point Exercises

Inappropriately highbrow proof of #4 (2d Sperner's lemma):

This proves a generalization: any number of dimensions, and any triangulation of the simplex in question. So, the setup is as follows. We have an n-dimensional simplex, defined by n+1 points in n-dimensional space. We colour the vertices with n+1 different colours. Then we triangulate it -- chop it up into smaller simplices -- and we extend our colouring somehow in such a way that the vertices on any face (note: a face is the thing spanned by any subset of the vertices) of the big simplex are coloured using only the colours from the vertices that span that face. And the task is to prove that there are an odd number of little simplices whose vertices have all n+1 colours.

This colouring defines a map from the vertices of the triangulation to the vertices of the big simplex: map each triangulation-vertex to the simplex-vertex that's the same colour. We can extend this map to the rest of each little simplex by linear interpolation. The resulting thing is continuous on the whole of the big simplex, so we have a continuous map (call it *f*) from the big simplex to itself. And we want to prove that we have an odd number of little simplices whose image under *f* spans the whole thing. (Call these "good" simplices.)

We'll do it with two ingredients. The easy one is *induction*: when proving this in n dimensions we shall assume we already proved it for smaller numbers of dimensions. The harder one is *homology*, a standard tool in algebraic topology. More precisely we'll do *homology mod 2*. It associates with each topological space X and each dimension d an abelian group H_d(X), and the key things you need to know are (1) that if you have f : X -> Y then you get an associated group homomorphism f* : H_d(X) -> H_d(Y), (2) that H_d(a simplex) is the cyclic group of order 2 if d=0, and the trivial group otherwise, and (3) that H_d(the boundary of a simplex) is the cyclic group of order 2 if d=0 or d = (dimension of simplex - 1) and the trivial group otherwise. Oh, and one other crucial thing: if you have f : X -> Y and g : Y -> Z then (gf)* = g*f*: composition of maps between topological spaces corresponds to composition of homomorphisms between their homology groups.

(You can do homology "over" any commutative ring. The groups you get are actually modules over that ring. It happens that the ring of integers mod 2 is what we want to use. A simplex is, topologically, the same thing as a ball, and its boundary the same thing as a sphere.)

OK. So, first of all suppose not only that the number of good simplices isn't odd, but that it's actually zero. Then *f* maps the whole of our simplex to its boundary. Let's also consider the rather boring map *g* from the boundary to the whole simplex that just leaves every point where it is. Now, if the thing we're trying to prove is true in lower dimensions then in particular the map *fg* -- start on the boundary of the simplex, stay where you are using *g*, and then map to the boundary of the simplex again using *f* -- has an image that, so to speak, covers each boundary face of the simplex an odd number of times. This guarantees -- sorry, I'm eliding some details here -- that (fg)* (from the cyclic group of order 2 to the cyclic group of order 2) doesn't map everything to the identity. But that's impossible, because (fg)* = f*g* and the map g* maps to H_{n-1}(whole simplex), which is the trivial group.
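The contradiction can be written as a short diagram. The symbols Δ for the big simplex and ∂Δ for its boundary are introduced here only for illustration, and the sketch assumes n ≥ 2 so that the middle group really is trivial. The composite "first *g*, then *f*" induces, in dimension n-1 with coefficients mod 2:

```latex
\[
  H_{n-1}(\partial\Delta)
  \;\xrightarrow{\;g_*\;}\;
  H_{n-1}(\Delta) = 0
  \;\xrightarrow{\;f_*\;}\;
  H_{n-1}(\partial\Delta),
  \qquad
  H_{n-1}(\partial\Delta) \cong \mathbb{Z}/2 .
\]
```

Any homomorphism that factors through the trivial group is the zero map, so the composite cannot be the nonzero map from the cyclic group of order 2 to itself that the inductive hypothesis is supposed to provide.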

Unfortunately, what we actually need to assume in order to prove this thing by contradiction is something weaker: merely that the number of good simplices is even. We can basically do the same thing, because homology mod 2 "can't see" things that happen an even number of times, but to see that we need to look a bit further into how homology works. I'm not going to lay it all out here, but the idea is that to build the H_d(X) we begin with a space of things called "chains" which are like linear combinations (in this case over the field with two elements) of bits of X, we define a "boundary" operator which takes combinations of d-dimensional bits of X and turns them into combinations of (d-1)-dimensional bits in such a way that the boundary of the boundary of anything is always zero, and then we define H_d(X) as a quotient object: (d-dimensional things with zero boundary) / (boundaries of (d+1)-dimensional things). Then the way we go from *f* (a map of topological spaces) to *f** (a homomorphism of homology groups) is that *f* extends in a natural way to a map between chains, and then it turns out that this map interacts with the boundary operator in the "right" way for this to yield a map between homology groups. And (getting, finally, to the point) if in our situation the number of good simplices is even, then this means that the map of chains corresponding to *f* sends anything in n dimensions to zero (essentially because it means that the interior of the simplex gets covered *an even number of times* and when working mod 2, even numbers are zero), which means that we can think of *f** as mapping not to the homology groups of the whole simplex but to those of its boundary -- and then the argument above goes through the same as before.
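For reference, the construction sketched in words above is usually written as follows. The notation is the standard textbook one rather than anything used in this comment: C_d(X) for the mod-2 chains, ∂_d for the boundary operator, and f_# for the chain map induced by *f*.

```latex
\[
  \cdots \xrightarrow{\;\partial_{d+1}\;} C_d(X)
         \xrightarrow{\;\partial_d\;} C_{d-1}(X)
         \xrightarrow{\;\partial_{d-1}\;} \cdots,
  \qquad
  \partial_d \circ \partial_{d+1} = 0,
\]
\[
  H_d(X) \;=\; \ker \partial_d \,/\, \operatorname{im} \partial_{d+1},
  \qquad
  \partial_d \circ f_\# \;=\; f_\# \circ \partial_d .
\]
```

The last identity is the "right" interaction with the boundary operator: it ensures that f_# sends cycles to cycles and boundaries to boundaries, and hence descends to a well-defined homomorphism on the quotient.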

I apologize for the handwaving above. (Specifically, the sentence beginning "This guarantees".) If you're familiar with this stuff, it will be apparent how to fill in the details. If not, trying to fill them in will only add to the pain of what's already too long a comment :-).

This is clearly much too much machinery to use here. I suspect that if we took the argument above, figured out *exactly* what bits of machinery it uses, and then optimized ruthlessly we might end up with a neat purely-combinatorial proof, but I regret that I am too lazy to try right now.
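Short of a proof, the 2d statement can at least be checked by brute force. The sketch below makes several assumptions not in the text above: it uses only the standard triangulation of a triangle into n² small triangles on a lattice, it generates random colourings that respect the face condition, and all function names are hypothetical. It then counts the small triangles carrying all three colours.

```python
import random


def random_sperner_colouring(n, rng):
    """Random colouring of lattice points (i, j), i + j <= n.

    Barycentric coordinates are (i, j, n - i - j); a point may only use
    colour k if its k-th coordinate is positive, which is exactly the
    condition that points on a face use only that face's colours.
    """
    colour = {}
    for i in range(n + 1):
        for j in range(n + 1 - i):
            bary = (i, j, n - i - j)
            allowed = [k for k in range(3) if bary[k] > 0]
            colour[(i, j)] = rng.choice(allowed)
    return colour


def small_triangles(n):
    """Yield the n*n small triangles of the standard triangulation."""
    for i in range(n):
        for j in range(n - i):
            yield (i, j), (i + 1, j), (i, j + 1)                  # "upward"
            if i + j <= n - 2:
                yield (i + 1, j), (i, j + 1), (i + 1, j + 1)      # "downward"


def count_good(n, colour):
    """Number of small triangles whose vertices show all three colours."""
    return sum(1 for tri in small_triangles(n)
               if {colour[v] for v in tri} == {0, 1, 2})


if __name__ == "__main__":
    rng = random.Random(0)
    for n in range(1, 9):
        counts = [count_good(n, random_sperner_colouring(n, rng))
                  for _ in range(200)]
        print(n, all(c % 2 == 1 for c in counts))
```

A check like this is evidence rather than proof, and it covers only one family of triangulations, but it is a quick way to catch a misstatement of the colouring condition.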

The linked article is interesting, and also suggests that it's not as simple as

because the issue isn't simply "our system sometimes misclassifies people as animals", it's "our system sometimes misclassifies people as animals, and one not-so-rare case of this happens to line up with an incredibly offensive old racist slur" -- and that last bit is a subtle fact about human affairs that there's no possible way the system could have learned from looking at labelled samples of images. The dataset *had* a good mix of races in it; humans *do* look rather like other great apes; in the absence of the long horrible history of racism this misclassification might have been benign. To do better the system would need, in some sense, to *know about racism*. Maybe the best one can do really is something like artificially forbidding classifications like "ape" and "gorilla" and "monkey" unless the activations for classifications like "human" are very very low, at least until we have an image classifier that's genuinely intelligent and understands human history.
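The "artificially forbidding" idea could look something like the sketch below. It is not a description of any real system: the label sets, the threshold value, the dictionary-of-activations interface and the name `guarded_label` are all assumptions made for illustration.

```python
# Labels suppressed whenever a person might be in the image
# (the three named in the comment above).
FORBIDDEN_LABELS = {"ape", "gorilla", "monkey"}
HUMAN_LABELS = {"human"}


def guarded_label(activations, human_threshold=0.01):
    """Pick the top label, skipping forbidden ones unless every human
    label's activation is below `human_threshold` (a made-up value).

    `activations` maps label -> score. Returns None if nothing is allowed.
    """
    human_score = max(
        (activations.get(label, 0.0) for label in HUMAN_LABELS),
        default=0.0,
    )
    ranked = sorted(activations, key=activations.get, reverse=True)
    for label in ranked:
        if label in FORBIDDEN_LABELS and human_score >= human_threshold:
            continue  # a person is at all plausible, so skip this label
        return label
    return None


if __name__ == "__main__":
    print(guarded_label({"gorilla": 0.6, "human": 0.3, "tree": 0.1}))
    print(guarded_label({"gorilla": 0.9, "human": 0.001, "tree": 0.05}))
```

A filter like this only handles the cases someone thought to list in advance, so it patches over the missing knowledge rather than supplying it.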

(There are probably a lot of other misclassifications that are anomalously offensive, though few will have the weight of centuries of slavery and racist abuse behind them. Fixing them would *also* require the system to "know" about details of human history, and again the best one can do might be to push down certain activations when it's at all possible that the image might be of people.)