# All of abramdemski's Comments + Replies

Refactoring Alignment (attempt #2)

Seems fair. I'm similarly conflicted. In truth, both the generalization-focused path and the objective-focused path look a bit doomed to me.

Re-Define Intent Alignment?

Re-Define Intent Alignment?

I would further add that looking for difficulties created by the simplification seems very intellectually productive. (Solving "embedded agency problems" seems to genuinely allow you to do new things, rather than just soothing philosophical worries.) But yeah, I would agree that if we're defining mesa-objective anyway, we're already in the business of assuming some agent/environment boundary.

1Edouard Harris1moYep, strongly agree. And a good first step to doing this is to actually build as robust a simplification as you can, and then see where it breaks. (Working on it.)
Re-Define Intent Alignment?

(see the unidentifiability in IRL paper)

Ah, I wasn't aware of this!

Btw, if you're aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can't carve up a world, or recover a consistent utility function through this sort of process — please let me know. I'm directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.

I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/worl... (read more)

1Edouard Harris1moOh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a second boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally real object here is always going to be the whole system, absolutely. As I understand, something like AIXI forces you to draw one particular boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I'd agree that sort of thing is more fragile. The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than 2 pieces. Again it would be neat to investigate a setting like this with different choices of boundaries and see if some choices have more interesting properties than others.
Re-Define Intent Alignment?

Right, exactly. (I should probably have just referred to that, but I was trying to avoid reference-dumping.)

Refactoring Alignment (attempt #2)

I pretty strongly endorse the new diagram with the pseudo-equivalences, with one caveat (much the same comment as on your last post)... I think it's a mistake to think of only mesa-optimizers as having "intent" or being "goal-oriented" unless we start to be more inclusive about what we mean by "mesa-optimizer" and "mesa-objective." I don't think those terms as defined in RFLO actually capture humans, but I definitely want to say that we're "goal-oriented" and have "intent."

But the graph structure makes perfect sense, I just am doing the mental substitution

3Jack Koch1moThis sounds reasonable and similar to the kinds of ideas for understanding agents' goals as cognitively implemented that I've been exploring recently. The funny thing is I am actually very unsatisfied with a purely behavioral notion of a model's objective, since a deceptive model would obviously externally appear to be a non-deceptive model in training. I just don't think there will be one part of the network we can point to and clearly interpret as being some objective function that the rest of the system's activity is optimizing. Even though I am partial to the generalization focused approach (in part because it kind of widens the goal posts with the "acceptability" vs. "give the model exactly the correct goal" thing), I still would like to have a more cognitive understanding of a system's "goals" because that seems like one of the best ways to make good predictions about how the system will generalize under distributional shift. I'm not against assuming some kind of explicit representation of goal content within a system (for sufficiently powerful systems); I'm just against assuming that that content will look like a mesa-objective as originally defined.
Refactoring Alignment (attempt #2)

Maybe a very practical question about the diagram: is there a REASON for there to be no "sufficient together" linkage from "Intent Alignment" and "Robustness" up to "Behavioral Alignment"?

Leaning hard on my technical definitions:

• Robustness: Performing well on the base objective in a wide range of circumstances.
• Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)

These two together do not quite imply behavioral alignment, becau... (read more)

Refactoring Alignment (attempt #2)

I think there's another reason why factorization can be useful here, which is the articulation of sub-problems to try.

For example, in the process leading up to inventing logical induction, Scott came up with a bunch of smaller properties to try for. He invented systems which got desirable properties individually, then growing combinations of desirable properties, and finally, figured out how to get everything at once. However, logical induction doesn't have parts corresponding to those different subproblems.

It can be very useful to individually achieve, sa... (read more)

Re-Define Intent Alignment?

I agree that we need a notion of "intent" that doesn't require a purely behavioral notion of a model's objectives, but I think it should also not be limited strictly to mesa-optimizers, which neither Rohin nor I expect to appear in practice. (Mesa-optimizers appear to me to be the formalization of the idea "what if ML systems, which by default are not well-described as EU maximizers, learned to be EU maximizers?" I suspect MIRI people have some unshared intuitions about why we might expect this, but I currently don't have a good reason to believe this.)

1Jack Koch1moIs this related to your post An Orthodox Case Against Utility Functions [https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions] ? It's been on my to-read list for a while; I'll be sure to give it a look now.
Re-Define Intent Alignment?

They can't? Why not?

I meant to invoke a no-free-lunch type intuition; we can always construct worlds where some particular tool isn't useful.

My go-to would be "a world that checks what an InfraBayesian would expect, and does the opposite". This is enough for the narrow point I was trying to make (that InfraBayes does express some kind of regularity assumption about the world), but it's not very illustrative or compelling for my broader point (that InfraBayes plausibly addresses your concerns about learning theory). So I'll try to tell a better stor... (read more)

2Rohin Shah1moSorry, I meant that that was my central complaint about existing theoretical work that is trying to explain neural net generalization. (I was mostly thinking of work outside of the alignment community.) I wasn't trying to make a claim about all theoretical work. It's my central complaint because we ~know that such an assumption is necessary (since the same neural net that generalizes well on real MNIST can also memorize a randomly labeled MNIST where it will obviously fail to generalize). I feel pretty convinced by this :) In particular the assumption on the real world could be something like "there exists a partial model that describes the real world well enough that we can prove a regret bound that is not vacuous" or something like that. And I agree this seems like a reasonable assumption. Tbc I would see this as a success. I am interested! I listed it [https://www.alignmentforum.org/posts/vayxfTSQEDtwhPGpW/refactoring-alignment-attempt-2?commentId=duttLykdBAa4rMD7K] as one of the topics I saw as allowing us to make claims about objective robustness. I'm just saying that the current work doesn't seem to be making much progress (I agree now though that InfraBayes is plausibly on a path where it could eventually help). Fwiw I don't feel the force of this intuition, they seem about equally surprising (but I agree with you that it doesn't seem cruxy).
Re-Define Intent Alignment?

No such thing is possible in reality, as an agent cannot exist without its environment, so why shouldn't we talk about the mesa-objective being over a perturbation set, too, just that it has to be some function of the model's internal features?

This makes some sense, but I don't generally trust some "perturbation set" to in fact capture the distributional shift which will be important in the real world. There has to at least be some statement that the perturbation set is actually quite broad. But I get the feeling that if we could make the right statement there, we would understand the problem in enough detail that we might have a very different framing. So, I'm not sure what to do here.

Refactoring Alignment (attempt #2)

Great! I feel like we're making progress on these basic definitions.

Re-Define Intent Alignment?

InfraBayes doesn't look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about "what kind of regularity assumptions can we realistically make about reality?" You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question "what useful+realistic regularity assumptions could we make?"

The InfraBayesian answer is "partial models". IE, the idea that even if reality cannot be completely described by usable models, pe... (read more)

2Rohin Shah2moThey can't? Why not? Maybe the "usefully" part is doing a lot of work here -- can all worlds be described (perhaps not usefully) by partial models? If so, I think I have the same objection, since it doesn't seem like any of the technical results in InfraBayes depend on some notion of "usefulness". (I think it's pretty likely I'm just flat out wrong about something here, given how little I've thought about InfraBayesianism, but if so I'd like to know how I'm wrong.)
Refactoring Alignment (attempt #2)

I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that's done.

But it seems to me that there's something missing in terms of acceptability.

The definition of "objective robustness" I used says "aligns with the base objective" (including off-distribution). But I think this isn't an appropriate representation of your approach. Rather, "objective robustness" has to be defined something like "generalizes acceptably". Then, ideas like adversarial training and checks and balances make sense as ... (read more)

2Rohin Shah2moYeah, strong +1.
Re-Define Intent Alignment?

All of that made perfect sense once I thought through it, and I tend to agree with most it. I think my biggest disagreement with you is that (in your talk) you said you don't expect formal learning theory work to be relevant. I agree with your points about classical learning theory, but the alignment community has been developing basically-classical-learning-theory tools which go beyond those limitations. I'm optimistic that stuff like Vanessa's InfraBayes could help here.

Granted, there's a big question of whether that kind of thing can be competitive. (Although there could potentially be a hybrid approach.)

3Rohin Shah2moMy central complaint about existing theoretical work is that it doesn't seem to be trying to explain why neural nets learn good programs that generalize well, even when they have enough parameters to overfit and can fit a randomly labeled dataset. It seems like you need to make some assumption about the real world (i.e. an assumption about your dataset, or the training process that generated it), which people seem loathe to do. I don't currently see how any of the alignment community's tools address that complaint; for example I don't think the InfraBayes work so far is making an interesting assumption about reality. Perhaps future work will address this though?
Re-Define Intent Alignment?

I've watched your talk at SERI now.

One question I have is how you hope to define a good notion of "acceptable" without a notion of intent. In your talk, you mention looking at why the model does what it does, in addition to just looking at what it does. This makes sense to me (I talk about similar things), but, it seems just about as fraught as the notion of mesa-objective:

1. It requires approximately the same "magic transparency tech" as we need to extract mesa-objectives.
2. Even with magical transparency tech, it requires additional insight as to which reasoni
3Rohin Shah2moI don't hope this; I expect to use a version of "acceptable" that uses intent. I'm happy with "acceptable" = "trying to do what we want". I'm pessimistic about mesa-objectives existing in actual systems, based on how people normally seem to use the term "mesa-objective". If you instead just say that a "mesa objective" is "whatever the system is trying to do", without attempting to cash it out as some simple utility function that is being maximized, or the output of a particular neuron in the neural net, etc, then that seems fine to me. One other way in which "acceptability" is better is that rather than require it of all inputs, you can require it of all inputs that are reasonably likely to occur in practice, or something along those lines. (And this is what I expect we'll have to do in practice given that I don't expect to fully mechanistically understand a large neural network; the "all inputs" should really be thought of as a goal we're striving towards.) Whereas I don't see how you do this with a mesa-objective (as the term is normally used); it seems like a mesa-objective must apply on any input, or else it isn't a mesa-objective. I'm mostly not trying to make claims about which one is easier to do; rather I'm saying "we're using the wrong concepts; these concepts won't apply to the systems we actually build; here are some other concepts that will work".
Re-Define Intent Alignment?

I originally conceived of it as such, but in hindsight, it doesn't seem right.

In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional.

I don't think this is actually a con of the generalization-focused approach.

By no means did I intend it to be a con. I'll try to edit to clarify. I think it is a real pro of the generalization-focused approach that it does not rely on models having mesa-objectives (putting it in Evan's terms, there is a real poss... (read more)

Are you the historical origin of the robustness-centric approach?

Idk, probably? It's always hard for me to tell; so much of what I do is just read what other people say and make the ideas sound sane to me. But stuff I've done that's relevant:

• Talk at CHAI saying something like "daemons are just distributional shift" in August 2018, I think. (I remember Scott attending it.)
• Talk at FHI in February 2020 that emphasized a risk model where objectives generalize but capabilities don't.
• Talk at SERI conference a few months ago that explicitly argued for a focus on
Discussion: Objective Robustness and Inner Alignment Terminology

If there were a "curated posts" system on the alignment forum, I would nominate this for curation. I think it's a great post.

My Current Take on Counterfactuals

All of which I really should have remembered, since it's all stuff I have known in the past, but I am a doofus. My apologies.

(But my error wasn't being too mired in EDT, or at least I don't think it was; I think EDT is wrong. My error was having the term "counterfactual" too strongly tied in my head to what you call linguistic counterfactuals. Plus not thinking clearly about any of the actual decision theory.)

I'm glad I pointed out the difference between linguistic and DT counterfactuals, then!

It still feels to me as if your proof-based agents are unrealis

My Current Take on Counterfactuals

It's obvious how ordinary conditionals are important for planning and acting (you design a bridge so that it won't fall down if someone drives a heavy lorry across it; you don't cross a bridge because you think the troll underneath will eat you if you cross), but counterfactuals? I mean, obviously you can put them in to a particular problem

All the various reasoning behind a decision could involve material conditionals, probabilistic conditionals, logical implication, linguistic conditionals (whatever those are), linguistic counterfactuals, decision-theoret... (read more)

1gjm2moOK, I get it. (Or at least I think I do.) And, duh, indeed it turns out (as you were too polite to say in so many words) that I was distinctly confused. So: Using ordinary conditionals in planning your actions commits you to reasoning like "If (here in the actual world it turns out that) I choose to smoke this cigarette, then that makes it more likely that I have the weird genetic anomaly that causes both desire-to-smoke and lung cancer, so I'm more likely to die prematurely and horribly of lung cancer, so I shouldn't smoke it", which makes wrong decisions. So you want to use some sort of conditional that doesn't work that way and rather says something more like "suppose everything about the world up to now is exactly as it is in the actual world, but magically-but-without-the-existence-of-magic-having-consequences I decide to do X; what then?". And this is what you're calling decision-theoretic counterfactuals, and the question is exactly what they should be; EDT says no, just use ordinary conditionals, CDT says pretty much what I just said, etc. The "smoking lesion" shows that EDT can give implausible results; "Death in Damascus" shows that CDT can give implausible results; etc. All of which I really should have remembered, since it's all stuff I have known in the past, but I am a doofus. My apologies. (But my error wasn't being too mired in EDT, or at least I don't think it was; I think EDT is wrong. My error was having the term "counterfactual" too strongly tied in my head to what you call linguistic counterfactuals. Plus not thinking clearly about any of the actual decision theory.) It still feels to me as if your proof-based agents are unrealistically narrow. Sure, they can incorporate whatever beliefs they have about the real world as axioms for their proofs -- but only if those axioms end up being consistent, which means having perfectly consistent beliefs. The beliefs may of course be probabilistic, but then that means that all those beliefs have to hav
Decision Theory

Agreed. The asymmetry needs to come from the source code for the agent.

In the simple version I gave, the asymmetry comes from the fact that the agent checks for a proof that x>y before checking for a proof that y>x. If this was reversed, then as you said, the Lobian reasoning would make the agent take the 10, instead of the 5.

In a less simple version, this could be implicit in the proof search procedure. For example, the agent could wait for any proof of the conclusion x>y or y>x, and make a decision based on whichever happened first. Then ther... (read more)

Decision Theory

While I agree that the algorithm might output 5, I don't share the intuition that it's something that wasn't 'supposed' to happen, so I'm not sure what problem it was meant to demonstrate.

OK, this makes sense to me. Instead of your (A) and (B), I would offer the following two useful interpretations:

1: From a design perspective, the algorithm chooses 5 when 10 is better. I'm not saying it has "computed argmax incorrectly" (as in your A); an agent design isn't supposed to compute argmax (argmax would be insufficient to solve this problem, because we're not g... (read more)

Decision Theory

Yep, agreed. I used the language "false antecedents" mainly because I was copying the language in the comment I replied to, but I really had in mind "demonstrably false antecedents".

My Current Take on Counterfactuals

Yeah, interesting. I don't share your intuition that nested counterfactuals seem funny. The example you give doesn't seem ill-defined due to the nesting of counterfactuals. Rather, the antecedent doesn't seem very related to the consequent, which generally has a tendency to make counterfactuals ambiguous. If you ask "if calcium were always ionic, would Nixon have been elected president?" then I'm torn between three responses:

1. "No" because if we change chemistry, everything changes.
2. "Yes" because counterfactuals keep everything the same as much as possible, e

I agree that much of what's problematic about the example I gave is that the "inner" counterfactuals are themselves unclear. I was thinking that this makes the nested counterfactual harder to make sense of (exactly because it's unclear what connection there might be between them) but on reflection I think you're right that this isn't really about counterfactual nesting and that if we picked other poorly-defined (non-counterfactual) propositions we'd get a similar effect: "If it were morally wrong to eat shellfish, would humans Really Truly Have Free Will?"... (read more)

Decision Theory

Hmm. I'm not following. It seems like you follow the chain of reasoning and agree with the conclusion:

The algorithm doesn't try to select an assignment with largest , but rather just outputs  if there's a valid assignment with , and  otherwise. Only  fulfills the condition, so it outputs .

This is exactly the point: it outputs 5. That's bad! But the agent as written will look perfectly reasonable to anyone who has not thought about the spurious proof problem. So, we want general tools to avoid t... (read more)

3Ian Televan2moWhile I agree that the algorithm might output 5, I don't share the intuition that it's something that wasn't 'supposed' to happen, so I'm not sure what problem it was meant to demonstrate. I thought of a few ways to interpret it, but I'm not sure which one, if any, was the intended interpretation: a) The algorithm is defined to compute argmax, but it doesn't output argmax because of false antecedents. - but I would say that it's not actually defined to compute argmax, therefore the fact that it doesn't output argmax is not a problem. b) Regardless of the output, the algorithm uses reasoning from false antecedents, which seems nonsensical from the perspective of someone who uses intuitive conditionals, which impedes its reasoning. - it may indeed seem nonsensical, but if 'seeming nonsensical' doesn't actually impede its ability to select actions wich highest utility (when it's actually defined to compute argmax), then I would say that it's also not a problem. Furthermore, wouldn't MUDT be perfectly satisfied with the tuplep1:(x=0,y=10,A() =10,U()=10)? It also uses 'nonsensical' reasoning 'A()=5 => U()=0' but still outputs action with highest utility. c) Even when the use of false antecedents doesn't impede its reasoning, the way it arrives at its conclusions is counterintuitive to humans, which means that we're more likely to make a catastrophic mistake when reasoning about how the agent reasons. - Maybe? I don't have access to other people's intuitions, but when I read the example, I didn't have any intuitive feeling of what the algorithm would do, so instead I just calculated all assignments(x,y)∈{0,5,10}2, eliminated all inconsistent ones and proceeded from there. And this issue wouldn't be unique to false antecedents, there are other perfectly valid pieces of logic that might nonetheless seem counterintuitive to humans, for example the puzzle with islanders and blue eyes [https://xkcd.com/blue_eyes.html]. --------------------------------------------------
0TAG2moNo, it's contradictory assumptions. False but consistent assumptions are dual to consistent-and-true assumptions...so you can only infer a mutually consistent set of propositions from either. To put it another way, a formal system has no way of knowing what would be true or false for reasons outside itself, so it has no way of reacting to a merely false statement. But a contradiction is definable within a formal system. To.put it yet another way... contradiction in, contradiction out
My Current Take on Counterfactuals

Ah, I wasn't strongly differentiating between the two, and was actually leaning toward your proposal in my mind. The reason I was not differentiating between the two was that the probability of C(A|B) behaves a lot like the probabilistic value of Prc(A|B). I wasn't thinking of nearby-world semantics or anything like that (and would contrast my proposal with such a proposal), so I'm not sure whether the C(A|B) notation carries any important baggage beyond that. However, I admit it could be an important distinction; C(A|B) is itself a proposition, which can ... (read more)

I never found Stalnaker's thesis at all plausible, not because I'd thought of the ingenious little calculation you give but because it just seems obviously wrong intuitively. But I suppose if you don't have any presuppositions about what sort of notion an implication is allowed to be, you don't get to reject it on those grounds. So I wasn't really entitled to say "Pr(A|B) is not the same thing as Pr(B=>A) for any particular notion of implication", since I hadn't thought of that calculation.

Anyway, I have just the same sense of obvious wrongness about th... (read more)

An Intuitive Guide to Garrabrant Induction

I should! But I've got a lot of things to write up!

It also needs a better name, as there have been several things termed "weak logical induction" over time.

The Credit Assignment Problem
• In between … well … in between, we're navigating treacherous waters …

Right, I basically agree with this picture. I might revise it a little:

• Early, the AGI is too dumb to hack its epistemics (provided we don't give it easy ways to do so!).
• In the middle, there's a danger zone.
• When the AGI is pretty smart, it sees why one should be cautious about such things, and it also sees why any modifications should probably be in pursuit of truthfulness (because true beliefs are a convergent instrumental goal) as opposed to other reasons.
• When the AGI is really smart, it
My Current Take on Counterfactuals

I don't believe that LI provides such a Pareto improvement, but I suspect that there's a broader theory which contains the two.

Overall, I place much less weight on arguments that revolve around the presumed nature of human values compared to arguments grounded in abstract reasoning about rational agents.

Ah. I was going for the human-values argument because I thought you might not appreciate the rational-agent argument. After all, who cares what general rational agents can value, if human values happen to be well-represented by infrabayes?

But for general ra... (read more)

My Current Take on Counterfactuals

I agree inasmuch as we actually can model this sort of preferences, for a sufficiently strong meaning of "model". I feel that it's much harder to be confident about any detailed claim about human values than about the validity of a generic theory of rationality. Therefore, if the ultimate generic theory of rationality imposes some conditions on utility functions (while still leaving a very rich space of different utility functions), that will lead me to try formalizing human values within those constraints. Of course, given a candidate theory, we should po

1Vanessa Kosoy3moI would be convinced if you had a theory of rationality that is a Pareto improvement on IB (i.e. has all the good properties of IB + a more general class of utility functions). However, LI doesn't provide this AFAICT. That said, I would be interested to see some rigorous theorem about LIDT solving procrastination-like problems. As to philosophical deliberation, I feel some appeal in this point of view, but I can also easily entertain a different point of view: namely, that human values are more or less fixed and well-defined whereas philosophical deliberation is just a "show" for game theory reasons. Overall, I place much less weight on arguments that revolve around the presumed nature of human values compared to arguments grounded in abstract reasoning about rational agents.
My Current Take on Counterfactuals

If PA is consistent, then the agent cannot prove U = -10 (or anything else inconsistent) under the assumption that the agent already crossed, and therefore Löb's theorem fails to apply. In this case, there is no weird certainty that crossing is doomed.

I think this is the wrong step. Why do you think this? Just because PA is consistent doesn't mean you can't prove weird things under assumption. Look at the structure of the proof. You're objecting to an assumption. ("Suppose PA proves that crossing -> U=-10") That's a pretty weird way to object to a proof. I'm allowed to make any assumptions I like.

My guess is that you are wrestling with Lobs theorem itself. Lobs theorem is pretty weird!

Speculations against GPT-n writing alignment papers

It seems to me that the last paragraph should update you to thinking that this plan is no worse than the default. IE: yes, this plan creates additional risk because there are complicated pathways a malign gpt-n could use to get arbitrary code run on a big computer. But if people are giving it that chance anyway, it does seem like a small increase in risk with a large potential gain. (Small, not zero, for the chance that your specific gpt-n instance somehow becomes malign when others are safe, eg if something about the task actually activated a subtle malignancy not present during other tasks).

So for me a crux would be, if it's not malign, how good could we expect the papers to actually be?

An Intuitive Guide to Garrabrant Induction

First, I'm not sure exactly why you think this is bad. Care to say more? My guess is that it just doesn't fit the intuitive notion that updates should be heading toward some state of maximal knowledge. But we do fit this intuition in other ways; specifically, logical inductors eventually trust their future opinions more than their present opinions.

Personally, I found this result puzzling but far from damning.

Second, I've actually done some unpublished work on this. There is a variation of the logical induction criterion which is more relaxed (admits more t... (read more)

2Vladimir Slepnev3moInteresting! Can you write up the WLIC, here or in a separate post?
My AGI Threat Model: Misaligned Model-Based RL Agent

So it's still in the observation-utility paradigm I think, or at least it seems to me that it doesn't have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn't have to. In the human example, some people are hedonists, but others aren't.

All sounds perfectly reasonable. I just hope you recognize that it's all a big mess (because it's difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or an... (read more)

1Steve Byrnes3moYup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump [https://www.lesswrong.com/posts/Gfw7JMdKirxeSPiAk/solving-the-whole-agi-control-problem-version-0-0001] , I guess.) In particular, just like prosaic AGI alignment, my starting point is not "Building this kind of AGI is a great idea", but rather "This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I'm correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there's some chance they'll succeed whether we like it or not." Thanks, that's helpful. One way I think I would frame the problem differently than you here is: I'm happy to talk about outer and inner alignment for pedagogical purposes, but I think it's overly constraining as a framework for solving the problem. For example, (Paul-style) corrigibility is I think an attempt to cut through outer and inner alignment simultaneously, as is interpretability perhaps. And like you say, rewards don't need to be the only type of feedback. We can also set up the AGI to NOOP when the expected value of some action is <0, rather than having it always take the least bad action. (...And then don't use it in time-sensitive situations! But that's fine for working with humans to build better-aligned AGIs.) So then the goal would be something like "every catastrophic action has expected value <0 as assessed by the AGI (and also, the AGI will not be motivated to self-modify or create successors, at least not in a way that undermines that property) (and also, the AGI is sufficiently capable that it can do alignment research etc., as opposed to it sitting around NOOPing all day)". So then this could look like a pretty weirdly misaligned AGI but it has a really effective "may-lead-to-catastrophe (directly or indirectly) predictor circuit" attached. (The circuit asks "Does it pattern-match
My Current Take on Counterfactuals

OK, so, here is a question.

The abstract theory of InfraBayes (like the abstract theory of Bayes) elides computational concerns.

In reality, all of ML can more or less be thought of as using a big search for good models, where "good" means something approximately like MAP, although we can also consider more sophisticated variational targets. This introduces two different types of approximation:

1. The optimization target is approximate.
2. The optimization itself gives only approximate maxima.

What we want out of InfraBayes is a bounded regret guarantee (in settings ... (read more)

My hope is that we will eventually have computationally feasible algorithms that satisfy provable (or at least conjectured) infra-Bayesian regret bounds for some sufficiently rich hypothesis space. Currently, even in the Bayesian case, we only have such algorithms for poor hypothesis spaces, such as MDPs with a small number of states. We can also rule out such algorithms for some large hypothesis spaces, such as short programs with a fixed polynomial-time bound. In between, there should be some hypothesis space which is small enough to be feasible and rich... (read more)

My Current Take on Counterfactuals

What I'm referring to is that LI given a notion of rational uncertain expectation for the procrastination paradox -- so, less a positive result, more a framework for thinking about what behavior is reasonable.

However, I also think LIDT solves the problem in practical terms:

• In the pure procrastination-paradox problem, LIDT will eventually push the button if its logic is sound. If it did not, it would mean the conditional probability of ever pressing the button given not pressing it today remains forever higher than the conditional probability of ever pressi
1Vanessa Kosoy3moWhat is LIDT exactly? I can try to guess but I rather make sure we're both talking about the same thing. I agree inasmuch as we actually can model this sort of preferences, for a sufficiently strong meaning of "model". I feel that it's much harder to be confident about any detailed claim about human values than about the validity of a generic theory of rationality. Therefore, if the ultimate generic theory of rationality imposes some conditions on utility functions (while still leaving a very rich space of different utility functions), that will lead me to try formalizing human values within those constraints. Of course, given a candidate theory, we should poke around and see whether it can be extended to weaken the constraints.
Formal Inner Alignment, Prospectus

Just want to note that although it's been a week this is still in my thoughts, and I intend to get around to continuing this conversation... but possibly not for another two weeks.

Formal Inner Alignment, Prospectus

I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice".

A couple of separate points:

• My main worry continues to be the way bad actors have control over an io cha
2michaelcohen4moA few quick thoughts, and I'll get back to the other stuff later. That's good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms--i.e. it is not even a theoretical solution--then there is no bound on how strongly you might reasonably conclude that it is fruitless. So this kind of meta point I was making only applied to your objections about slowdown in practice. I only meant to claim I was just doing theory in a context that lacks the realizability problem, not that I had solved the realizability problem! But yes, I see what you're saying. The theory regards a "fair" demonstrator which does not depend on the operation of the computer. There are probably multiple perspectives about what level of "theoretical" that setting is. I would contend that in practice, the computer itself is not among the most complex and important causal ancestors of the demonstrator's behavior, so this doesn't present a huge challenge for practically arriving at a good model. But that's a whole can of worms. Okay good, this worry makes much more sense to me.
My Current Take on Counterfactuals

The continuity property is really important.

Formal Inner Alignment, Prospectus

Thanks for the extensive reply, and sorry for not getting around to it as quickly as I replied to some other things!

I am sorry for the critical framing, in that it would have been more awesome to get a thought-dumb of ideas for research directions from you, rather than a detailed defense of your existing work. But of course existing work must be judged, and I felt I had remained quiet about my disagreement with you for too long.

Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for M

2michaelcohen4moHaha that's fine. If you don't voice your objections, I can't respond to them! I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice". Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! But it also seems like a clear problem, maybe even tractable. See Taylor (2016) section 2.1--inductive ambiguity identification. If you were convinced that AGI will be made of neural networks, you could say that I have reduced the problem of inner alignment to the problem of diverse-model-extraction from a neural network, perhaps allowing a few modifications to training (if you bought that the claim that the consensus algorithm is a theoretical solution). I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks; my ideal situation would be that I figure out how to do something in theory, and then 50 people try to work on analogizing it to state-of-the-art AI (there are many more neural network experts out there than AIXI experts). My less ideal situation is that people provisionally treat the theoretical solution as a dead end, right up until the very point that a practical version is demonstrated. If it seemed like solving inner alignment in theory was easy (because allowing yourself an agent with the wherewithal to consider "unrealistic" models is such a boon), and there were thus lots of theoretical sol
Formal Inner Alignment, Prospectus

No, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head...

Ah, ok. It sounds like I have been systematically mis-perceiving you in this respect.

By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation.

I would have been much more interested in your posts in the past if you had emphasized this ... (read more)

Formal Inner Alignment, Prospectus

What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.

That's one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set

Formal Inner Alignment, Prospectus
• Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).

Could you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply.

• It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sens
My Current Take on Counterfactuals

Now I have another question: how does logical induction arbitrage against contradiction? The bet on a pays $1 if a is proved. The bet on ~a pays$1 if not-a is proved. But the bet on ~a isn't "settled" when a is proved - why can't the market just go on believing its .7? (Likely this is related to my confusion with the paper).

Again, my view may have drifted a bit from the LI paper, but the way I think about this is that the market maker looks at the minimum amount of money a trader has "in any world" (in the sense described in my other comment). This exclud... (read more)

My Current Take on Counterfactuals

On each day, the reasoner receives 50¢ from T, but after day t, the reasoner must pay $1 every day thereafter. Hm. It's a bit complicated and there are several possible ways to set things up. Reading that paragraph, I'm not sure about this sentence either. In the version I was trying to explain, where traders are "forced to sell" every morning before the day of trading begins, the reasoner would receive 50¢ from the trader every day, but would return that money next morning. Also, in the version I was describing, the reasoner is forced to set the price to$1... (read more)

1Bunthut4moThinking about this in detail, it seems like what influence traders have on the market price depends on a lot more of their inner workings than just their beliefs. I was thinking in a way where each trader only had one price for the bet, below which they bought and above which they sold, no matter how many units they traded (this might contradict "continuous trading strategies" because of finite wealth), in which case there would be a range of prices that could be the "market" price, and it could stay constant even with one end of that range shifting. But there could also be an outcome like yours, if the agents demand better and better prices to trade one more unit of the bet.
My Current Take on Counterfactuals

I'm also sceptical of optimality results. When you're doing subjective probability, any method you come up with will be proven optimal relative to its own prior - the difference between different subjective methods is only in their ontology, and the optimality results don't protect you against mistakes there. Also, when you're doing subjectivism, and it turns out the methods required to reach some optimality condition aren't subjectively optimal, you say "Don't be a stupid frequentist and do the subjectively optimal thing instead". So, your bottom line is

My Current Take on Counterfactuals

What makes you think that theres a "right" prior? You want a "good" learning mechanism for counterfactuals. To be good, such a mechanism would have to learn to make the inferences we consider good, at least with the "right" prior. But we can't pinpoint any wrong inference in Troll Bridge. It doesn't seem like whats stopping us from pinpointing the mistake in Troll Bridge is a lack of empirical data. So, a good mechanism would have to learn to be susceptible to Troll Bridge, especially with the "right" prior. I just don't see what would be a good reason for