All of abramdemski's Comments + Replies

My Current Take on Counterfactuals

I don't believe that LI provides such a Pareto improvement, but I suspect that there's a broader theory which contains the two.

Overall, I place much less weight on arguments that revolve around the presumed nature of human values compared to arguments grounded in abstract reasoning about rational agents.

Ah. I was going for the human-values argument because I thought you might not appreciate the rational-agent argument. After all, who cares what general rational agents can value, if human values happen to be well-represented by infrabayes?

But for general ra...

My Current Take on Counterfactuals

I agree inasmuch as we actually can model this sort of preferences, for a sufficiently strong meaning of "model". I feel that it's much harder to be confident about any detailed claim about human values than about the validity of a generic theory of rationality. Therefore, if the ultimate generic theory of rationality imposes some conditions on utility functions (while still leaving a very rich space of different utility functions), that will lead me to try formalizing human values within those constraints. Of course, given a candidate theory, we should po

1Vanessa Kosoy12dI would be convinced if you had a theory of rationality that is a Pareto improvement on IB (i.e. has all the good properties of IB + a more general class of utility functions). However, LI doesn't provide this AFAICT. That said, I would be interested to see some rigorous theorem about LIDT solving procrastination-like problems. As to philosophical deliberation, I feel some appeal in this point of view, but I can also easily entertain a different point of view: namely, that human values are more or less fixed and well-defined whereas philosophical deliberation is just a "show" for game theory reasons. Overall, I place much less weight on arguments that revolve around the presumed nature of human values compared to arguments grounded in abstract reasoning about rational agents.
My Current Take on Counterfactuals

If PA is consistent, then the agent cannot prove U = -10 (or anything else inconsistent) under the assumption that the agent already crossed, and therefore Löb's theorem fails to apply. In this case, there is no weird certainty that crossing is doomed.

I think this is the wrong step. Why do you think this? Just because PA is consistent doesn't mean you can't prove weird things under assumption. Look at the structure of the proof. You're objecting to an assumption. ("Suppose PA proves that crossing -> U=-10") That's a pretty weird way to object to a proof. I'm allowed to make any assumptions I like.

My guess is that you are wrestling with Löb's theorem itself. Löb's theorem is pretty weird!
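
For reference, here is Löb's theorem as it is usually stated, with $\Box P$ read as "PA proves $P$". The second line is a sketch of how the Troll Bridge argument appears to use it, based on the assumption quoted above ("Suppose PA proves that crossing -> U=-10"). The symbol $C$ for "the agent crosses" is introduced here for illustration and does not appear in the original comment.

```latex
\text{If } \mathrm{PA} \vdash \Box P \rightarrow P, \text{ then } \mathrm{PA} \vdash P. \\
\text{With } P := (C \rightarrow U = -10):\quad
\mathrm{PA} \vdash \Box(C \rightarrow U = -10) \rightarrow (C \rightarrow U = -10)
\;\Longrightarrow\;
\mathrm{PA} \vdash C \rightarrow U = -10.
```

The point of the reply is that the hypothesis $\Box P$ is only assumed inside the proof, so the consistency of PA does not by itself block the derivation.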

Speculations against GPT-n writing alignment papers

It seems to me that the last paragraph should update you to thinking that this plan is no worse than the default. I.e.: yes, this plan creates additional risk because there are complicated pathways a malign GPT-n could use to get arbitrary code run on a big computer. But if people are giving it that chance anyway, it does seem like a small increase in risk with a large potential gain. (Small, not zero, for the chance that your specific GPT-n instance somehow becomes malign when others are safe, e.g. if something about the task actually activated a subtle malignancy not present during other tasks.)

So for me a crux would be, if it's not malign, how good could we expect the papers to actually be?

An Intuitive Guide to Garrabrant Induction

First, I'm not sure exactly why you think this is bad. Care to say more? My guess is that it just doesn't fit the intuitive notion that updates should be heading toward some state of maximal knowledge. But we do fit this intuition in other ways; specifically, logical inductors eventually trust their future opinions more than their present opinions.

Personally, I found this result puzzling but far from damning.

Second, I've actually done some unpublished work on this. There is a variation of the logical induction criterion which is more relaxed (admits more t...

2Vladimir Slepnev19dInteresting! Can you write up the WLIC, here or in a separate post?
My AGI Threat Model: Misaligned Model-Based RL Agent

So it's still in the observation-utility paradigm I think, or at least it seems to me that it doesn't have an automatic incentive to wirehead. It could want to wirehead, if the value function winds up seeing wireheading as desirable for any reason, but it doesn't have to. In the human example, some people are hedonists, but others aren't.

All sounds perfectly reasonable. I just hope you recognize that it's all a big mess (because it's difficult to see how to provide evidence in a way which will, at least eventually, rule out the wireheading hypothesis or an...

1Steve Byrnes19dYup! This was a state-the-problem-not-solve-it post. (The companion solving-the-problem post is this brain dump [https://www.lesswrong.com/posts/Gfw7JMdKirxeSPiAk/solving-the-whole-agi-control-problem-version-0-0001] , I guess.) In particular, just like prosaic AGI alignment, my starting point is not "Building this kind of AGI is a great idea", but rather "This is a way to build AGI that could really actually work capabilities-wise (especially insofar as I'm correct that the human brain works along these lines), and that people are actively working on (in both ML and neuroscience), and we should assume there's some chance they'll succeed whether we like it or not." Thanks, that's helpful. One way I think I would frame the problem differently than you here is: I'm happy to talk about outer and inner alignment for pedagogical purposes, but I think it's overly constraining as a framework for solving the problem. For example, (Paul-style) corrigibility is I think an attempt to cut through outer and inner alignment simultaneously, as is interpretability perhaps. And like you say, rewards don't need to be the only type of feedback. We can also set up the AGI to NOOP when the expected value of some action is <0, rather than having it always take the least bad action. (...And then don't use it in time-sensitive situations! But that's fine for working with humans to build better-aligned AGIs.) So then the goal would be something like "every catastrophic action has expected value <0 as assessed by the AGI (and also, the AGI will not be motivated to self-modify or create successors, at least not in a way that undermines that property) (and also, the AGI is sufficiently capable that it can do alignment research etc., as opposed to it sitting around NOOPing all day)". So then this could look like a pretty weirdly misaligned AGI but it has a really effective "may-lead-to-catastrophe (directly or indirectly) predictor circuit" attached. 
(The circuit asks "Does it pattern-match
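
A minimal sketch of the NOOP rule described in the reply above: take the best action only if its expected value is non-negative, and otherwise do nothing instead of taking the least bad action. The function name `choose_action`, the `NOOP` constant, and the dictionary of precomputed expected values are all illustrative assumptions. The reply does not specify how the AGI would compute these values.

```python
from typing import Dict

# Hypothetical sentinel for "do nothing"; not taken from the original comment.
NOOP = "NOOP"


def choose_action(expected_values: Dict[str, float]) -> str:
    """Return the highest-expected-value action, or NOOP if every action is < 0.

    `expected_values` maps action names to the agent's own (assumed given)
    estimate of each action's expected value.
    """
    if not expected_values:
        return NOOP
    best_action = max(expected_values, key=expected_values.get)
    # Refuse to take the "least bad" action when even the best one is negative.
    if expected_values[best_action] < 0:
        return NOOP
    return best_action
```

Under this rule, safety depends entirely on catastrophic actions being scored below zero, which is the property the reply says the goal would have to secure.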
My Current Take on Counterfactuals

OK, so, here is a question.

The abstract theory of InfraBayes (like the abstract theory of Bayes) elides computational concerns.

In reality, all of ML can more or less be thought of as using a big search for good models, where "good" means something approximately like MAP, although we can also consider more sophisticated variational targets. This introduces two different types of approximation:

1. The optimization target is approximate.
2. The optimization itself gives only approximate maxima.
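
The two kinds of approximation can be made concrete with a toy example. This is only a sketch under strong assumptions: the hypothesis space is a small grid of coin biases, the prior is uniform, the "approximate target" is a score computed from a subsample of the data, and the "approximate optimization" is random search over a few candidates. None of these choices come from the original comment.

```python
import math
import random


def log_posterior(bias: float, heads: int, tails: int) -> float:
    """Unnormalized log posterior of a coin bias under a uniform prior."""
    return heads * math.log(bias) + tails * math.log(1.0 - bias)


def exact_map(candidates, heads, tails):
    """Idealized MAP: exact target, exhaustive search over all candidates."""
    return max(candidates, key=lambda b: log_posterior(b, heads, tails))


def approximate_map(candidates, heads_seen, tails_seen, n_tries, seed=0):
    """MAP with both approximations.

    1. Approximate target: the score uses only a subsample of the data
       (heads_seen, tails_seen) instead of the full counts.
    2. Approximate optimization: only n_tries randomly drawn candidates are
       scored, so the result need not be the true maximizer.
    """
    rng = random.Random(seed)
    tried = [rng.choice(candidates) for _ in range(n_tries)]
    return max(tried, key=lambda b: log_posterior(b, heads_seen, tails_seen))


if __name__ == "__main__":
    grid = [k / 10 for k in range(1, 10)]
    print("exact MAP:", exact_map(grid, heads=7, tails=3))
    print("approx MAP:", approximate_map(grid, heads_seen=2, tails_seen=1, n_tries=3))
```

Any regret guarantee stated for `exact_map` does not automatically transfer to `approximate_map`, which is the gap the question is pointing at.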

What we want out of InfraBayes is a bounded regret guarantee (in settings ...

My hope is that we will eventually have computationally feasible algorithms that satisfy provable (or at least conjectured) infra-Bayesian regret bounds for some sufficiently rich hypothesis space. Currently, even in the Bayesian case, we only have such algorithms for poor hypothesis spaces, such as MDPs with a small number of states. We can also rule out such algorithms for some large hypothesis spaces, such as short programs with a fixed polynomial-time bound. In between, there should be some hypothesis space which is small enough to be feasible and rich...

My Current Take on Counterfactuals

What I'm referring to is that LI gives a notion of rational uncertain expectation for the procrastination paradox -- so, less a positive result, more a framework for thinking about what behavior is reasonable.

However, I also think LIDT solves the problem in practical terms:

• In the pure procrastination-paradox problem, LIDT will eventually push the button if its logic is sound. If it did not, it would mean the conditional probability of ever pressing the button given not pressing it today remains forever higher than the conditional probability of ever pressi
1Vanessa Kosoy20dWhat is LIDT exactly? I can try to guess but I'd rather make sure we're both talking about the same thing. I agree inasmuch as we actually can model this sort of preferences, for a sufficiently strong meaning of "model". I feel that it's much harder to be confident about any detailed claim about human values than about the validity of a generic theory of rationality. Therefore, if the ultimate generic theory of rationality imposes some conditions on utility functions (while still leaving a very rich space of different utility functions), that will lead me to try formalizing human values within those constraints. Of course, given a candidate theory, we should poke around and see whether it can be extended to weaken the constraints.
Formal Inner Alignment, Prospectus

Just want to note that although it's been a week this is still in my thoughts, and I intend to get around to continuing this conversation... but possibly not for another two weeks.

Formal Inner Alignment, Prospectus

I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice".

A couple of separate points:

• My main worry continues to be the way bad actors have control over an io cha
2michaelcohen1moA few quick thoughts, and I'll get back to the other stuff later. That's good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms--i.e. it is not even a theoretical solution--then there is no bound on how strongly you might reasonably conclude that it is fruitless. So this kind of meta point I was making only applied to your objections about slowdown in practice. I only meant to claim I was just doing theory in a context that lacks the realizability problem, not that I had solved the realizability problem! But yes, I see what you're saying. The theory regards a "fair" demonstrator which does not depend on the operation of the computer. There are probably multiple perspectives about what level of "theoretical" that setting is. I would contend that in practice, the computer itself is not among the most complex and important causal ancestors of the demonstrator's behavior, so this doesn't present a huge challenge for practically arriving at a good model. But that's a whole can of worms. Okay good, this worry makes much more sense to me.
My Current Take on Counterfactuals

The continuity property is really important.

Formal Inner Alignment, Prospectus

Thanks for the extensive reply, and sorry for not getting around to it as quickly as I replied to some other things!

I am sorry for the critical framing, in that it would have been more awesome to get a thought-dump of ideas for research directions from you, rather than a detailed defense of your existing work. But of course existing work must be judged, and I felt I had remained quiet about my disagreement with you for too long.

Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for M

2michaelcohen1moHaha that's fine. If you don't voice your objections, I can't respond to them! I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice". Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! But it also seems like a clear problem, maybe even tractable. See Taylor (2016) section 2.1--inductive ambiguity identification. If you were convinced that AGI will be made of neural networks, you could say that I have reduced the problem of inner alignment to the problem of diverse-model-extraction from a neural network, perhaps allowing a few modifications to training (if you bought the claim that the consensus algorithm is a theoretical solution). I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks; my ideal situation would be that I figure out how to do something in theory, and then 50 people try to work on analogizing it to state-of-the-art AI (there are many more neural network experts out there than AIXI experts). My less ideal situation is that people provisionally treat the theoretical solution as a dead end, right up until the very point that a practical version is demonstrated. 
If it seemed like solving inner alignment in theory was easy (because allowing yourself an agent with the wherewithal to consider "unrealistic" models is such a boon), and there were thus lots of theoretical sol
Formal Inner Alignment, Prospectus

No, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head...

Ah, ok. It sounds like I have been systematically mis-perceiving you in this respect.

By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation.

I would have been much more interested in your posts in the past if you had emphasized this ...

Formal Inner Alignment, Prospectus

What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.

That's one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set

Formal Inner Alignment, Prospectus
• Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).

Could you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply.

• It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sens
My Current Take on Counterfactuals

Now I have another question: how does logical induction arbitrage against contradiction? The bet on a pays $1 if a is proved. The bet on ~a pays $1 if not-a is proved. But the bet on ~a isn't "settled" when a is proved - why can't the market just go on believing its .7? (Likely this is related to my confusion with the paper).

Again, my view may have drifted a bit from the LI paper, but the way I think about this is that the market maker looks at the minimum amount of money a trader has "in any world" (in the sense described in my other comment). This exclud...
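
A toy illustration of the "minimum amount of money in any world" idea, as a sketch only. It assumes a single sentence a, two propositionally consistent worlds (a true, a false), shares that pay $1 when their sentence holds in that world, and a trader who sells one share each of a and ~a. The helper names `worst_case_value` and `sell_both` are made up for this example and are not from the logical induction paper.

```python
def worst_case_value(cash: float, holdings: dict) -> float:
    """Minimum net worth of a portfolio across the two consistent worlds.

    `holdings` maps "a" and "not_a" to the number of shares held
    (negative numbers mean shares sold short).
    """
    values = []
    for a_is_true in (True, False):
        # In each world exactly one of the two shares pays out $1.
        payout = {
            "a": 1.0 if a_is_true else 0.0,
            "not_a": 0.0 if a_is_true else 1.0,
        }
        net = cash + sum(shares * payout[s] for s, shares in holdings.items())
        values.append(net)
    return min(values)


def sell_both(price_a: float, price_not_a: float) -> float:
    """Worst-case profit from selling one share of a and one share of ~a."""
    cash = price_a + price_not_a          # money received for the two shares
    holdings = {"a": -1.0, "not_a": -1.0}  # the trader owes one payout per world
    return worst_case_value(cash, holdings)


if __name__ == "__main__":
    # If both a and ~a are priced at 0.7, selling both locks in about $0.40
    # in every consistent world, before anything is proved.
    print(sell_both(0.7, 0.7))
```

On this reading, a market that keeps pricing ~a at 0.7 alongside a high price for a is exploitable without any bet ever being "settled", because the worst case is already positive.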

My Current Take on Counterfactuals

On each day, the reasoner receives 50¢ from T, but after day t, the reasoner must pay $1 every day thereafter. Hm. It's a bit complicated and there are several possible ways to set things up. Reading that paragraph, I'm not sure about this sentence either. In the version I was trying to explain, where traders are "forced to sell" every morning before the day of trading begins, the reasoner would receive 50¢ from the trader every day, but would return that money next morning. Also, in the version I was describing, the reasoner is forced to set the price to $1...

1Bunthut1moThinking about this in detail, it seems like what influence traders have on the market price depends on a lot more of their inner workings than just their beliefs. I was thinking in a way where each trader only had one price for the bet, below which they bought and above which they sold, no matter how many units they traded (this might contradict "continuous trading strategies" because of finite wealth), in which case there would be a range of prices that could be the "market" price, and it could stay constant even with one end of that range shifting. But there could also be an outcome like yours, if the agents demand better and better prices to trade one more unit of the bet.
My Current Take on Counterfactuals

I'm also sceptical of optimality results. When you're doing subjective probability, any method you come up with will be proven optimal relative to its own prior - the difference between different subjective methods is only in their ontology, and the optimality results don't protect you against mistakes there. Also, when you're doing subjectivism, and it turns out the methods required to reach some optimality condition aren't subjectively optimal, you say "Don't be a stupid frequentist and do the subjectively optimal thing instead". So, your bottom line is

My Current Take on Counterfactuals

What makes you think that there's a "right" prior? You want a "good" learning mechanism for counterfactuals. To be good, such a mechanism would have to learn to make the inferences we consider good, at least with the "right" prior. But we can't pinpoint any wrong inference in Troll Bridge. It doesn't seem like what's stopping us from pinpointing the mistake in Troll Bridge is a lack of empirical data. So, a good mechanism would have to learn to be susceptible to Troll Bridge, especially with the "right" prior. I just don't see what would be a good reason for

Formal Inner Alignment, Prospectus

To me, the post as written seems like enough to spell out my optimism... there are multiple directions for formal work which seem under-explored to me. Well, I suppose I didn't focus on explaining why things seem under-explored. Hopefully the writeup-to-come will make that clear.

Formal Inner Alignment, Prospectus

I agree with much of this. I over-sold the "absence of negative story" story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, "mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?" -- and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the excep...

2Richard Ngo1moI like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time). I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail - but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it's easier to tell where any disagreement lies.
My Current Take on Counterfactuals

If a sentence is undecidable, then you could have two traders who disagree on its value indefinitely: one would have a highest price to buy, that's below the other's lowest price to sell. But then anything between those two prices could be the "market price", in the classical supply and demand sense. If you say that the "payout" of a share is what you can sell it for... well, the "physical causation" trader is also buying shares on the counterfactual option that won't happen. And if he had to sell those, he couldn't sell them at a price close to where he bou

1Bunthut1moSo I've reread the logical induction paper [https://intelligence.org/files/LogicalInduction.pdf] for this, and I'm not sure I understand exploitation. Under 3.5, it says: So this sounds like before day t, T buys a share every day, and those shares never pay out - otherwise T would receive $t on day t in addition to everything mentioned here. Why? In the version that I have in my head, there's a market with PCH and LCH in it that assigns constant price to the unactualised bet, so neither of them gain or lose anything with their trades on it, and LCH exploits PCH on the actualised ones. So if I'm understanding this correctly: The conditional contract on (a|b) pays if a&b is proved, if a&~b is proved, and if ~a&~b is proved. Now I have another question: how does logical induction arbitrage against contradiction? The bet on a pays $1 if a is proved. The bet on ~a pays $1 if not-a is proved. But the bet on ~a isn't "settled" when a is proved - why can't the market just go on believing its .7? (Likely this is related to my confusion with the paper). What makes you think that there's a "right" prior? You want a "good" learning mechanism for counterfactuals. To be good, such a mechanism would have to learn to make the inferences we consider good, at least with the "right" prior. But we can't pinpoint any wrong inference in Troll Bridge. It doesn't seem like what's stopping us from pinpointing the mistake in Troll Bridge is a lack of empirical data. So, a good mechanism would have to learn to be susceptible to Troll Bridge, especially with the "right" prior. I just don't see what would be a good reason for thinking there's a "right" prior that avoids Troll Bridge (other than "there just has to be some way of avoiding it"), that wouldn't also let us tell directly how to think about Troll Bridge, no learning needed.
Formal Inner Alignment, Prospectus

From your section 'the formal problem', I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.

It's interesting that you think of treacherous turns as automatically reward hacking. I would differentiate reward hacking as cases where the treacherous turn is executed with the intention of taking over control of reward. In general, treacherous turns can be based on arbitrary goals. A fully inner-aligned system can engage in reward hacking.

It seems to me that the po

Formal Inner Alignment, Prospectus

I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of its ingredients. I guess I'm just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs... (Of course I just wind up with a different set of unsolved AGI safety problems instead...)

Wait, you think your prosaic story doesn't involve blind search over a super-broad space of models??

I think any prosaic story involves blind se...

3Steve Byrnes1moNo, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head... Like, I tend to emphasize the overlap between my brain-like AGI story and prosaic AI. There is plenty of overlap. Like they both involve "neural nets", and (something like) gradient descent, and RL, etc. By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation. In that case, yes there's a search over a model space, because we need to find the (more complicated cousin of a) PGM world-model. But I don't think that model space affords the same opportunities for mischief that you would get in, say, a 100-layer DNN. Not having thought about it too hard... :-P
Formal Inner Alignment, Prospectus

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).

If the universal prior were benign but NNs were still potentially malign, I think I would argue strongly against the use of NNs and in fa...

Formal Inner Alignment, Prospectus

I also believe that we're not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees.

Even with a significantly improved definition of goal-directedness, I think we'd be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that env...

1Adam Shimi1moOh, definitely. I think a better definition of goal-directedness is a prerequisite to be able to do that, so it's only the first step. That being said, I think I'm more optimistic than you on the result, for a couple of reasons: * One way I imagine the use of a definition of goal-directedness is to filter against very goal-directed systems. A good definition (if it's possible) should clarify whether low goal-directed systems can be competitive, as well as the consequences of different parts and aspects of goal-directedness. You can see that as a sort of analogy to the complexity penalties, although it might risk being similarly uncompetitive. * One hope with a definition we can actually toy with is to find some properties of the environments and the behavior of the systems that 1) capture a lot of the information we care about and 2) are easy to abstract. Something like what Alex has done for his POWER-seeking results, where the relevant aspect of the environment are the symmetries it contains. * Even arguing for your point, that evaluating goals and/or goal-directedness of actual NNs would be really hard, is made easier by a deconfused notion of goal-directedness. What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those. That's one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition. Maybe irrelevant, but this makes me think of the problem with defining average complexity in complexity theory. You can prove things for some
Formal Inner Alignment, Prospectus

That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its trainin

1Steve Byrnes1moThat's fair. Other possible approaches are "try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so", or "interpretability that looks for the AGI imagining dangerous adversarial intelligences". I guess the fact that people don't tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible - like, that maybe there's a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actually is dangerous. But hard to say what's gonna work, if anything, at least at my current stage of general ignorance about the overall training process.
Formal Inner Alignment, Prospectus

Trying to lay this disagreement out plainly:

According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).

According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don...

This is a good summary.

I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:

• It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
• The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and malignity of the universal
Formal Inner Alignment, Prospectus

However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all.

The way I'm currently thinking of things, I would say the reverse also applies in this case.

We can turn optimization-under-uncertainty into well-defined optimization by assuming ...

Formal Inner Alignment, Prospectus

So, I think I could write a much longer response to this (perhaps another post), but I'm more or less not persuaded that problems should be cut up the way you say.

As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn't be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly "the internal objective doesn't match the external objective" and outer alignment problems are roughly "the outer objective doesn't meet our needs/goals", then there's no reason why these h...

4johnswentworth1moI buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty, although I'm still not sure exactly what that would mean. In other words: if a problem is both, then it is useful to think of it as an outer alignment problem (because that part has to be solved regardless), and not also inner alignment (because only a narrower version of that part necessarily has to be solved). In the Dr Nefarious example, the outer misalignment causes the inner misalignment in some important sense - correcting the outer problem fixes the inner problem, but patching the inner problem would leave an outer objective which still isn't what we want. I'd be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem. I'm not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn't well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point "X" and "Y" at some things in the real world. That's why the objective function cannot be meaningfully separated from the data/prior: "f(X, Y)" doesn't mean anything, by itself.
But I could imagine the po
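
For concreteness, here is a minimal sketch of the conversion mentioned in the comment above: taking an expectation (or a max/min) turns an objective over uncertain worlds into a deterministic optimization problem. The payoff table, the prior, and every function name are hypothetical, made up purely for illustration; none of it comes from the original posts.

```python
from typing import Callable, Dict, List

# Hypothetical payoffs: PAYOFF[x][w] is how good choice x is in possible world w.
PAYOFF: Dict[str, List[float]] = {
    "safe": [3.0, 3.0, 3.0],
    "risky": [0.0, 4.0, 8.0],
}
# Hypothetical prior over the three possible worlds (an assumption, not data).
PRIOR: List[float] = [0.2, 0.3, 0.5]


def expected(payoffs: List[float]) -> float:
    """Collapse the uncertainty by averaging over worlds under PRIOR."""
    return sum(p * u for p, u in zip(PRIOR, payoffs))


def worst_case(payoffs: List[float]) -> float:
    """Collapse the uncertainty by taking the minimum; ignores PRIOR entirely."""
    return min(payoffs)


def best_choice(score: Callable[[List[float]], float]) -> str:
    """Solve the resulting deterministic problem: pick the top-scoring choice."""
    return max(PAYOFF, key=lambda x: score(PAYOFF[x]))


print(best_choice(expected))
print(best_choice(worst_case))
```

With these made-up numbers the two reductions disagree: the expectation favors "risky" while the worst case favors "safe". That is one way to see why the prior and the choice of reduction end up counted as part of the objective once the problem is made well-defined.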
Formal Inner Alignment, Prospectus

Right, but John is disagreeing with Evan's frame, and John's argument that such-and-such problems aren't inner alignment problems is that they are outer alignment problems.

Formal Inner Alignment, Prospectus

This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (i.e., the Dr. Nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment! Objective quality vs search quality is a nice dividing line, but it doesn't cluster together the problems people have been trying to cluster together.

Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment!

Evan actually wrote a post to explain that it isn't the complement for him (and not the compliment either :p)

Formal Inner Alignment, Prospectus

I admit that I'm more excited by doing this because you're asking it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).

Thanks!

I'm not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments and the bias of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point.

Right. By "no connecti... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

I have not properly read all of that yet, but my very quick take is that your argument for a need for online learning strikes me as similar to your argument against the classic inner alignment problem applying to the architectures you are interested in. You find what I call mesa-learning implausible for the same reasons you find mesa-optimization implausible.

Personally, I've come around to the position (seemingly held pretty strongly by other folks, eg Rohin) that mesa-learning is practically inevitable for most tasks.

My AGI Threat Model: Misaligned Model-Based RL Agent

For "access to the reward function", we need to predict what the reward function will do (which may involve hard-to-predict things like "the human will be pleased with what I've done"). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a "reward function model", and the thing-that-predicts-summed-rewards the "value function", and then to change "the value function may be different from the reward function" to "the value function may be different from the expected sum of rewards". Something like that?

Ah, that wasn't qu... (read more)

1Steve Byrnes21dHi again, I finally got around to reading those links, thanks! I think what you're saying (and you can correct me) is: observation-utility agents are safer (or at least less dangerous) than reward-maximizers-learning-the-reward, because the former avoids falling prey to what you called [https://www.lesswrong.com/posts/5bd75cc58225bf06703754b3/stable-pointers-to-value-an-agent-embedded-in-its-own-utility-function] "the easy problem of wireheading". So then the context was: First you said, If we do rollouts to decide what to do, then the value function is pointless, assuming we have access to the reward function. Then I replied, We don't have access to the reward function, because we can't perfectly predict what will happen in a complicated world. Then you said, That's bad because that means we're not in the observation-utility paradigm. But I don't think that's right, or at least not in the way I was thinking of it. We're using the current value function to decide which rollouts are good vs bad, and therefore to decide which action to take. So my "value function" is kinda playing the role of a utility function (albeit messier), and my "reward function" is kinda playing the role of "an external entity that swoops in from time to time and edits the utility function". Like, if the agent is doing terrible things, then some credit-assignment subroutine goes into the value function, looks at what is currently motivating the agent, and sets that thing to not be motivating in the future. The closest utility function analogy would be: you're trying to make an agent with a complicated opaque utility function (because it's a complicated world). You can't write the utility function down. So instead you code up an automated utility-function-editing subroutine. 
The way the subroutine works is that sometimes the agent does something which we recognize as bad / good, and then the subroutine edits the utility function to assign lower / higher utility to "things like that" in
My AGI Threat Model: Misaligned Model-Based RL Agent

Hmm. I guess I have this ambiguous thing where I'm not specifying whether the value function is "valuing" world-states, or actions, or plans, or all of the above, or what. I think there are different ways to set it up, and I was trying not to get bogged down in details (and/or not being very careful!)

Sure, but given most reasonable choices, there will be an analogous variant of my claim, right? IE, for most reasonable model-based RL setups, the type of the reward function will be different from the type of the value function, but there will be a "solution ... (read more)

1Steve Byrnes1moYes, thanks, that's what I should have said. For "access to the reward function", we need to predict what the reward function will do (which may involve hard-to-predict things like "the human will be pleased with what I've done"). I guess your suggestion would be to call the thing-that-predicts-what-the-reward-will-be a "reward function model", and the thing-that-predicts-summed-rewards the "value function", and then to change "the value function may be different from the reward function" to "the value function may be different from the expected sum of rewards". Something like that? If so, I agree, you're right, I was wrong, I shouldn't be carelessly going back and forth between those things, and I'll change it.
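
To make the type-signature point concrete, here is a minimal sketch on a toy two-state world. Everything in it (the states, the actions, the discount factor, the function names) is a hypothetical choice for illustration, not anything from the post. The reward function scores a single (state, action) pair, while the value function maps a state to the discounted sum of future rewards and is computed by standard value iteration.

```python
from typing import Dict, List

State = str
Action = str

# Hypothetical toy world, for illustration only.
STATES: List[State] = ["home", "store"]
ACTIONS: List[Action] = ["stay", "walk"]
GAMMA = 0.9  # assumed discount factor


def step(state: State, action: Action) -> State:
    """Deterministic dynamics: walking switches location, staying does not."""
    if action == "walk":
        return "store" if state == "home" else "home"
    return state


def reward(state: State, action: Action) -> float:
    """Reward function: takes (state, action), scores one time step."""
    return 1.0 if state == "store" else 0.0


def value_iteration(n_sweeps: int = 200) -> Dict[State, float]:
    """Value function: takes a state, returns the best discounted reward sum."""
    v = {s: 0.0 for s in STATES}
    for _ in range(n_sweeps):
        v = {
            s: max(reward(s, a) + GAMMA * v[step(s, a)] for a in ACTIONS)
            for s in STATES
        }
    return v


values = value_iteration()
print(reward("home", "walk"), values["home"])
```

In this sketch being at home earns zero reward, yet the value of being at home is high because the store is one step away. So even in the simplest case the value function is not supposed to equal the reward function; it is supposed to track expected summed reward.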
My AGI Threat Model: Misaligned Model-Based RL Agent

(Much of this has been touched on already in our Discord conversation:)

Inner alignment problem: The value function might be different from the reward function.

In fact that’s an understatement: The value function will be different from the reward function. Why? Among other things, because they have different type signatures—they accept different input!

Surely this isn't relevant! We don't by any means want the value function to equal the reward function. What we want (at least in standard RL) is for the value function to be the solution to the dynamic progra... (read more)

3Steve Byrnes1moRE online learning, I acknowledge that a lot of reasonable people agree with you on that, and it's hard to know for sure. But I argued my position in Against evolution as an analogy for how humans will build AGI [https://www.lesswrong.com/posts/pz7Mxyr7Ac43tWMaC/against-evolution-as-an-analogy-for-how-humans-will-create] . Also there: a comment thread about why I'm skeptical that GPT-N would be capable of doing the things we want AGI to do, unless we fine-tune the weights on the fly, in a manner reminiscent of online learning (or amplification) [https://www.lesswrong.com/posts/pz7Mxyr7Ac43tWMaC/against-evolution-as-an-analogy-for-how-humans-will-create?commentId=PBeP8xm2BP5mvLSDb#PBeP8xm2BP5mvLSDb] .
1Steve Byrnes1moOK sure, that's fair. Point well taken. I was thinking about more brain-like neural nets that parse things into compositional pieces. If I wanted to be more prosaic maybe I would say something like: "She is differentiating both sides of the equation" could have a different value than "She is writing down a bunch of funny symbols", even if both are coming from the exact same camera inputs.
1Steve Byrnes1moThanks!! Hmm. I guess I have this ambiguous thing where I'm not specifying whether the value function is "valuing" world-states, or actions, or plans, or all of the above, or what. I think there are different ways to set it up, and I was trying not to get bogged down in details (and/or not being very careful!) Like, here's one extreme: imagine that the "planner" does arbitrarily-long-horizon rollouts of possible action sequences and their consequences in the world, and then the "value function" is looking at that whole future rollout and somehow encoding how good it is, and then you can choose the best rollout. In this case we do want the value function to converge to be (for all intents and purposes) a clone of the reward function. On the opposite extreme, when you're not doing rollouts at all, and instead the value function is judging particular states or actions, then I guess it should be less like the reward function and more like "expected upcoming reward assuming the current policy", which I think is what you're saying. Incidentally, I think the brain does both. Like, maybe I'm putting on my shoes because I know that this is the first step of a plan where I'll go to the candy store and buy candy and eat it. I'm motivated to put on my shoes by the image in my head where, a mere 10 minutes from now, I'll be back at home eating yummy candy. In this case, the value function is hopefully approximating the reward function, and specifically approximating what the reward function will do at the moment where I will eat candy. But maybe eventually, after many such trips to the candy store, it becomes an ingrained habit. And then I'm motivated to put on my shoes because my brain has cached the idea that good things are going to happen as a result—i.e., I'm motivated even if I don't explicitly visualize myself eating candy soon. 
I guess I spend more time thinking about the former (the value function is evaluating the eventual consequences of a plan) than the latter (
Prediction can be Outer Aligned at Optimum

I think this is pretty complicated, and stretches the meaning of several of the critical terms employed in important ways. I think what you said is reasonable given the limitations of the terminology, but ultimately, may be subtly misleading.

How I would currently put it (which I think strays further from the standard terminology than your analysis):

Take 1

Prediction is not a well-defined optimization problem.

Maximum-a-posteriori reasoning (with a given prior) is a well-defined optimization problem, and we can ask whether it's outer-aligned. The answer may b... (read more)

I tend to think of the latter as more compressed,

I'm not sure what you meant by "more compressed".

I used to define "agent" as "both a searcher and a controller", IE, something which uses an internal selection/search of some kind to accomplish an external control task. This might be too restrictive, though.

1Richard Ngo2moOh, I really like this definition. Even if it's too restrictive, it seems like it gets at something important. Sorry, that was quite opaque. I guess what I mean is that evolution is an optimiser but isn't an agent, and in part this has to do with how it's a very distributed process with no clear boundary around it. Whereas when you have the same problem being solved in a single human brain, then that compression makes it easier to point to the human as being an agent separate from its environment. The rest of this comment is me thinking out loud in a somewhat incoherent way; no pressure to read/respond. It seems like calling something a "searcher" describes only a very simple interface: at the end of the search, there needs to be some representation of the output which it has found. But that output may be very complex. Whereas calling something a "controller" describes a much more complex interface between it and its environment: you need to be able to point not just to outcomes, but also to observations and actions. But each of those actions is usually fairly simple for a pure controller; if it's complex, then you need search to find which action to take at each step. Now, it seems useful to sometimes call evolution a controller. For example, suppose you're trying to wipe out a virus, but it keeps mutating. Then there's a straightforward sense in which evolution is "steering" the world towards states where the virus still exists, in the short term. You could also say that it's steering the world towards states where all organisms have high fitness in the long term, but organisms are so complex that it's easier to treat them as selected outcomes, and abstract away from the many "actions" by evolution which led to this point. In other words, evolution searches using a process of iterative control. Whereas humans control using a process of iterative search. 
(As a side note, I'm now thinking that "search" isn't quite the right word, because there are other ways
Fun with +12 OOMs of Compute

If the AI and compute trend is just a blip, then doesn't that return us to the previous trend line in the graph you show at the beginning, where we progress about 2 ooms a decade? (More accurately, 1 oom every 6-7 years, or, 8 ooms in 5 decades.)

Ignoring AI and compute, then: if we believe +12 ooms in 2016 means great danger in 2020, we should believe that roughly 75 years after 2016, we are at most four years from the danger zone.

Whereas, if we extrapolate the AI-and-compute trend, +12 ooms is like jumping 12 years in the future; so the idea of risk by 2030 makes sense.

So I don't get how your conclusion can be so independent of AI-and-compute.
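
The arithmetic behind the two extrapolations can be checked with a short sketch. The rate of one OOM per year for the AI-and-compute trend is my reading of "+12 ooms is like jumping 12 years in the future"; the constant and function names are illustrative only.

```python
TARGET_OOMS = 12        # the "+12 OOMs" in the post title
START_YEAR = 2016
DANGER_LAG_YEARS = 4    # "+12 ooms in 2016 means great danger in 2020"


def arrival_year(years_per_oom: float) -> float:
    """Year in which TARGET_OOMS more compute arrives at a constant rate."""
    return START_YEAR + TARGET_OOMS * years_per_oom


# Older trend line: 8 OOMs in 5 decades, i.e. 6.25 years per OOM.
slow = arrival_year(50 / 8)
# AI-and-compute trend, assuming roughly 1 OOM per year.
fast = arrival_year(1.0)

print(slow - START_YEAR, slow + DANGER_LAG_YEARS)
print(fast - START_YEAR, fast + DANGER_LAG_YEARS)
```

Under the older trend the +12 OOMs arrive 75 years after 2016, whereas under the assumed AI-and-compute rate they arrive 12 years after, which is why the two trends should give very different timelines.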

Identifiability Problem for Superrational Decision Theories

The "signals" players receive for correlated equilibria are already semantic. So I'm suspicious that they only seem better because they call on our intuition more, with the implied risks. For example I remember reading about a result to the effect that correlated equilibria are easier to learn. This is not something we would expect from your explanation of the differences: If we explicitly added something (like the signals) into the game, it would generally get more complicated.

It's not something we would naively expect, but it does further speak in favor of... (read more)

My Current Take on Counterfactuals

I don't understand this part. Your explanation of PCDT at least didn't prepare me for it; it doesn't mention betting. And why is the payoff for the counterfactual-2-boxing determined by the beliefs of the agent after 1-boxing?

Not sure how to best answer. I'm thinking of all this in an LIDT setting, so all learning occurs through traders making bets. The payoff for 2-boxing is dependent on beliefs after 1-boxing because all share prices update every market day and the "payout" for a share is essentially what you can sell it for. Similarly, if a trader buys ... (read more)

1Bunthut1moIf a sentence is undecidable, then you could have two traders who disagree on its value indefinitely: one would have a highest price to buy that's below the other's lowest price to sell. But then anything between those two prices could be the "market price", in the classical supply and demand sense. If you say that the "payout" of a share is what you can sell it for... well, the "physical causation" trader is also buying shares on the counterfactual option that won't happen. And if he had to sell those, he couldn't sell them at a price close to where he bought them - he could only sell them at how much the "logical causation" trader values them, and so both would be losing "payout" on their trades with the unrealized option. That's one interpretation of "sell". If there's a "market maker" in addition to both traders, it depends on what prices he makes - and as outlined above, there is a wide range of prices that would be consistent for him to offer as a market maker, including ways which are very close to the logical trader's valuations - in which case, the logical trader is gaining on the physical one. Trying to communicate a vague intuition here: There is a set of methods which rely on there being a time when "everything is done", to then look back from there and do credit assignment for everything that happened before. They characteristically use backwards induction to prove things. I think markets fall into this: the argument for why ideal markets don't have bubbles is that eventually, the real value will be revealed, and so the bubble has to pop, and then someone holds the bag, and you don't want to be that someone, and people predicting this and trying to avoid it will make the bubble pop earlier, in the idealised case instantly. I also think these methods aren't going to work well with embedding. They essentially use "after the world" as a substitute for "outside the world".
My question was more "how should this roughly work" rather than "what conditions should

(a) Maybe the deceptive ticket that makes T' work is indeed there from the beginning, but maybe it's outnumbered by 'benign' tickets, so that the overall behavior of the network is benign. This is an argument against premise 4, the idea being that even though the deceptive ticket scores just as well as the rest, it still loses out because it is outnumbered.

My overall claim is that attractor-basin type arguments need to address the base case. This seems like a potentially fine way to address the base-case, if the math works out for whatever specific attract... (read more)

I'm a bit confused about part of what we're disagreeing on, so, context trace:

I originally said:

My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates.

Then you said:

Yeah, because its goal is prediction. Within prediction there isn't a right way to write a sentence. It's not a spelling mistake, it's a spelling prediction. (If you want it to not do that, then train it on...predict

I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.

Thinking about this more, I think maybe what I really want it to mean is: competent policies which are non-myopic in some sense. A truly myopic Q&A system doesn't feel much like a controller / inner optimizer (even if it is misaligned, it's not steering the world in a bad direction, because it's totally myopic).

I'm not sure what sense of "myopia" I want to use, though.

1Richard Ngo2moTo me it sounds like you're describing (some version of) agency, and so the most natural term to use would be mesa-agent. I'm a bit confused about the relationship between "optimiser" and "agent", but I tend to think of the latter as more compressed, and so insofar as we're talking about policies it seems like "agent" is appropriate. Also, mesa-optimiser is taken already (under a definition which assumes that optimisation is equivalent to some kind of internal search).

I guess the crux here is how much deceptiveness do you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why let's say SGD will make good deceptive models more deceptive is that making them less deceptive would mean bigger loss and so it pushes towards more deception.

I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.

1Adam Shimi2moAgreed, it depends on the training process.

Also, why include "misaligned" in this definition? If mesa-controller turns out to be a useful concept, then I'd want to talk about both aligned and misaligned mesa-controllers.

Right, agreed, I'll consider editing.

I'm confused about what wouldn't qualify as a mesa-controller. In practice, is this not synonymous with "capable"?

Do you think that's a problem?

Do you think that's a problem?

I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.

I guess the mesa- prefix helps point towards the fact that we're talking about policies, not policies + optimisers.

Probably my preferred terminology would be:

• Instead of mesa-controller, "competent policy".
• And then we can say that competent policies sometimes implement search or learning (or both, or possibly neither).
• And when

But that seems to me like something that Evan says quite often, which is that once the model is deceptive you can't expect it to go back to non-deceptiveness (maybe because of stuff like gradient hacking). Hence the need for a buffer around the deceptive region.

I guess the difference is that instead of the deceptive region of the model space, it's the "your innate deceptiveness has won" region of the model space?

Right, so, the point of the argument for basin-like proposals is this:

A basin-type solution has to 1. initialize in such a way as to be within a good... (read more)

3Adam Shimi2moI guess the crux here is how much deceptiveness do you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why let's say SGD will make good deceptive models more deceptive is that making them less deceptive would mean bigger loss and so it pushes towards more deception. On the other hand, if there's just a tiny probability or tiny part of deception in the model (not sure exactly what this means), then I expect that there are small updates that SGD can do that don't make the model more deceptive (and maybe make it less deceptive) and yet reduce the loss. That's the intuition that to learn that lying is a useful strategy, you must actually be "good enough" at lying (maybe by accident) to gain from it and adapt to it. I have friends who really suck at lying, and for them trying to be deceptive is just not worth it (even if they wanted to). If you actually need deceptiveness to be strong already to have this issue, then I don't think your ELH points to a problem because I don't see why deceptiveness should dominate already.