All of dxu's Comments + Replies

Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:

You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways


If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).

It certainly seems to me that e.... (read more)

I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems quite comparable to future humans having control over the future (also after a lot of effective "reflection"). I think there's some validity to worrying about a future with very different values from today's. But I think misaligned AI is (reasonably) usually assumed to diverge in more drastic and/or "bad" ways than humans themselves would if they stayed in control; I think of this difference as the major driver of wanting to align AIs at all. And it seems Nate thinks that the hypothetical training process I outline above gets us something much closer to "misaligned AI" levels of value divergence than to "ems" levels of value divergence.

If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.

It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.

but, bu... (read more)

2 · Rohin Shah · 6mo
Indeed I am confused why people think Goodharting is effectively-100%-likely to happen and also lead to all the humans dying. Seems incredibly extreme. All the examples people give of Goodharting do not lead to all the humans dying. (Yes, I'm aware that the arguments are more sophisticated than that and "previous examples of Goodharting didn't lead to extinction" isn't a rebuttal to them, but that response does capture some of my attitude towards the more sophisticated arguments, something like "that's a wildly strong conclusion you've drawn from a pretty handwavy and speculative argument".)

Ultimately I think you just want to compare various kinds of models and ask how likely they are to arise (assuming you are staying within the scaled up neural networks as AGI paradigm). Some models you could consider:

1. The idealized aligned model, which does whatever it thinks is best for humans
2. The savvy aligned model, which wants to help humans but knows that it should play into human biases (e.g. by being sycophantic) in order to get high reward and not be selected against by gradient descent
3. The deceptively aligned model, which wants some misaligned goal (say paperclips), but knows that it should behave well until it can execute a treacherous turn
4. The bag of heuristics model, which (like a human) has a mix of various motivations, and mostly executes past strategies that have worked out well, imitating many of them from broader culture, without a great understanding of why they work, which tends to lead to high reward without extreme consequentialism

(Really I think everything is going to be (4) until significantly past human-level, but will be on a spectrum of how close they are to (2) or (3).) Plausibly you don't get (1) because it doesn't get particularly high reward relative to the others. But (2), (3) and (4) all seem like they could

The section about identifying mesa-optimizers (or, more precisely, mesa-optimization strategies) by fingerprinting their generalization behavior on out-of-distribution inputs looks very promising to me. It looks like the rare kind of strategy that directly attacks the core of the misbehavior, and (on first pass) looks to me like it ought to just work, provided sufficient variance in the OOD "test" inputs.

Strong-upvoted for that alone, and I'd further like to request replies with potential defeaters that could curb my current optimism for this approach.
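To make the idea concrete, here is a minimal, purely illustrative sketch of the kind of fingerprinting being discussed (all names and numbers are mine, not from any actual proposal): run candidate models on a battery of off-distribution probe inputs and measure how far their outputs diverge from a reference model's. Two models can agree almost perfectly in-distribution while their learned objectives generalize very differently out-of-distribution.

```python
import numpy as np

def fingerprint(model, probes):
    """Stack the model's outputs on a fixed battery of probe inputs."""
    return np.stack([model(x) for x in probes])

def divergence(model_a, model_b, probes):
    """Mean L2 distance between two models' fingerprints on the probes.
    Large values flag candidates whose learned objectives generalize
    differently, even when they behave identically in-distribution."""
    fa, fb = fingerprint(model_a, probes), fingerprint(model_b, probes)
    return float(np.mean(np.linalg.norm(fa - fb, axis=-1)))

# Toy stand-ins for trained models: both behave nearly the same on small
# (in-distribution) inputs, but the linear "proxy" extrapolates very
# differently from the bounded "aligned" policy on large (OOD) inputs.
aligned = lambda x: np.tanh(x)
proxy   = lambda x: x

in_dist  = [np.array([0.1 * i]) for i in range(5)]
out_dist = [np.array([10.0 * i]) for i in range(1, 6)]

print(divergence(aligned, proxy, in_dist))   # tiny: indistinguishable in-distribution
print(divergence(aligned, proxy, out_dist))  # large: the OOD fingerprints separate them
```

The sketch obviously elides the hard part (generating probes with sufficient variance, and knowing what divergence pattern indicates mesa-optimization rather than benign generalization differences), but it shows why the method needs genuinely out-of-distribution probes at all.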

I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:

  • If I understand correctly, we'd need a model that we're confident is a mesa-optimizer (and perhaps even deceptive---mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are "thresholds" where slight changes have big effects on how dangerous a model is.
  • If there's a very strong inductive bias towards deception,
... (read more)

I agree. From the moment JDP suggested this idea it struck me as one of the first implementable proposals I'd seen which might actually attack the core of the control problem. My intuition also says it's pretty likely to just work, especially after these results. And even if it doesn't end up working as planned, the way in which it fails will give us important insight about training dynamics and/or generalization. Experiments which will give you valuable information whatever the outcome are the type we should be aiming for.

It's one of those things that we'd be plainly undignified not to try.

I believe that JDP is planning to publish a post explaining his proposal in more detail soon.

This is a very interesting point! I will chip in by pointing out a very similar remark from Rohin just earlier today:

And I'll reiterate again because I anticipate being misunderstood that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.

That is all.

(Obviously there's a kinda superficial resemblance here to the phenomenon of "calling out" somebody else; I want to state outright that this is ... (read more)

3 · Matthew "Vaniver" Gray · 2y
Yeah, I'm also interested in the question of "how do we distinguish 'sentences-on-mainline' from 'shoring-up-edge-cases'?", or which conversational moves most develop shared knowledge, or something similar.

Like I think it's often good to point out edge cases, especially when you're trying to formalize an argument or look for designs that get us out of this trap. In another comment in this thread, I note that there's a thing Eliezer said that I think is very important and accurate, and also think there's an edge case that's not obviously handled correctly.

But also my sense is that there's some deep benefit from "having mainlines" and conversations that are mostly 'sentences-on-mainline'? Or, like, there's some value to more people thinking thru / shooting down their own edge cases (like I do in the mentioned comment), instead of pushing the work to Eliezer. I'm pretty worried that there are deeply general reasons to expect AI alignment to be extremely difficult, that people aren't updating on the meta-level point and continue to attempt 'rolling their own crypto', asking if Eliezer can poke the hole in this new procedure, and that if Eliezer ever decides to just write serial online fiction until the world explodes, humanity hasn't developed enough capacity to replace him.

In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a soluti

... (read more)

(I endorse dxu's entire reply.)

6 · Rohin Shah · 2y
Stated differently than how I'd say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.

Yeah, in my ontology (and in this context) reflexiveness is a property of cognitions, not of actions. I can reflexively reach into a transparent pipe to pick up a sandwich, without searching over possible plans for getting the sandwich (or at least, without any conscious search, and without any search via trying different plans and seeing if they work); one random video I've seen suggests that (some kind of) monkeys struggle to do this and may have to experiment with different plans to get the food. (I use this anecdote as an illustration; I don't know if it is actually true.)

See also the first few sections of Argument, intuition, and recursion; in the language of that post I'm thinking of "explicit argument" as "trying", and "intuition" as "reflex-like", even though they output the same thing.

Within my ontology, you could define behavioral-reflexivity as those behaviors / actions that a human could do with reflexive cognition, and then more competent actions are behavioral-trying. These concepts might match yours. In that case I'm saying that it's plausible that there's a wide gap between behavioral-trying-2 and behavioral-trying-3, but really my intuition is coming much more from finding the trying-2 cognitions significantly more likely than the trying-3 cognitions, and thinking that the trying-2 cognitions could scale without becoming trying-3 cognitions.

Or, to try and say things a bit more concretely, I find it plausible that there is more scaling from improving the efficiency of the search (e.g. by having better tuned heuristics and intuitions), than from expanding the domain of possible plans considered by the search. The 4 styles of trying that Rob mentioned exist on a continuum like "domain of possible plans", but instead we mostly walk up the continuum of "efficiency / competence of search within the doma

From my (dxu's) perspective, it's allowable for there to be "deep fundamental theories" such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.

To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether's theorem, which ties conserved quantities in physics to symmetries in physical laws. Before someone becomes aware of this, it's perhaps possible for them to imagine a universe exa... (read more)
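For concreteness, the standard textbook derivation behind this claim (not specific to this thread): in Lagrangian mechanics, time-translation symmetry of the Lagrangian forces conservation of the Hamiltonian, i.e. of energy.

```latex
% Total time derivative of the Lagrangian L(q, \dot q, t):
\frac{dL}{dt}
  = \sum_i \frac{\partial L}{\partial q_i}\,\dot q_i
  + \sum_i \frac{\partial L}{\partial \dot q_i}\,\ddot q_i
  + \frac{\partial L}{\partial t}
% Substituting the Euler-Lagrange equations
% \partial L/\partial q_i = \frac{d}{dt}\,\partial L/\partial \dot q_i
% and collecting terms gives
\frac{d}{dt}\Bigl(\sum_i \dot q_i \frac{\partial L}{\partial \dot q_i} - L\Bigr)
  = -\frac{\partial L}{\partial t}
% If L has no explicit time dependence (time-translation symmetry),
% the right-hand side vanishes, so the Hamiltonian
% H = \sum_i \dot q_i \,\partial L/\partial \dot q_i - L
% is conserved: energy conservation falls out of the symmetry.
```

Once this derivation is internalized, imagining a universe with time-translation-symmetric laws but no conserved energy is no longer a coherent exercise, which is exactly the sense of "losing the ability to imagine counterfactual worlds" described above.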

1 · Luke H Miles · 2y
The difference between evolution and gradient descent is sexual selection and predator/prey/parasite relations. Agents running around inside everywhere -- completely changes the process. Likewise for comparing any kind of flat optimization or search to evolution. I think sexual selection and predator-prey made natural selection dramatically more efficient. So I think it's pretty fair to object that you don't take evolution as adequate evidence to expect this flat, dead, temporary number cruncher will blow up in exponential intelligence. I think there are other reasons to expect that though. I haven't read these 500 pages of dialogues so somebody probably made this point already.

My objection is mostly fleshed out in my other comment. I'd just flag here that "In other words, you have to do things the "hard way"--no shortcuts" assigns the burden of proof in a way which I think is not usually helpful. You shouldn't believe my argument that I have a deep theory linking AGI and evolution unless I can explain some really compelling aspects of that theory. Because otherwise you'll also believe in the deep theory linking AGI and capitalism, and the one linking AGI and symbolic logic, and the one linking intelligence and ethics, and the on... (read more)

Damn. I actually think you might have provided the first clear pointer I've seen about this form of knowledge production, why and how it works, and what could break it. There's a lot to chew on in this reply, but thanks a lot for the amazing food for thought!

(I especially like that you explained the physical points and put links that actually explain the specific implication)

And I agree (tentatively) that a lot of the epistemology of science stuff doesn't have the same object-level impact. I was not claiming that normal philosophy of science was required, just that if that was not how we should evaluate and try to break the deep theory, I wanted to understand how I was supposed to do that.

Speaking from my own perspective: I definitely had a sense, reading through that section of the conversation, that Richard's questions were somewhat... skewed? ... relative to the way I normally think about the topic. I'm having some difficulty articulating the source of that skewness, so I'll start by talking about how I think the skewness relates to the conversation itself:

I interpreted Eliezer's remarks as basically attempting to engage with Richard's questions on the same level they were being asked--but I think his lack of ability to come up with comp... (read more)

Strong upvote, you're pointing at something very important here. I don't think I'm defending epistemic modesty, I think I'm defending epistemic rigour, of the sort that's valuable even if you're the only person in the world.

I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc).

Yes, this is corre... (read more)

Like, there's a certain kind of theory/model which generalizes well to many classes of new cases and makes nontrivial predictions in those new cases, and those kinds-of-theories/models have a pattern to them which is recognizable.

Could I ask you to say more about what you mean by "nontrivial predictions" in this context? It seems to me like this was a rather large sticking point in the discussion between Richard and Eliezer (that is, the question of whether expected utility theory--as a specific candidate for a "strongly generalizing theory"--produces "non... (read more)

Oh, I can just give you a class of nontrivial predictions of expected utility theory. I have not seen any empirical results on whether these actually hold, so consider them advance predictions.

So, a bacteria needs a handful of different metabolic resources - most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often bacteria can produce some metabolic resources via multiple different paths, including cyclical paths - e.g. it's useful to be able to turn A into B but also B into A, because sometimes the environment will have lots... (read more)
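The comment is truncated here, but the general shape of this class of predictions can be glossed as a coherence check (this is my extrapolation, with entirely made-up numbers): if the bacterium allocates resources coherently in the expected-utility sense, its implicit marginal exchange rates between metabolites should admit no "money pump", i.e. the product of exchange rates around any closed cycle like A → B → A should be approximately 1.

```python
# Hypothetical marginal exchange rates between metabolic resources:
# rates[a][b] = units of b the cell will "pay" one marginal unit of a for.
# The numbers are purely illustrative, not measured values.
rates = {
    "ATP":   {"amino": 2.0, "lipid": 4.0},
    "amino": {"lipid": 2.0, "ATP": 0.5},
    "lipid": {"ATP": 0.25, "amino": 0.5},
}

def cycle_product(path):
    """Product of exchange rates around a closed cycle of resources.
    Coherent (expected-utility-consistent) tradeoffs predict ~1 for
    every cycle; a value != 1 would be an exploitable money pump."""
    total = 1.0
    for a, b in zip(path, path[1:] + path[:1]):
        total *= rates[a][b]
    return total

for cycle in (["ATP", "amino"], ["ATP", "lipid"], ["ATP", "amino", "lipid"]):
    print(cycle, cycle_product(cycle))  # each ~1.0 in this coherent example
```

The empirical content is then in whether measured exchange rates (e.g. inferred from flux responses to resource perturbations) actually satisfy these cycle constraints.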

[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don't anti-endorse them either, or else I wouldn't be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer's, but instea... (read more)

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Yeah, this is obviously true. Certainly if you have an objective of imitating something that would act deceptively, you'll get deception. The solution isn't to somehow “filter out the unwanted instrumental behavio... (read more)

It still doesn't seem to me like you've sufficiently answered the objection here.

I tend to think that HCH is not dangerous, but I agree that it's likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.

What if any sufficiently powerful objective is non-myopic? Or, on a different-but-equivalent phrasing: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed... (read more)

3 · Evan Hubinger · 2y
It just reasons about them, using deduction, prediction, search, etc., the same way any optimizer would. The sense that it's still myopic is in the sense that it's non-deceptive, which is the only sense that we actually care about. The safety improvement that I'm claiming is that it wouldn't be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?

So, the point of my comments was to draw a contrast between having a low opinion of "experimental work and not doing only decision theory and logic", and having a low opinion of "mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc." I didn't intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as 'oh yeah, th

... (read more)
6 · Rohin Shah · 2y
^ This response is great.

I also think I naturally interpreted the terms in Adam's comment as pointing to specific clusters of work in today's world, rather than universal claims about all work that could ever be done. That is, when I see "experimental work and not doing only decision theory and logic", I automatically think of "experimental work" as pointing to a specific cluster of work that exists in today's world (which we might call mainstream ML alignment), rather than "any information you can get by running code". Whereas it seems you interpreted it as something closer to "MIRI thinks there isn't any information to get by running code".

My brain insists that my interpretation is the obvious one and is confused how anyone (within the AI alignment field, who knows about the work that is being done) could interpret it as the latter. (Although the existence of non-public experimental work that isn't mainstream ML is a good candidate for how you would start to interpret "experimental work" as the latter.) But this seems very plausibly a typical mind fallacy.

EDIT: Also, to explicitly say it, sorry for misunderstanding what you were trying to say. I did in fact read your comments as saying "no, MIRI is not categorically against mainstream ML work, and MIRI is not only working on HRAD-ish stuff like decision theory and logic, and furthermore this should be pretty obvious to outside observers", and now I realize that is not what you were saying.
1 · Rob Bensinger · 2y
This is a good comment! I also agree that it's mostly on MIRI to try to explain its views, not on others to do painstaking exegesis. If I don't have a ready-on-hand link that clearly articulates the thing I'm trying to say, then it's not surprising if others don't have it in their model. And based on these comments, I update that there's probably more disagreement-about-MIRI than I was thinking, and less (though still a decent amount of) hyperbole/etc. If so, sorry about jumping to conclusions, Adam!

Thanks for elaborating. I don't think I have the necessary familiarity with the alignment research community to assess your characterization of the situation, but I appreciate your willingness to raise potentially unpopular hypotheses to attention. +1

2 · Adam Shimi · 2y
Thanks for taking the time of asking a question about the discussion even if you lack expertise on the topic. ;)

Similarly, the fact that they kept at it over and over with all the big improvements of DL instead of trying to adapt to prosaic Alignment sounds like evidence that they might be over-attached to a specific framing, which they had trouble discarding.

I'm... confused by this framing? Specifically, this bit (as well as other bits like these)

I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like

... (read more)

(Later added disclaimer: it's a good idea to add "I feel like..." before the judgment in this comment, so that you keep in mind that I'm talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))

Okay, so you're completely right that a lot of my points are logically downstream of the debate on whether Prosaic Alignment is Impossible or not. But I feel like you don't get how one-sided this debate is, and how misrepresented it is here (and generally on the AF).

Like nobody except EY and a bunch of core ... (read more)

Eliezer Yudkowsky

Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn't scale, I'd expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.

I think this view dovetails quite strongly with the view expressed in this comment by maximkazhenkov:

Progress in model-based RL is far more relevant to getting us closer to AGI than other fields like NLP or image recognition or neuroscienc

... (read more)

The output of this process is something people have taken to calling Son-of-CDT; the problem (insofar as we understand Son-of-CDT well enough to talk about its behavior) is that the resulting decision theory continues to neglect correlations that existed prior to self-modification.

(In your terms: Alice and Bob would only one-box in Newcomb variants where Omega based his prediction on them after they came up with their new decision theory; Newcomb variants where Omega's prediction occurred before they had their talk would still be met with two-boxing, even ... (read more)

One particular example of this phenomenon that comes to mind:

In (traditional) chess-playing software, generally moves are selected using a combination of search and evaluation, where the search is (usually) some form of minimax with alpha-beta pruning, and the evaluation function is used to assign a value estimate to leaf nodes in the tree, which are then propagated to the root to select a move.

Typically, the evaluation function is designed by humans (although recent developments have changed that second part somewhat) to reflect meaningful features of che... (read more)
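The move-selection scheme described above can be sketched as follows (a toy illustration of the general technique, not any particular engine's code): minimax with alpha-beta pruning over the game tree, with a hand-designed evaluation function applied at the leaves and propagated back to the root.

```python
def alphabeta(state, depth, alpha, beta, maximizing, moves, evaluate):
    """Minimax with alpha-beta pruning. `moves(state)` yields
    (move, child_state) pairs; `evaluate(state)` is the hand-designed
    leaf evaluation whose estimates propagate back to the root."""
    children = list(moves(state))
    if depth == 0 or not children:
        return evaluate(state), None
    best_move = None
    if maximizing:
        value = float("-inf")
        for move, child in children:
            score, _ = alphabeta(child, depth - 1, alpha, beta, False, moves, evaluate)
            if score > value:
                value, best_move = score, move
            alpha = max(alpha, value)
            if alpha >= beta:  # prune: the opponent will avoid this line anyway
                break
    else:
        value = float("inf")
        for move, child in children:
            score, _ = alphabeta(child, depth - 1, alpha, beta, True, moves, evaluate)
            if score < value:
                value, best_move = score, move
            beta = min(beta, value)
            if alpha >= beta:
                break
    return value, best_move

# Toy "game" instead of chess: the state is a number, each player adds 1 or 2,
# and the leaf evaluation is the number itself (maximizer wants it large).
toy_moves = lambda s: [("+1", s + 1), ("+2", s + 2)]
value, move = alphabeta(0, 3, float("-inf"), float("inf"), True, toy_moves, lambda s: s)
print(value, move)  # 5 +2
```

The relevant point for the surrounding discussion is that the "meaningful features" live entirely inside `evaluate`; the search machinery itself is feature-agnostic.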

This is one hell of a good comment! Strong-upvoted.

If it's read moral philosophy, it should have some notion of what the words "human values" mean.

GPT-3 and systems like it are trained to mimic human discourse. Even if (in the limit of arbitrary computational power) it manages to encode an implicit representation of human values somewhere in its internal state, in actual practice there is nothing tying that representation to the phrase "human values", since moral philosophy is written by (confused) humans, and in human-written text the phrase "human values" is not used in the consistent, coherent manner that would be required to infer its use as a label for a fixed concept.

1 · John Maxwell · 3y
This is essentially the "tasty ice cream flavors" problem, am I right? Trying to check if we're on the same page. If so: John Wentworth said So how about instead of talking about "human values", we talk about what a particular moral philosopher endorses saying or doing, or even better, what a committee of famous moral philosophers would endorse saying/doing.

1 and 2 are hard to succeed at without making a lot of progress on 4

It's not obvious to me why this ought to be the case. Could you elaborate?

1 · Alex Turner · 4y
I also am curious why this should be so. I also continue to disagree with Stuart on low impact in particular being intractable without learning human values.

If there's some kind of measure of "observer weight" over the whole mathematical universe, we might already be much larger than 1/3^^^3 of it, so the total utilitarian can only gain so much.

Could you provide some intuition for this? Naively, I'd expect our "observer measure" over the space of mathematical structures to be 0.

The weight could be something like the algorithmic probability over strings, in which case universes like ours with a concise description would get a fairly large chunk of the weight.
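A toy numerical gloss of why a concise description can capture a non-negligible chunk of such a weight (my own illustration, with a deliberate simplification: one representative hypothesis per description length, whereas a real prefix-free prior sums over all programs of each length): under a Solomonoff-style prior, an ℓ-bit shortest description gets weight on the order of 2^-ℓ, so a single 100-bit universe outweighs the combined weight of every representative of length 120 bits and up.

```python
def weight(description_length_bits):
    """Solomonoff-style prior weight for one program of the given length."""
    return 2.0 ** -description_length_bits

# One representative hypothesis per description length (a simplification;
# a full prefix-free prior would sum over all programs of each length).
short = weight(100)
long_tail = sum(weight(l) for l in range(120, 1000))

# The geometric tail sums to ~2**-119, so the single 100-bit hypothesis
# carries roughly 2**19 (~half a million) times the weight of the tail.
print(short / long_tail)
```

This is the sense in which a measure over all mathematical structures need not assign our universe anything close to measure zero: concision, not count, dominates.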