David Xu

Yeah, I'm not actually convinced humans are "aligned under reflection" in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:

> You have just done a lot of steps, many of which involved reflection, with no particular way to get 'back on track' if you've done some of them in goofy ways
>
> [...]
>
> If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it'd be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).

It certainly seems to me that e.g. people like Ziz have done reflection in a "goofy" way, and that being human has not particularly saved them from deriving "crazy stuff". Of course, humans doing reflection would still be confined to a subset of the mental moves being done by crazy minds made out of gradient descent on matrix multiplication, but it's currently plausible to me that part of the danger arises simply from "reflection on (partially) incoherent starting points" getting really crazy really fast.

(It's not yet clear to me how this intuition interfaces with my view on alignment hopes; you'd expect it to make things worse, but I actually think this is already "priced in" w.r.t. my P(doom), so explicating it like this doesn't actually move me—which is about what you'd expect, and strive for, as someone who tries to track both their object-level beliefs and the implications of those beliefs.)

(EDIT: I mean, a lot of what I'm saying here is basically "CEV" might not be so "C", and I don't actually think I've ever bought that to begin with, so it really doesn't come as an update for me. Still worth making explicit though, IMO.)

> If you notice and penalize that just because the action is bad, without ever figuring out whether the underlying motivation was bad or not, that still selects against models with bad motivations.
>
> It's plausible that you then get a model with bad motivations that knows not to produce bad actions until it is certain those will not be caught. But it's also plausible that you just get a model with good motivations. I think the more you succeed at noticing bad actions (or good actions for bad reasons) the more likely you should think good motivations are.

but, but, standard counterargument: imperfect proxies, Goodharting, magnification of error, adversarial amplification, etc., etc., etc.?

(It feels weird that this is a point that seems to consistently come up in discussions of this type, considering how basic a disagreement it really is, but it really does seem to me like lots of things come back to this over and over again?)

The section about identifying mesa-optimizers (or, more precisely, mesa-optimization strategies) by fingerprinting their generalization behavior on out-of-distribution inputs looks very promising to me. It looks like the rare kind of strategy that directly attacks the core of the misbehavior, and (on first pass) looks to me like it ought to just work, provided sufficient variance in the OOD "test" inputs.

Strong-upvoted for that alone, and I'd further like to request replies with potential defeaters that could curb my current optimism for this approach.
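To make sure I'm engaging with the concrete version of the proposal, here is a minimal toy sketch of the kind of fingerprinting check I take it to imply (my own construction, not code from the post; `intended`, `candidate`, and the probe battery are hypothetical stand-ins): two models that agree on the training distribution but implement different strategies become distinguishable once you compare their outputs far off-distribution.

```python
import numpy as np

def ood_fingerprint(model, probe_inputs):
    """Stack a model's outputs on a battery of out-of-distribution probes."""
    return np.stack([np.asarray(model(x)) for x in probe_inputs])

def ood_divergence(candidate, reference, probe_inputs):
    """Mean squared gap between a candidate's OOD behavior and a reference
    policy's. A large value suggests the candidate learned a different
    generalization strategy, even if the two agree on-distribution."""
    gap = ood_fingerprint(candidate, probe_inputs) - ood_fingerprint(reference, probe_inputs)
    return float(np.mean(gap ** 2))

# Toy example: both functions nearly coincide on the "training" range near 0,
# but generalize very differently outside it.
intended = lambda x: np.sin(x)        # stand-in for the behavior we want
candidate = lambda x: x - x**3 / 6    # truncated-series proxy of sin(x)
probes = [np.array([v]) for v in np.linspace(3.0, 6.0, 10)]  # far off-distribution
print(ood_divergence(candidate, intended, probes))  # large => flag for inspection
```

The hard part, of course, is constructing probe inputs with enough variance to separate the strategies we actually care about, which is exactly the proviso above.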

This is a very interesting point! I will chip in by pointing out a very similar remark from Rohin just earlier today:

> And I'll reiterate again, because I anticipate being misunderstood, that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.

That is all.

(Obviously there's a kinda superficial resemblance here to the phenomenon of "calling out" somebody else; I want to state outright that this is not the intention, it's just that I saw your comment right after seeing Rohin's comment, in such a way that my memory of his remark was still salient enough that the connection jumped out at me. Since salient observations tend to fade over time, I wanted to put this down before that happened.)

> In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a solution that we can be justifiably confident in.

You (a human) already exhibit #2-style trying. Despite this, you are not capable of "establishing a stable governance regime that regulates AI development" or "doing alignment research better than any existing human alignment researchers" (the latter is tautologically true, even).

So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described (or, indeed, most any pivotal act that we might recognize as "pivotal"). It then follows that if a system is capable enough to enact some such pivotal act, some part of that system must have been running a stronger search than the kind of search described in "#2-style trying". And if you buy Eliezer's/Nate's argument that it's the search itself that's dangerous, rather than the fact that you (maybe) wrapped up the search in an outer shell you happen to call "oracle AI" (or something), then it's not a large jump from there to "maybe the search decides 'killing all humans' rates highly according to its search criteria".

But perhaps you're conceptualizing this whole "trying" thing differently, because you go on to say:

> Idk, I'm also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that's more efficient than scaled-up reflex-like things).

which actually just does not parse in my native ontology. Like, in my ontology "sufficiently scaled-up reflex-like things" stop behaving reflexively. It's not that you have this abstract label "reflex-like", that you can slap onto some system, such that if you then scale that system up the label stays stuck to it indefinitely; in my model reflexiveness is a property of actions, not of systems, and if you make a system sufficiently powerful it leaves the regime where reflex-like behavior is its default. It automatically goes from #1 to #2 to #3 to #4 in the limit of sufficient scaling; this is, from my perspective, what is meant by the claim "these things exist on a continuum" (which claim it seems like you agreed with in a parallel comment thread, which simply furthers my confusion).

From my (dxu's) perspective, it's allowable for there to be "deep fundamental theories" such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.

To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether's theorem, which ties conserved quantities in physics to symmetries in physical laws. Before someone becomes aware of this, it's perhaps possible for them to imagine a universe exactly like our own, except that energy is not conserved; once they understand the connection implied by Noether's theorem, this becomes an incoherent notion: you cannot remove the conservation-of-energy property without changing deep aspects of the laws of physics.
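To spell out the connection concretely (a standard textbook derivation, included here only as an illustration): the relevant symmetry is time-translation invariance, i.e. a Lagrangian $L(q, \dot q, t)$ with no explicit dependence on $t$. Defining the energy function and using the Euler-Lagrange equations $\frac{d}{dt}\frac{\partial L}{\partial \dot q_i} = \frac{\partial L}{\partial q_i}$,

$$E \;\equiv\; \sum_i \dot q_i \frac{\partial L}{\partial \dot q_i} - L, \qquad \frac{dE}{dt} \;=\; \sum_i \dot q_i\!\left(\frac{d}{dt}\frac{\partial L}{\partial \dot q_i} - \frac{\partial L}{\partial q_i}\right) - \frac{\partial L}{\partial t} \;=\; -\frac{\partial L}{\partial t}.$$

So energy is conserved exactly when $\partial L/\partial t = 0$: you cannot delete conservation of energy without also changing how the laws themselves depend on time.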

The second law of thermodynamics is similarly deep: it's actually a consequence of there being a (low-entropy) boundary condition at the beginning of the universe, but no corresponding (low-entropy) boundary condition at any future state. This asymmetry in boundary conditions is what causes entropy to appear directionally increasing--and again, once someone becomes aware of this, it is no longer possible for them to imagine living in a universe which started out in a very low-entropy state, but where the second law of thermodynamics does not hold.
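As a toy numerical illustration of the boundary-condition point (my own sketch, not part of the original discussion; all parameters are arbitrary): take perfectly reversible, unbiased dynamics, impose a low-entropy condition only at t = 0, and coarse-grained entropy still climbs toward its maximum.

```python
import numpy as np

rng = np.random.default_rng(0)

# N particles on a ring of B sites, all starting at site 0:
# a low-entropy boundary condition imposed at t = 0 only.
N, B, steps = 1000, 20, 201
positions = np.zeros(N, dtype=int)

def coarse_entropy(pos):
    """Shannon entropy of the coarse-grained occupation distribution."""
    p = np.bincount(pos, minlength=B) / len(pos)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

for t in range(steps):
    if t % 50 == 0:
        print(f"t={t:4d}  entropy={coarse_entropy(positions):.3f}")
    # Unbiased +/-1 steps on a ring: the update rule is time-symmetric.
    positions = (positions + rng.choice([-1, 1], size=N)) % B

# Entropy rises toward log(B) ~= 3.0 even though nothing in the dynamics
# prefers a time direction; the arrow comes from the initial condition.
```

The only time-asymmetric ingredient in the whole setup is the boundary condition.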

In other words, thermodynamics as a "deep fundamental theory" is not merely [what you characterized as] a "powerful abstraction that is useful in a lot of domains". Thermodynamics is a logically necessary consequence of existing, more primitive notions--and the fact that (historically) we arrived at our understanding of thermodynamics via a substantially longer route (involving heat engines and the like), without noticing this deep connection until much later on, does not change the fact that grasping said deep connection allows one to see "at a glance" why the laws of thermodynamics inevitably follow.

Of course, this doesn't imply infinite certainty, but it does imply a level of certainty substantially higher than what would be assigned merely to a "powerful abstraction that is useful in a lot of domains". So the relevant question would seem to be: given my above described epistemic state, how might one convince me that the case for thermodynamics is not as airtight as I currently think it is? I think there are essentially two angles of attack: (1) convince me that the arguments for thermodynamics being a logically necessary consequence of the laws of physics are somehow flawed, or (2) convince me that the laws of physics don't have the properties I think they do.

Both of these are hard to do, however--and for good reason! And absent arguments along those lines, I don't think I am (or should be) particularly moved by [what you characterized as] philosophy-of-science-style objections about "advance predictions", "systematic biases", and the like. I think there are certain theories for which the object-level case is strong enough that it more or less screens off meta-level objections; and I think this is right, and good.

Which is to say:

> The mental move I'm doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not adapted. If you're describing something that doesn't commute, addition might be a deep theory, but it's not useful for what you want. Similarly, you could argue that given how we're building AIs and trying to build AGI, evolution is not the deep theory that you want to use. (emphasis mine)

I think you could argue this, yes--but the crucial point is that you have to actually argue it. You have to (1) highlight some aspect of the evolutionary paradigm, (2) point out [what appears to you to be] an important disanalogy between that aspect and [what you expect cognition to look like in] AGI, and then (3) argue that that disanalogy directly undercuts the reliability of the conclusions you would like to contest. In other words, you have to do things the "hard way"--no shortcuts.

...and the sense I got from Richard's questions in the post (as well as the arguments you made in this subthread) is one that very much smells like a shortcut is being attempted. This is why I wrote, in my other comment, that

> I don't think I have a good sense of the implied objections contained within Richard's model. That is to say: I don't have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because--absent a strong model of expected utility theory's likely failure modes--I don't think questions of the form "but why hasn't your theory made a lot of successful advance predictions yet?" move me very much on the object level.

I think I share Eliezer's sense of not really knowing what Richard means by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", and I think what would clarify this for me would have been for Richard to provide examples of "deep fundamental theories [with] a wide range of applications we hadn't previously thought of", accompanied by an explanation of why, if those applications hadn't been present, that would have indicated something wrong with the theory.

Speaking from my own perspective: I definitely had a sense, reading through that section of the conversation, that Richard's questions were somewhat... skewed? ... relative to the way I normally think about the topic. I'm having some difficulty articulating the source of that skewness, so I'll start by talking about how I think the skewness relates to the conversation itself:

I interpreted Eliezer's remarks as basically attempting to engage with Richard's questions on the same level they were being asked--but I think his lack of ability to come up with compelling examples (to be clear: by "compelling" here I mean "compelling to Richard") likely points at a deeper source of disagreement (which may or may not be the same generator as the "skewness" I noticed). And if I were forced to articulate the thing I think the generator might be...

I don't think I have a good sense of the implied objections contained within Richard's model. That is to say: I don't have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because--absent a strong model of expected utility theory's likely failure modes--I don't think questions of the form "but why hasn't your theory made a lot of successful advance predictions yet?" move me very much on the object level.

Probing more at the sense of skewness, I'm getting the sense that this exchange here is deeply relevant:

> Richard: I'm accepting your premise that it's something deep and fundamental, and making the claim that deep, fundamental theories are likely to have a wide range of applications, including ones we hadn't previously thought of.
>
> Do you disagree with that premise, in general?
>
> Eliezer: I don't know what you really mean by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", especially when it comes to structures that are this simple. It sounds like you're still imagining something I mean by Expected Utility which is some narrow specific theory like a particular collection of gears that are appearing in lots of places.

I think I share Eliezer's sense of not really knowing what Richard means by "deep fundamental theory" or "wide range of applications we hadn't previously thought of", and I think what would clarify this for me would have been for Richard to provide examples of "deep fundamental theories [with] a wide range of applications we hadn't previously thought of", accompanied by an explanation of why, if those applications hadn't been present, that would have indicated something wrong with the theory.

But the reason I'm calling the thing "skewness", rather than something more prosaic like "disagreement", is that I suspect Richard isn't actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of "successful advance predictions"/"wide applications" corresponds to those likely failure modes, etc). I suspect that the frame Richard is operating in would dismiss these questions as largely inconsequential, even though I'm not sure why or what that frame actually looks like; this is a large part of the reason why I have this flagged as a place to look for a deep hidden crux.

(One [somewhat uncharitable] part of me wants to point out that the crux in question may actually just be the "usual culprit" in discussions like this: outside-view/modest-epistemology-style reasoning. This does seem to rhyme a lot with what I wrote above, e.g. it would explain why Richard didn't seem particularly concerned with gears-level failure modes or competing models or the like, and why his line of questioning seemed mostly insensitive to the object-level details of what "advance predictions" look like, why that matters, etc. I do note that Richard actively denied being motivated by this style of reasoning later on in the dialogue, however, which is why I still have substantial uncertainty about his position.)

> Like, there's a certain kind of theory/model which generalizes well to many classes of new cases and makes nontrivial predictions in those new cases, and those kinds-of-theories/models have a pattern to them which is recognizable.

Could I ask you to say more about what you mean by "nontrivial predictions" in this context? It seems to me like this was a rather large sticking point in the discussion between Richard and Eliezer (that is, the question of whether expected utility theory--as a specific candidate for a "strongly generalizing theory"--produces "nontrivial predictions", where it seemed like Eliezer leaned "yes" and Richard leaned "no"), so I'd be interested in hearing more takes on what constitutes "nontrivial predictions", and what role said (nontrivial) predictions play in making a theory more convincing (as compared to other factors such as e.g. elegance/parsimony/[the pattern John talks about which is recognizable]).

Of course, I'd be interested in hearing what Richard thinks of the above as well.

[Note: Still speaking from my Eliezer model here, in the sense that I am making claims which I do not myself necessarily endorse (though naturally I don't anti-endorse them either, or else I wouldn't be arguing them in the first place). I want to highlight here, however, that to the extent that the topic of the conversation moves further away from things I have seen Eliezer talk about, the more I need to guess about what I think he would say, and at some point I think it is fair to describe my claims as neither mine nor (any model of) Eliezer's, but instead something like my extrapolation of my model of Eliezer, which may not correspond at all to what the real Eliezer thinks.]

> > To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?
>
> The sense in which it's still myopic is that it's non-deceptive, which is the only sense that we actually care about.
>
> > it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper
>
> The safety improvement that I'm claiming is that it wouldn't be deceptive. What is the mechanism by which you think a myopic agent would end up acting deceptively?

If the underlying process your myopic agent was trained to imitate would (under some set of circumstances) be incentivized to deceive you, and the myopic agent (by hypothesis) imitates the underlying process to sufficient resolution, why would the deceptive behavior of the underlying process not be reflected in the behavior of the myopic agent?

Conversely, if the myopic agent does not learn to imitate the underlying process to sufficient resolution that unwanted behaviors like deception start carrying over, then it is very likely that the powerful consequentialist properties of the underlying process have not been carried over, either. This is because (on my extrapolation of Eliezer's model) deceptive behavior, like all other instrumental strategies, arises from consequentialist reasoning, and is deeply tied to such reasoning in a way that is not cleanly separable—which is to say, by default, you do not manage to sever one without also severing the other.

Again, I (my model of Eliezer) do not think the "deep tie" in question is necessarily indissoluble; perhaps there is some sufficiently clever method which, if used, would successfully filter out the "unwanted" instrumental behavior ("deception", in your terminology) from the "wanted" instrumental behavior (planning, coming up with strategies, in general being an effective agent in the real world). But this distinction between "wanted" and "unwanted" is not a natural distinction; it is, in fact, a distinction highly entangled with human concepts and human values, and any "filter" that selects based on said distinction will need to be of similar complexity. (Of identical complexity, in fact, to the whole alignment problem.) "Simple" filters like the thing you are calling "myopia" definitely do not suffice to perform this function.

I'd be interested in hearing which aspect(s) of the above model you disagree with, and why.

It still doesn't seem to me like you've sufficiently answered the objection here.

> I tend to think that HCH is not dangerous, but I agree that it's likely insufficiently capable. To solve that problem, we have to go to a myopic objective that is more powerful.

What if any sufficiently powerful objective is non-myopic? Or, phrased differently but equivalently: what if myopia is a property only of very specific toy objectives, rather than a widespread property of objectives in general (including objectives that humans would intuitively consider to be aimed at accomplishing things "in the real world")?

It seems to me that Eliezer has presented quite compelling arguments that the above is the case, and on a first pass it doesn't look to me like you've countered those arguments.

> But that's not that hard, and there's lots of them that can incentivize good non-myopic behavior that are safe to optimize for as long as the optimizer is myopic.

How does a "myopic optimizer" successfully reason about problems that require non-myopic solutions, i.e. solutions whose consequences extend past whatever artificial time-frame the optimizer is being constrained to reason about? To the extent that it does successfully reason about those things in a non-myopic way, in what remaining sense is the optimizer myopic?

> AI safety via market making is one example, but it's a very tricky one, so maybe not the best candidate for showcasing what I mean. In particular, I suspect that a myopic optimizer given the goal of acting as a trader or market-maker in such a setup wouldn't act deceptively, though I suspect they would Goodhart on the human approval signal in unsafe ways (which is less bad of a problem than deception, and could potentially be solved via something like my step (6), but still a pretty serious problem).
>
> Maybe a better example would be something like imitative generalization. If imitating HCH is insufficient, we can push further by replacing “imitate HCH” with “output the hypothesis which maximizes HCH's prior times the hypothesis's likelihood,” which gets you substantially farther and I think is still safe to optimize for given a myopic optimizer (though neither are safe for a non-myopic optimizer).
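(Restating the quoted objective in symbols, purely for concreteness and on my own reading: the optimizer is asked for $h^* = \operatorname{arg\,max}_h \, P_{\mathrm{HCH}}(h) \cdot P(D \mid h)$, i.e. a hypothesis search scored by HCH's prior times the likelihood of the observed data $D$.)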

Both of these seem to be examples of solutions that simply push the problem back a step, rather than seeking to eliminate it directly. My model of Eliezer would call this attempting to manipulate confusion, and caution that, although adding more gears to your perpetual motion machine might make the physics-violating component harder to pick out, it does not change the fact that somewhere within the model is a step that violates physics.

In this case, it seems as though all of your proposals are of the form "Train your model to imitate some process X (where X is non-myopic and potentially unsafe), while adding incentives in favor of myopic behavior during training." To which my model of Eliezer replies, "Either your model will end up myopic, and not powerful enough to capture the part of X that actually does the useful work we are interested in, or it ends up imitating X in full (non-myopic) generality, in which case you have not managed to achieve any kind of safety improvement over X proper."

It seems to me that to usefully refute this, you need to successfully argue against Eliezer's background premise here—the one about power and non-myopic reasoning going hand-in-hand in a deep manner that, while perhaps circumventable via similarly deep insights, is not patchable via shallow methods like "Instead of directly using dangerous process X, we will imitate X, thereby putting an extra layer of abstraction between ourselves and the danger." My current impression is that you have not been arguing against this background premise at all, and as such I don't think your arguments hit at the core of what makes Eliezer doubt your proposals.
