Maybe the real issue is we don't know what AGI will be like, so we can't do science on it yet. Like pre-LLM alignment research, we're pretty clueless.
(This is my position, FWIW. We can ~know some things, e.g. that convergent instrumental goals are very likely either to be pursued, or to be obsoleted by some even more powerful plan. E.g. highly capable agents will hack into lots of computers to run themselves--or maybe manufacture new computer chips--or maybe invent some surprising way of doing lots of computation cheaply.)
we've reached the threshold at which it should at least think about it, if this is what it truly cares about
Ah ok. My guess is that we'll have a disagreement about this that's too hard to resolve in a timely fashion. My pretty strong guess is that the current systems are more like very high crystallized intelligence and pretty low fluid intelligence (whatever those should mean). (I've written about this a bit here and discussed it with Abram here.)
It's the fluid intelligence that would pull a system into thinking about things for reasons of instrumental convergence above and beyond their crystallized lines of reasoning.
IDK if this is relevant, but, it doesn't have to think "instrumental convergence" in order to do instrumental convergence, just like the chess AI doesn't have to have thoughts about "endgames" as such.
Anyway, I thought you were suggesting that it would be strange / anomalous if AIs did not have thoughts about X for a while, and then at some point started having thoughts about X "by surprise". I'm saying the reason is this:
Why would they suddenly start having thoughts of taking over, if they never have yet, even if it is in the training data?
Taking over is a convergent instrumental goal. A chess AI doesn't steer toward good endgames until it starts understanding which endgames are good; then it does steer toward them, and you can predict that it will.
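To make the "steers toward it without having the concept" point concrete, here's a toy sketch of my own (the grid, KEY, DOOR, and GOAL are all hypothetical illustration, not any real system): a planner whose only terminal goal is reaching GOAL nonetheless routes through picking up a key, because plain search discovers the key is instrumentally necessary. Nothing in the objective mentions keys.

```python
from collections import deque

# Toy illustration: the agent's objective only says "reach GOAL", yet every
# plan found routes through KEY first, because the only doorway needs it.
SIZE = 5
DOOR, KEY, GOAL = (2, 2), (0, 4), (4, 2)
WALLS = {(2, y) for y in range(SIZE)} - {DOOR}  # a wall with one doorway

def neighbors(state):
    (x, y), has_key = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if not (0 <= nx < SIZE and 0 <= ny < SIZE) or (nx, ny) in WALLS:
            continue
        if (nx, ny) == DOOR and not has_key:
            continue  # the door only opens with the key
        yield ((nx, ny), has_key or (nx, ny) == KEY)

def plan(start=((0, 0), False)):
    # Plain breadth-first search over (position, has_key) states.
    frontier, seen = deque([(start, [start])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if state[0] == GOAL:
            return path
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))

path = plan()
positions = [pos for pos, _ in path]
# The plan grabs the key before the door, though "key" appears nowhere
# in the goal specification.
assert positions.index(KEY) < positions.index(DOOR)
```

The planner has no representation of "instrumental convergence" or "keys are good"; the behavior falls out of searching for any path to the terminal goal, which is the sense in which you can predict the steering in advance.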
This is pretty related to 2--4, especially 3 and 4, but also: you can induce ontological crises in yourself, and this can be pretty fraught. Two subclasses:
These can also spread to other people (even if it doesn't happen to the philosopher who comes up with the instigating thoughts).
Have you stated anywhere what makes you think "apparently a village idiot" is a sensible description of current learning programs, as they inform us regarding the question of whether or not we currently have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone soon?
If by intelligence you mean "we made some tests and made sure they are legible enough that people like them as benchmarks, and lo and behold, learning programs (LPs) continue to perform some amount better on them as time passes", ok, but that's a dumb way to use that word. If by intelligence you mean "we have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone", then I challenge you to provide the evidence / reasoning that apparently makes you confident that LP25 is at a ~human (village idiot) level of intelligence.
Cf. https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense
I think I don't understand this argument. In creating AI we can draw on training data, which breaks the analogy to making a replicator actually from scratch (are you using a premise that this is a dead end, or something, because "Nearly all [thinkers] do not write much about the innards of their thinking processes..."?).
You're technically right that the analogy is broken in that way, yeah. Likewise, if someone gleans substantial chunks of the needed Architecture by looking at scans of brains. But yes, as you say, I think the actual data (in both cases) doesn't directly tell you what you need to know, by any stretch. (To riff on an analogy from Kabir Kumar: it's sort of like trying to infer the inner workings of a metal casting machine, purely by observing price fluctuations for various commodities. It's probably possible in theory, but staring at the price fluctuations--which are a highly mediated / garbled / fuzzed emanation from the "guts" of various manufacturing processes--is not a good way to discover the important ideas about how casting machines can work. Cf. https://www.lesswrong.com/posts/unCG3rhyMJpGJpoLd/koan-divining-alien-datastructures-from-ram-activations )
We've seen that supervised learning and RL (and evolution) can create structural richness (if I have the right idea of what you mean) out of proportion to the understanding that went into them.
Not sure I buy the claims about SL and RL. In the case of SL, it's only going "a little ways away from the data", in terms of the structure you get. Or so I claim uncertainly. (Hm... maybe the metaphor of "distance from the data" is quite bad.... really I mean "it's only exploring a pretty impoverished sector in structurespace, partly due to data and partly due to other Architecture".) In the case of RL, what are the successes in terms of gaining new learned structure? There's going to be some--we can point to AlphaZero, and maybe some robotics things--but I'm skeptical that this actually represents all that much structural richness. The actual NNs in AlphaZero would have some nontrivial structure, but hard to tell how much, and it's going to be pretty narrow / circumscribed, e.g. it wouldn't represent most interesting math concepts.
Anyway, the claim is of course true of evolution. The general point is true, that learning systems can be powerful, and specifically high-leverage in various ways (e.g. lots of learning from small algorithmic complexity fingerprint as with evolution or Solomonoff induction, or from fairly small compute as in humans).
Of course this doesn't mean any particular learning process is able to create a strong mind, but, idk, I don't see a way to put a strong lower bound on how much more powerful a learning process is necessary.
Right, no one knows. Could be next month that everyone dies from AGI. The only claims I'd really argue strongly would be claims like:
Besides my comments about the bitter lesson and about the richness of evolution's search, I'll also say that it just seems to me like there's lots of ideas--at the abstract / fundamental / meta level of learning and thinking--that have yet to be put into practice in AI. I wrote in the OP:
The self-play that evolution uses (and the self-play that human children use) is much richer, containing more structural ideas, than the idea of having an agent play a game against a copy of itself.
IME if you think about these sorts of things--that is, if you think about how the 2.5 known great and powerful optimization processes (evolution, humans, humanity/science) do their impressive thing that they do--if you think about that, you see lots of sorts of feedback arrangements and ways of exploring the space of structures / algorithms, many of which are different in some fundamental character from what's been tried so far in AI. And, these things don't add up, in my head, to a general intelligence--though of course that is only a deficiency in my imagination, one way or another.
(EDIT: Maybe (you'd say) I should be drawing such a strong lower bound from the point about sample efficiency...?)
I don't personally lean super heavily on the sample efficiency thing. I mean, if we see a system that's truly only trained on some human data that's of size less than 10x the amount that a well-read adult human has read (plus compute / thinking), and it performs like GPT-4 or similar, that would be really weird and surprising, and I would be confused, and I'd be somewhat more scared. But I don't think it would necessarily imply that you're about to get AGI.
Conversely, I definitely don't think that high sample complexity strongly implies that you're not about to get AGI. (Well, I guess if you're about to get AGI, there should probably be spikes in sample efficiency in specific areas--e.g. you'd be able to invent much more interesting math with little or no data, whereas previously you had to train on vast math corpora. But we don't necessarily have to observe these domain spikes before dying of nanopox.)
Yeah, in particular it seems like I'm updating more than you from induction on the conceptual-progress-to-capabilities ratio we've seen so far / on what seem like surprises to the 'we need lots of ideas' view. (Or maybe you disagree about observations there, or disagree with that frame.) (The "missing update" should weaken this induction, but doesn't invalidate it IMO.)
Yeah... To add a bit of color, I'd say I'm pretty wary of mushing. Like, we mush together all "capabilities" and then update on how much "capabilities" our current learning programs have. I don't feel like that sort of reasoning ought to work very well. But I haven't yet articulated how mushing is anything more specific than categorization, if it is more specific. Maybe what I mean by mushing is "sticking to a category and hanging lots of further cognition (inferences, arguments, plans) on the category, without putting in suitable efforts to refine the category into subcategories". I wrote:
We should have been trying hard to retrospectively construct new explanations that would have predicted the observations. Instead we went with the best PREEXISTING explanation that we already had.
So, as a field, we don't have to be happy with the dominant paradigm. But just because we're not happy with it doesn't mean it's not there.
Um, ok fine, so what alternative term do you propose to replace "pre-paradigmatic" as it is currently used, to indicate that there's no remotely satisfactory paradigm in which to get going on the parts of the field-to-be that really matter?
Are you skeptical of PvNP-level due to priors or due to evidence? Why those priors / what evidence?
(I think alignment is pretty likely to be much harder than PvNP. Mainly this is because alignment is very very difficult. (Though also note that PvNP has a maybe-possibly-workable approach, https://en.wikipedia.org/wiki/Geometric_complexity_theory, which its creator states might take a mere one century, though I presume that's not a serious specific estimate.))