1a3orn - AI Alignment Forum

I agree this can be initially surprising to non-experts!

I just think this point about the amorality of LLMs is much better communicated by saying "LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style."

Than to say "LLMs are like alien shoggoths."

Like it's just a better model to give people.

TurnTrout's shortform feed

1a3orn 6mo 11

I like a lot of these questions, although some of them give me an uncanny feeling akin to "wow, this is a very different list of uncertainties than I have." I'm sorry that my initial list of questions was aggressive.

So I don't consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.

I'm not sure how they add up to alienness, though? They're about how we're different than models -- whereas the initial claim was that models are psychopathic, amoral, etc. If we say a model is "deeply alien" -- is that just saying it's different than us in lots of ways? I'm cool with that -- but the surplus negative valence involved in "LLMs are like shoggoths" versus "LLMs have very different performance characteristics than humans" seems to me pretty important.

Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don't call Python alien.
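For concreteness, the kind of loop in question might look like the sketch below. The function name `count_to` is just an illustrative choice, not anything from the discussion; the point is only that the mechanism (an integer counter stepped by the interpreter) has little in common with how a person counts.

```python
def count_to(n):
    """Return the integers 1..n, collected one at a time by an explicit loop."""
    counted = []
    for i in range(1, n + 1):  # range stops before n + 1, so n is included
        counted.append(i)
    return counted


print(count_to(10))
```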

This feels like reminding an economics student that the market solves things differently than a human -- which is true -- by saying "The market is like Baal."

Do they require similar amounts and kinds of data to learn the same relationships?

There is a fun paper on this you might enjoy. Obviously not a total answer to the question.

TurnTrout's shortform feed

1a3orn 6mo 104

performs deeply alien cognition

I remain unconvinced that there's a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, "nah, LLMs aren't deeply alien."

If LLM cognition was not "deeply alien" what would the world look like?

What distinguishing evidence does this world display, that separates us from that world?

What would an only kinda-alien bit of cognition look like?

What would a very human kind of cognition look like?

What different predictions does the world make?

Does alienness indicate that it is because the models, the weights themselves, have no "consistent beliefs" apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?

Is it because they "often spout completely non-human kinds of texts"? Is the Mersenne Twister deeply alien? What counts as "completely non-human"?

Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a "moral compass" apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?

Is it that the algorithms that we've found in DL so far don't seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able to be cracked open and show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?

Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation? Or if more realistic non-backprop potential brain algorithms actually... kind of just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or more similar research impact whether we thought brains were alien or not?

Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of prediction? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically-similar substrate?

Does every part of a system by itself need to fit into the average person's ontology for the total to not be deeply alien; do we need to be able to fit every part within a system into a category comprehensible by an untutored human in order to describe it as not deeply alien? Is anything in the world not deeply alien by this standard?

To re-question: What predictions can I make about the world because LLMs are "deeply alien"?

Are these predictions clear?

When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?

What kind of contexts does this "deeply alien" statement come up in? Are those contexts ones where people are trying to explain, or to persuade?

If I piled up all the useful terms that I know that help me predict how LLMs behave, would "deeply alien" be an empty term on top of these?

Or would it give me no more predictive value than "many behaviors of an LLM are currently not understood"?

AI as a science, and three obstacles to alignment strategies

1a3orn 9mo 1516

I agree that if you knew nothing about DL you'd be better off using that as an analogy to guide your predictions about DL than using an analogy to a car or a rock.

I do think a relatively small quantity of knowledge about DL screens off the usefulness of this analogy; that you'd be better off deferring to local knowledge about DL than to the analogy.

Or, what's more to the point -- I think you'd better defer to an analogy to brains than to evolution, because brains are more like DL than evolution is.

Combining some of yours and Habryka's comments, which seem similar.

The resulting structure of the solution is mostly discovered not engineered. The ontology of the solution is extremely unopinionated and can contain complicated algorithms that we don't know exist.

It's true that the structure of the solution is discovered and complex -- but the ontology of the solution for DL (at least in currently used architectures) is quite opinionated towards shallow circuits with relatively few serial ops. This is different than the bias for evolution, which is fine with a mutation that leads to 10^7 serial ops if its metabolic costs are low. So the resemblance seems shallow other than "solutions can be complex." I think to the degree that you defer to this belief rather than more specific beliefs about the inductive biases of DL you're probably just wrong.

There's a mostly unimodal and broad peak for optimal learning rate, just like for optimal mutation rate

As far as I know optimal learning rate for most architectures is scheduled, and decreases over time, which is not a feature of evolution so far as I am aware? Again the local knowledge is what you should defer to.

You are ultimately doing a local search, which means you can get stuck at local minima, unless you do something like increase your step size or increase the mutation rate

Is this a prediction that a cyclic learning rate -- that goes up and down -- will work out better than a decreasing one? If so, that seems false, as far as I know.
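To make the two shapes being contrasted concrete, here is a minimal sketch of a decreasing schedule (cosine decay) next to a cyclic one (a triangular wave). The function names, default values, and the choice of cosine and triangular forms are illustrative assumptions, not anything specified in the thread, and the sketch says nothing about which schedule trains better.

```python
import math


def cosine_decay(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Decreasing schedule: starts at lr_max and falls to lr_min by total_steps."""
    progress = min(step, total_steps) / total_steps  # clamp so lr stays at lr_min afterwards
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def triangular_cyclic(step, cycle_len, lr_max=1e-3, lr_min=1e-4):
    """Cyclic schedule: rises from lr_min to lr_max and back once every cycle_len steps."""
    pos = (step % cycle_len) / cycle_len  # position within the current cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)      # 0 at the cycle start, 1 at the midpoint
    return lr_min + (lr_max - lr_min) * tri


if __name__ == "__main__":
    for step in range(0, 101, 10):
        print(step, cosine_decay(step, 100), triangular_cyclic(step, 20))
```

Printing the two columns side by side shows the first only going down, while the second keeps returning to its peak, which is the "increase your step size" behaviour the analogy would seem to recommend.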

Grokking/punctuated equilibrium: in some circumstances applying the same algorithm for 100 timesteps causes much larger changes in model behavior / organism physiology than in other circumstances

As far as I know grokking is a non-central example of how DL works, and in evolution punctuated equilibrium is a result of the non-i.i.d. nature of the task, which is again a different underlying mechanism from DL. If you apply DL to non-i.i.d. problems then you don't get grokking, you just get a broken solution. This seems to round off to, "Sometimes things change faster than others," which is certainly true but not predictively useful, or in any event not a prediction that you couldn't get from other places.

Like, leaving these to the side -- I think the ability to post-hoc fit something is questionable evidence that it has useful predictive power. I think the ability to actually predict something else means that it has useful predictive power.

Again, let's take "the brain" as an example of something to which you could analogize DL.

There are multiple times that people have cited the brain as an inspiration for a feature in current neural nets or RL. CNNs, obviously; the hippocampus and experience replay; randomization for adversarial robustness. You can match up interventions that cause learning deficiencies in brains to similar deficiencies in neural networks. There are verifiable, non-post hoc examples of brains being useful for understanding DL.

As far as I know -- you can tell me if there are contrary examples -- there are obviously more cases where inspiration from the brain advanced DL or contributed to DL understanding than cases where inspiration from evolution did. (I'm aware of zero of the latter, but there could be some.) Therefore it seems much more reasonable to analogize from the brain to DL, and to defer to it as your model.

I think in many cases it's a bad idea to analogize from the brain to DL! They're quite different systems.

But they're more similar than evolution and DL, and if you'd not trust the brain to guide your analogical a-theoretic low-confidence inferences about DL, then it makes more sense to not trust evolution for the same.

AI as a science, and three obstacles to alignment strategies

1a3orn 9mo 8-4

Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.

(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for "tasty" foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)

I simply do not understand why people keep using this example.

I think it is wrong -- evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process for how we actually start to like ice-cream -- namely, we eat it, and then we get a reward, and that's why we like it -- then the world looks a lot less hostile, and misalignment a lot less likely.

But given that this example is so controversial, even if it were right why would you use it -- at least, why would you use it if you had any other example at all to turn to?

Why push so hard for "natural selection" and "stochastic gradient descent" to be beneath the same tag of "optimization", and thus to be able to infer things about the other from the analogy? Have we completely forgotten that the glory of words is not to be expansive, and include lots of things in them, but to be precise and narrow?

Does evolution ~= AI have predictive power apart from doom? I have yet to see how natural selection helps me predict how any SGD algorithm works. It does not distinguish between Adam and AdamW. As far as I know it is irrelevant to Singular Learning Theory or NTK or anything else. It doesn't seem to come up when you try to look at NN biases. If it isn't an illuminating analogy anywhere else, why do we think the way it predicts doom to be true?
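As an illustration of the sort of local detail meant by "Adam and AdamW", here is a minimal single-parameter sketch of the difference: Adam with an L2 penalty folds the decay term into the gradient before the adaptive normalization, while AdamW applies the decay directly to the weight, outside that normalization. The function `adam_like_step` and its argument names are hypothetical, written for this sketch rather than taken from any library, and the hyperparameter defaults are just commonly used values.

```python
import math


def adam_like_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
                   eps=1e-8, weight_decay=0.0, decoupled=False):
    """One update of a scalar parameter.

    decoupled=False: Adam with L2 regularization (decay added to the gradient).
    decoupled=True:  AdamW-style decoupled weight decay (decay applied to theta).
    """
    theta_prev = theta
    if not decoupled:
        grad = grad + weight_decay * theta  # decay passes through the moment estimates
    m = b1 * m + (1.0 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1.0 - b2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)             # bias correction; t counts from 1
    v_hat = v / (1.0 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * weight_decay * theta_prev  # decay bypasses normalization
    return theta, m, v


if __name__ == "__main__":
    # Zero loss gradient, so only the weight-decay term moves the parameter.
    l2_theta, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1)
    w_theta, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1,
                                   decoupled=True)
    print(l2_theta, w_theta)
```

With a zero loss gradient, the L2 variant moves the weight by roughly `lr` regardless of the decay coefficient (the normalization rescales it), while the decoupled variant moves it by `lr * weight_decay * theta`.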
