Oliver Habryka

Running Lightcone Infrastructure, which runs LessWrong. You can reach me at habryka@lesswrong.com


Promoted to curated: I think it's pretty likely a huge fraction of the value of the future will be determined by the question this post is trying to answer, which is how much game theory produces natural solutions to coordination problems, or more generally how much better we should expect systems to get at coordination as they get smarter.

I don't think I agree with everything in the post, and a few of the characterizations of updatelessness seem a bit off to me (which Eliezer points to a bit in his comment), but I still overall found this post quite interesting and valuable. It helped me think about which coordination problems we have a more mechanistic understanding of, in terms of how being smarter and better at game theory might help, and which ones we don't have good mechanisms for, which IMO is quite an important question.

Promoted to curated: I disagree with a bunch of the approach outlined in this post, but I nevertheless found this framing quite helpful for thinking about various AI X-risk related outcomes and plans. I also really appreciate the way this post is written, being approachable while maintaining a relatively high degree of precision in talking about these issues.

I also think the cognition in a weather model is very alien. It's less powerful and general, so I think the error of applying something like the Shoggoth image to that (or calling it "alien") would be that it would imply too much generality, but the alienness seems appropriate. 

If you somehow had a mind that was constructed on the same principles as weather simulations, or your laptop, or your toaster (whatever that would mean, I feel like the analogy is fraying a bit here), that would display similar signs of general intelligence as LLMs, then yeah, I think analogizing them to alien/eldritch intelligences would be pretty appropriate.

It is a very common (and even to me tempting) error to see a system with the generality of GPT-4, trained on human imitation, and imagine that it must internally think like a human. But my best guess is that is not what is going on, and in some sense it is valuable to be reminded that the internal cognition going on in GPT-4 is probably about as far from what is going on in a human brain as a weather simulation is from what is going on in a human trying to forecast the weather (de facto I think GPT-4 is somewhere in between, since I do think the imitation learning creates some stronger structural similarities between humans and LLMs, but I think overall being reminded of this relevant dimension of alienness pays off in anticipated experiences a good amount).

These are a lot of questions, most of which I'd guess are rhetorical, so I'm not sure which ones you are actually interested in getting an answer on. Most of the specific questions I would answer with "no", in that they don't seem to capture what I mean by "alien", or feel slightly strawman-ish.

Responding at a high-level: 

  • There are a lot of experiments that seem like they shed light on the degree to which cognition in AI systems is similar to human or animal cognition. Some examples: 
    • Does the base model pass a Turing test?
    • Does the performance distribution of the base model on different tasks match the performance distribution of humans?
    • Does the generalization and learning behavior of the base model match how humans learn things?
      • When trained using RL on things like game-environments (after pre-training on a language corpus), does the system learn at similar rates and plateau at similar skill levels as human players?
  • There are a lot of structural and algorithmic properties that could match up between human and LLM systems: 
    • Do they interface with the world in similar ways?
    • Do they require similar amounts and kinds of data to learn the same relationships?
    • Do the low-level algorithmic properties of how information is stored and processed look similar between the two systems?
  • A lot more stuff, but I am not sure how useful going into a long list here is. At least to me it feels like a real thing, and different observations would change the degree to which I would describe a system as alien.

I think the exact degree of alienness is really interesting and one of the domains where I would like to see more research. 

For example, a bunch of the experiments I would most like to see, that seem helpful with AI Alignment, are centered on better measuring the performance distribution of transformer architectures on tasks that are not primarily human imitation, so that we could better tell which things LLMs have a much easier time learning than humans (currently even if a transformer could relatively easily reach vastly superhuman performance at a task with more specialized training data, due to the structure of the training being oriented around human imitation, observed performance at the task will cluster around human level, but seeing where transformers could reach vast superhuman performance would be quite informative on understanding the degree to which its cognition is alien).

So I don't consider the exact nature and degree of alienness a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and that is a thing I benefit from reminding myself of frequently when making predictions about the behavior of LLM systems.

The point was that both images are stupid and (in many places) unsupported by evidence, but that LW-folk would be much more willing to criticize the friendly-looking one while making excuses for the scary-looking one. (And I think your comment here resolves my prediction to "correct.")

(This is too gotcha shaped for me, so I am bowing out of this conversation)

I think I communicated my core point. I think it's a good image that gets an important insight across, and don't think it's "propaganda" in the relevant sense of the term. Of course anything that's memetically adaptive will have some edge-cases that don't match perfectly, but I am getting a good amount of mileage out of calling LLMs "Shoggoths" in my own thinking and think that belief is paying good rent.

If you disagree with the underlying cognition being accurately described as alien, I can have that conversation, since it seems like maybe the underlying crux, but your response above seems like it's taking it as a given that I am "making excuses", and is doing a gotcha-thing which makes it hard for me to see a way to engage without further having my statements be taken as confirmation of some social narrative. 

However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It's at this point that I'd like to point out the simple, obvious fact that "we don't actually know how these models work, and we definitely don't know that they're creepy and dangerous on the inside."

That's just one of many shoggoth memes. This is the most popular one: 

David Weiner 📼🔪🛸 on X: "“The Shoggoth is a potent ...

The shoggoth here is not particularly exaggerated or scary.

Responding to your suggested alternative that is trying to make a point: it seems like the image fails to be accurate, or at least it seems to me to convey things we confidently know are false. It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass. Describing them as a Shoggoth seems pretty reasonable to me, as far as alien intelligences go (other common imagery for alien minds includes insects or ghosts/spirits with distorted forms, which would evoke similar emotions).

Your picture doesn't get any of that across. It doesn't communicate that the base model does not at all behave like a human would (though it would have isolated human features, which is what the eyes usually represent). It just looks like a cute plushy, but "a cute plushy" doesn't capture any of the experiences of interfacing with a base model (and I don't think the image conveys multiple layers of some kind of training, though that might just be a matter of effort).

I think the Shoggoth meme is pretty good pedagogically. It captures a pretty obvious truth, which is that base models are really quite alien to interface with, that we know that RLHF probably does not change the underlying model very much, but that as a result we get a model that does have a human interface and feels pretty human to interface with (but probably still performs deeply alien cognition behind the scenes). 

This seems like good communication to me. Some Shoggoth memes are cute, some are exaggerated to be scary, which also seems reasonable to me since alien intelligences seem like they are pretty scary, but it's not a necessary requirement of the core idea behind the meme.

Promoted to curated: Overall this seems like an important and well-written paper, that also stands out for its relative accessibility for an ML-heavy AI Alignment paper. I don't think it's perfect, and I do encourage people to read the discussion on the post for various important related arguments, but it overall seems like a quite good paper that starts to bridge the gap between prosaic work and concerns that have historically been hard to empirically study, like deceptive alignment.

Hmm, I still don't get it. 

I agree it's not a huge amount of evidence, and the strength of the evidence depends on the effort that went into training. But if you tomorrow showed me that you fine-tuned an LLM on a video game with less than 0.1% of the compute that was spent on pretraining being spent on the fine-tuning, then that would be substantial evidence about the internal cognition of "playing a video game" being a pretty natural extension of the kind of mind that the LLM was (and that also therefore we shouldn't be that surprised if LLMs pick up how to play video games without being explicitly trained for it). 

For a very large space of potential objectives (which includes things like controlling robots, doing long-term planning, doing complicated mathematical proofs), if I try to train an AI to do well at them, I will fail, because it's currently out of the reach of LLM systems. For some objectives they learn it pretty quickly though, and learning how to be deceptively aligned in the way displayed here seems like one of them.

I don't think it's overwhelming evidence, or like, I think it's a lot of evidence but it's for a belief that I think both you and I already had (that it doesn't seem unnatural for an LLM to learn something that looks as much like deceptive alignment as the behavior displayed in this paper). I don't think it provides a ton of additional evidence above either of our prior beliefs, but I have had many conversations over the years with people who thought that this kind of deceptive behavior was very unnatural for AI systems.

I don't super understand the relevance of the linked quote and image. I can try harder, but it seemed best to just ask you to clarify and spell out the argument a bit more.

These kinds of overview posts are very valuable, and I think this one is as well. I think it was quite well executed, and I've seen it linked a lot, especially to newer people trying to orient to the state of the AI Alignment field, and the ever growing number of people working in it. 
