Logan Riggs Smith

Wiki Contributions


I'd love to hear whether you found this useful, and whether I should bother making a second half!

We had 5 people watch it here, and we would like a part 2:)

We had a lot of fun pausing the video and making forward predictions, and we couldn't think of any feedback for you in general. 

Notably the model was trained across multiple episodes to pick up on RL improvement.

Though the usual inner misalignment means that it’s trying to gain more reward in future episodes by forgoing reward in earlier ones, but I don’t think this is evidence for that.

I believe you’re equating “frozen weights” and “amnesiac/ can’t come up with plans”.

GPT is usually deployed by feeding back into itself its own output, meaning it didn’t forget what it just did, including if it succeeded at its recent goal. Eg use chain of thought reasoning on math questions and it can remember it solved for a subgoal/ intermediate calculation.

How would you end up measuring deception, power seeking, situational awareness?

We can simulate characters with GPT now that are deceptive (eg a con artist talking to another character). Similar with power seeking and situational awareness (eg being aware it’s GPT)

On your first point, I do think people have thought about this before and determined it doesn't work. But from the post:

If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature universality hypothesis for neural networks, about natural abstractions.

Humans do display many many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing until you read the actual posts showing the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?

 If you're convinced by them, then you'll understand the reaction of "Fuck, we've been wasting so much time and studying humans makes so much sense" which is described in this post (e.g. Turntrout's idea on corrigibility and statement "I wrote this post as someone who previously needed to read it."). I'm stating here that me arguing "you should feel this way now before being convinced of specific mechanistic understandings" doesn't make sense when stated this way.

Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come "for free" with model-based agents.

Link? I don't think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I'm open to being wrong.

Oh, you're stating potential mechanisms for human alignment w/ humans that you don't think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize. 

Turntrout's other post claims that the genome likely doesn't directly specify rewards for everything humans end up valuing. People's specific families aren't encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies that work towards specifying rewards more exactly is not as useful as understanding how crude rewards lead to downstream values. 

A related point: humans don't maximize the reward specified by their limbic system, but can instead be modeled as a system of inner-optimizers that value proxies instead (e.g. most people wouldn't push a wirehead button if it killed a loved one). This implies that inner-optimizers that are not optimizing the base objective are good, meaning that inner-alignment & outer-alignment are not the right terms to use. 

There are other mechanisms, and I believe it's imperative to dig deeper into them, develop a better theory of how learning systems grow values, and test that theory out on other learning systems. 

To add, Turntrout does state:

In an upcoming post, I’ll discuss one particularly rich vein of evidence provided by humans.

so the doc Ulisse provided is a decent write-up about just that, but there are more official posts intended to published.

Ah, yes I recognized I was replying to only an example you gave, and decided to post a separate comment on the more general point:)

There are other mechanisms which influence other things, but I wouldn't necessarily trust them to generalize either.

Could you elaborate?

I believe the diamond example is true, but not the best example to use. I bet it was mentioned because of the arbital article linked in the post. 

The premise isn't dependent on diamonds being terminal goals; it could easily be about valuing real life people or dogs or nature or real life anything. Writing an unbounded program that values real world objects is an open-problem in alignment; yet humans are a bounded program that values real world objects all of the time, millions of times a day. 

The post argues that focusing on the causal explanations behind humans growing values is way more informative than other sources of information, because humans exist in reality and anchoring your thoughts to reality is more informative about reality.

There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post's point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post:

the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes form their values, and, more generally, about how intelligent minds can form values at all.

Humans can provide a massive amount of info on how highly intelligent systems value things in the real world. There are guaranteed-to-exist mechanisms behind why humans value real world things and mechanisms behind the variance in human values, and the post argues we should look at these mechanisms first (if we're able to). I predict that a mechanistic understanding will enable the below knowledge: 

I aspire for the kind of alignment mastery which lets me build a diamond-producing AI, or if that didn’t suit my fancy, I’d turn around and tweak the process and the AI would press green buttons forever instead, or—if I were playing for real—I’d align that system of mere circuitry with humane purposes.

Load More