This is really cool work! Congratulations!
Besides the LLM-related work, it also reminds me somewhat of dynamic prompting in Stable Diffusion, where part of the prompt is changed after a number of steps to achieve a mixture of prompt1 and prompt2.
What's the TL;DR for the Vicuna 13B experiments?
This is a t-SNE I made a couple of years ago of the GloVe word vectors for numbers. So it's not surprising that there is a "number sense", though I am definitely surprised by how good some of the results are.
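In case anyone wants to reproduce something like that plot, here is a rough sketch of the pipeline, with random vectors standing in for the actual GloVe embeddings (loading real GloVe vectors, e.g. via gensim, is left out):

```python
import numpy as np
from sklearn.manifold import TSNE

number_words = [str(i) for i in range(20)]  # "0" .. "19"

# Stand-in for real GloVe vectors; substitute actual embedding lookups here.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(number_words), 50))

# Project to 2-D; note that perplexity must be smaller than the sample count.
embedding = TSNE(n_components=2, perplexity=5.0, random_state=0).fit_transform(vectors)
print(embedding.shape)  # (20, 2)
```

With real GloVe vectors instead of the random stand-ins, nearby points in the 2-D embedding correspond to numbers the model treats as similar.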
Fun fact: fitting the Iris dataset with a tiny neural network can be surprisingly fickle.
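To illustrate with scikit-learn (the three-unit hidden layer and the iteration budget are arbitrary choices of mine): the same tiny network trained with different random seeds can end up with quite different training accuracies.

```python
import warnings
from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(5):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        # Tiny network: a single hidden layer with 3 units.
        clf = MLPClassifier(hidden_layer_sizes=(3,), max_iter=500, random_state=seed)
        clf.fit(X, y)
    accuracies.append(clf.score(X, y))

print(accuracies)  # the spread across seeds shows how seed-sensitive the fit is
```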
The point I'm making is that the human example tells us that:
If we first realize that we can't code up our values, we conclude that alignment is hard. Then, when we realize that mesa-optimisation is a thing, we shouldn't update towards "alignment is even harder". We should update in the opposite direction.
Because the human example tells us that a mesa-optimiser can reliably point to a complex thing even if the optimiser points to only a few crude things.
But I only ever see these three points (the human example, our inability to code up our values, mesa-optimisation) used separately to argue that "alignment is even harder than previously thought". Taken together, they just don't paint that picture.
Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.
Humans haven't been optimized to pursue inclusive genetic fitness for very long, because humans haven't been around for very long. Instead they inherited the crude heuristics pointing towards inclusive genetic fitness from their cognitively much less sophisticated predecessors. And those still kinda work!
If we are still around in a couple of million years I wouldn't be surprised if there was inner alignment in the sense that almost all humans in almost all practically encountered environments end up consciously optimising inclusive genetic fitness.
More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
Generally, I think that people draw the wrong conclusions from mesa-optimisers and the examples of human evolutionary alignment.
Saying that we would like to solve alignment by specifying exactly what we want and then having the AI learn exactly what we want is like saying that we would like to solve transportation by inventing teleportation. Yeah, that would be nice, but unfortunately it seems you will have to move through space instead.
The conclusion we should take from the concept of mesa-optimisation isn't "oh no alignment is impossible", that's equivalent to "oh no learning is impossible". But learning is possible. So the correct conclusion is "alignment has to work via mesa-optimisation".
Because alignment in the human examples (i.e. human alignment to evolution's objective and humans alignment to human values) works by bootstrapping from incredibly crude heuristics. Think three dark patches for a face.
Humans are mesa-optimized to adhere to human values. If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.
I mean even more so ...
To me the human examples suggest that there has to be a possibility to get from gesturing at what we want to getting what we want. And I think we can gesture a lot better than evolution! Well, at least using much more information than 3.2 billion base pairs.
If alignment has to be a bootstrapped open ended learning process there is also the possibility that it will work better with more intelligent systems or really only start working with fairly intelligent systems.
Maybe bootstrapping with cake, kittens and cuddles will still get us paperclipped, I don't know. It certainly seems awfully easy to just run straight off a cliff. But I think looking at the only known examples of alignment of intelligences does allow us more optimistic takes than are prevalent on this page.
Much better now!
The date published vs. date trained distinction was on my mind because of Gopher. It seemed to me very relevant that DeepMind trained a significantly larger model within basically half a year of the publication of GPT-3.
In addition to Google Brain also being quite coy about their 100B+ model, it made me update a lot in the direction of "the big players will replicate any new breakthrough very quickly but not necessarily talk about it."
To be clear, I also think it probably doesn't make sense to include this information in the list, because it is too rarely relevant.
Some ideas for improvements:
The ability to sort by model size etc. would be nice. Currently sorting is alphabetical only.
Also, the columns with long textual information should be further to the right, and the tighter, more informative numerical columns further to the left (the "deep learning" column, for example, says the same thing in almost all rows and is not very informative). Ideally the most relevant information would be on the initial page without scrolling.
"Date published" and "date trained" can be quite different. Maybe worth including the latter?