## AI ALIGNMENT FORUMAF

Ulisse Mini

Born too late to explore Earth; born too early to explore the galaxy; born just the right time to save humanity.

# Sequences

Alignment stream of thought

# Wiki Contributions

Sorted by

Was considering saving this for a followup post but it's relatively self-contained, so here we go.

Why are huge coefficients sometimes okay? Let's start by looking at norms per position after injecting a large vector at position 20.

This graph is explained by LayerNorm. Before using the residual stream we perform a LayerNorm

# transformer block forward() in GPT2
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))

If x has very large magnitude, then the block doesn't change it much relative to its magnitude. Additionally, attention is ran on the normalized x meaning only the "unscaled" version of x is moved between positions.

As expected, we see a convergence in probability along each token position when we look with the tuned lens.

You can see how for positions 1 & 2 the output distribution is decided at layer 20, since we overwrote the residual stream with a huge coefficient all the LayerNorm'd outputs we're adding are tiny in comparison, then in the final LayerNorm we get ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff).

Are most uncertainties we care about logical rather than informational? All empirical ML experiments are pure computations a Bayesian superintelligence could do in its head. How much of our uncertainty comes from computational limits in practice, versus actual information bottlenecks?

Random thought: Perhaps you could carefully engineer gradient starvation in order to "avoid generalizing" and defeat the Discrete modes of prediction example. You'd only need to delay it until reflection, then the AI can solve the successor AI problem.

In general: hack our way towards getting value-preserving reflectivity before values drift from "Diamonds" -> "What's labeled as a diamond by humans". (Replacing with "Telling the truth", and "What the human thinks is true" respectively).

I think school is huge in preventing people from becoming smart and curious. I spent 1-2years where I hardly studied at all and mostly played videogames - I wish I hadn't wasted that time, but when I quit I did so of my own free will. I think there's a huge difference between discipline imposed from the outside vs the inside, and getting to the latter is worth a lot. (though I wish I hadn't wasted all that time now haha)

I'm unsure which parts of my upbringing were cruxes for unschooling working. You should probably read a book or something rather than taking my (very abnormal) opinion. I just know how it went for me :)

Epistemic status: personal experience.

I'm unschooled and think it's clearly better, even if you factor in my parents being significantly above average in parenting. Optimistically school is babysitting, people learn nothing there while wasting most of their childhood. Pessimistically it's actively harmful by teaching people to hate learning/build antibodies against education.

Here's a good documentary made by someone who's been in and out of school. I can't give detailed criticism since I (thankfully) never had to go to school.

EDIT: As for what the alternative should be, I honestly don't know. Shifting equilibria is hard, though it's easy to give better examples (e.g. dath ilan, things in the documentary I linked.) For a personal solution: Homeschool your kids.

If the title is meant to be a summary of the post, I think that would be analogous to someone saying "nuclear forces provide an untapped wealth of energy". It's true, but the reason the energy is untapped is because nobody has come up with a good way of tapping into it.

The difference is people have been trying hard to harness nuclear forces for energy, while people have not been trying hard to research humans for alignment in the same way. Even relative to the size of the alignment field being far smaller, there hasn't been a real effort as far as I can see. Most people immediately respond with "AGI is different from humans for X,Y,Z reasons" (which are true) and then proceed to throw out the baby with the bathwater by not looking into human value formation at all.

Planes don't fly like birds, but we sure as hell studied birds to make them.

If you come up with a strategy for how to do this then I'm much more interested, and that's a big reason why I'm asking for a summary since I think you might have tried to express something like this in the post that I'm missing.

This is their current research direction, The shard theory of human values which they're currently making posts on.

I think even without point #4 you don't necessarily get an AI maximizing diamonds. Heuristically, it feels to me like you're bulldozing open problems without understanding them (e.g. ontology identification by training with multiple models of physics, getting it not to reward-hack by explicit training, etc.) all of which are vulnerable to a deceptively aligned model (just wait till you're out of training to reward-hack). Also, every time you say "train it by X so it learns Y" you're assuming alignment (e.g. "digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion")