Comments

systems that have a tendency to evolve towards a narrow target configuration set when started from any point within a broader basin of attraction, and continue to do so despite perturbations.


When determining whether a system "optimizes" in practice, the heavy lifting is done by the degree to which the set of states that the system evolves toward -- the suspected "target set" -- feels like it forms a natural class to the observer.

The issue here is that what the observer considers "a natural class" is informed by the data-distribution that the observer has previously been exposed to.
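To make the quoted definition concrete, here is a toy sketch (my own example, not from the original post): gradient descent on a one-dimensional bowl with random kicks, which ends up near the narrow target set from scattered starting points in a wide basin, despite the perturbations.

```python
# Toy "optimizing system" in the quoted sense (my own example): the dynamics
# pull the state toward the narrow target set {0}, and random perturbations
# don't stop it from getting there from anywhere in a broad basin of attraction.
import random


def step(x, perturbation_scale=0.5):
    """One time-step: gradient descent on f(x) = x**2, plus a random kick."""
    x = x - 0.1 * (2 * x)                                         # pull toward 0
    x += random.uniform(-perturbation_scale, perturbation_scale)  # perturbation
    return x


def ends_up_near_target(x0, steps=500, tolerance=2.0):
    """Crude check: does a trajectory started at x0 end up near the target set?"""
    x = x0
    for _ in range(steps):
        x = step(x)
    return abs(x) < tolerance


# Started from scattered points in a wide basin, the system ends up in a narrow
# neighbourhood of 0 -- the "target configuration set" an observer then has to
# judge as a natural class or not.
print(all(ends_up_near_target(x0) for x0 in [-50.0, -3.0, 7.0, 42.0]))
```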

I'm guessing this wouldn't work without causal attention masking.
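For context, a minimal sketch of what causal attention masking means here (toy NumPy version, not tied to any particular model in the thread): each position may only attend to itself and earlier positions, so nothing downstream of a token can leak back into its prediction.

```python
# Rough sketch of a causal attention mask: positions above the diagonal are
# blocked, so each position attends only to itself and earlier positions.
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)                 # raw attention scores

future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf                                    # block attention to future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)              # softmax over the allowed positions

print(np.round(weights, 2))                                 # zero weight above the diagonal
```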

Nope, that's an accurate representation of my views. If "postdiction" means "the machine has no influence over its sensory input", then yeah, that's a really good idea.

There are two ways to reduce prediction error: change your predictions, or act upon the environment to make your predictions come true. I think the agency of an entity depends on how much of each it does. An entity with no agency would have no influence over its sensory inputs, instead opting to update its beliefs in the face of prediction error. Taking agency away from AIs is a good idea for safety.
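A toy illustration of that trade-off (my own sketch; the `agency` knob is hypothetical, just to make the two routes explicit): at agency 0 the entity only updates its belief, at agency 1 it only pushes the world toward its belief.

```python
# Two routes to reducing prediction error: update the belief toward the world,
# or act on the world to push it toward the belief. `agency` mixes the two.
def reduce_prediction_error(belief, world, agency, lr=0.5):
    """agency = 0: only update beliefs. agency = 1: only act on the environment."""
    error = world - belief
    belief += (1 - agency) * lr * error   # change your predictions
    world -= agency * lr * error          # act to make your predictions come true
    return belief, world


belief, world = 0.0, 10.0
for _ in range(20):
    belief, world = reduce_prediction_error(belief, world, agency=0.0)
print(round(belief, 3), round(world, 3))   # passive entity: belief -> ~10, world untouched

belief, world = 0.0, 10.0
for _ in range(20):
    belief, world = reduce_prediction_error(belief, world, agency=1.0)
print(round(belief, 3), round(world, 3))   # fully agentic entity: world -> ~0, belief untouched
```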

Scott Alexander recently wrote about a similar quantity being encoded in humans through the 5-HT1A / 5-HT2A receptor activation ratio: link

A common trope is that brains are trained on prediction. Well, technically, I claim it would be more accurate to say that they’re trained on postdiction. Like, let’s say I open a package, expecting to see a book, but it’s actually a screwdriver. I’m surprised, and I immediately update my world-model to say "the box has a screwdriver in it".

I would argue that the book-expectation is a prediction, and the surprise you experience is a result of low mutual information between your retinal activation patterns and the book-expectation in your head. That surprise (prediction error) is the learning signal that propagates up to your more abstract world-model, which updates into a state consistent with "the box has a screwdriver in it".
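To put a rough number on that moment (my own framing, using Bayesian surprisal rather than mutual information, so take the details loosely): the prior strongly expects a book, the image overwhelmingly favours a screwdriver, and the large surprisal of the observation is the error signal that flips the world-model.

```python
# Bayesian-surprisal version of the book/screwdriver moment (illustrative only).
import math

prior = {"book": 0.95, "screwdriver": 0.05}        # expectation before the box is opened
likelihood = {"book": 0.01, "screwdriver": 0.99}   # how well each hypothesis explains the image

evidence = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}

surprise = -math.log2(evidence)                    # ~4.1 bits: a strong prediction error
rounded = {h: round(p, 2) for h, p in posterior.items()}
print(f"surprise: {surprise:.1f} bits, posterior: {rounded}")
# The posterior now puts ~0.84 on "screwdriver": the world-model has updated.
```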

During this process, there was a moment when I was just beginning to parse the incoming image of the newly-opened box, expecting to see a book inside. A fraction of a second later, my brain recognizes that my expectation was wrong, and that’s the “surprise”. So in other words, my brain had an active expectation about something that had already happened—about photons that had by then already arrived at my retina—and that expectation was incorrect, and that’s what spurred me to update my world-model. Postdiction, not prediction.

Right, but the part of your brain that had that high-level model of "there is a book in the box" had at that time not received contradictory input from the lower-level edge detection / feature extraction areas. The abstract world-model does not directly predict retinal activations; it predicts the activations of lower-level sensory processing areas, which in turn predict the activations of the retina, cochlea, etc. There is latency in this system, so the signal takes a bit of time to filter up from the retina, through the lower-level visual areas, to your world-model. I don't think "postdiction" makes sense in this context, as each brain region is predicting the activations of the one below it, and updates its state when those predictions are wrong.
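Here is a minimal toy model of what I mean (my own sketch, with made-up linear levels): each level predicts the activation of the level below, sees that level with one tick of latency, and updates on the resulting prediction error, so the top-level world-model briefly holds an expectation about input that has already changed.

```python
# Toy three-level predictive hierarchy: each level predicts the activation of
# the level below it, sees that level with one tick of latency, and updates on
# the prediction error.
lr = 0.5
activations = [0.0, 0.0, 0.0]   # [retina, edge/feature areas, abstract world-model]


def tick(sensory_input):
    """One time-step: new input arrives at the bottom; errors filter up one level per tick."""
    previous = list(activations)                 # what each level can "see" this tick (lagged)
    activations[0] = sensory_input
    for i in range(1, len(activations)):
        error = previous[i - 1] - previous[i]    # bottom-up signal vs top-down prediction
        activations[i] = previous[i] + lr * error


# Box is opened at t = 0: retinal input jumps from "book-like" (0) to "screwdriver-like" (1).
for t in range(5):
    tick(sensory_input=1.0)
    print(t, [round(a, 2) for a in activations])
# The world-model only starts moving at t = 2: it briefly holds an expectation
# about retinal input that has already changed -- hence the feel of "postdiction".
```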

(Also, I think Easy Win #3 is a really good point for Predict-O-Matic-esque systems.)