Man, what a post!

My knowledge of alignment is somewhat limited, so keep in mind some of my questions may be a bit dumb simply because there are holes in my understanding.

It seems hard to scan a trained neural network and locate the AI’s learned “tree” abstraction. For very similar reasons, it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode…

I basically agree with the last sentence of this statement, but I'm trying to figure out how to square it with my knowledge of genetics. Political attitudes, for example, are heritable. Yet I agree there are no hardcoded versions of "Democrat" or "Republican" in the brain.

This leaves us with a huge puzzle. If we can’t say “the hardwired circuitry down the street did it”, where do biases come from? How can the genome hook the human’s preferences into the human’s world model, when the genome doesn’t “know” what the world model will look like? Why do people usually navigate ontological shifts properly, why don’t people want to wirehead, why do people almost always care about other people if the genome can’t even write circuitry that detects and rewards thoughts about people?

This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all indicate that a portion of the variance in human biases is heritable, and in that sense "hardcoded" in the genome. So the genome is definitely playing a role in creating and shaping biases. I don't know exactly how it does that, but we can observe that such biases are heritable, and we can actually point to specific base pairs in the genome that play a role.
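To make the twin-study point concrete, here is a minimal sketch of the classic Falconer-style decomposition, which estimates heritability from the gap between identical-twin and fraternal-twin correlations. The function name and the example correlations are made up for illustration, not taken from any study, and the method assumes additive genetic effects and equal shared environments for both twin types.

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Rough variance decomposition from twin correlations.

    r_mz: trait correlation between identical (monozygotic) twins
    r_dz: trait correlation between fraternal (dizygotic) twins
    Assumes additive genetics and equal shared environments.
    """
    h2 = 2.0 * (r_mz - r_dz)   # variance attributed to genes
    c2 = 2.0 * r_dz - r_mz     # variance attributed to shared environment
    e2 = 1.0 - r_mz            # unshared environment plus measurement error
    return {"h2": h2, "c2": c2, "e2": e2}


# Hypothetical correlations, chosen only to show the arithmetic.
estimate = falconer_ace(r_mz=0.75, r_dz=0.5)
print(estimate)
```

Note that an estimate like this only says how much of the variance tracks genetic similarity. It says nothing about the mechanism, which is exactly the open question here.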

Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.

Wow. I'm not sure if you're aware of this research, but shard theory sounds shockingly similar to Guyenet's description of how the lamprey, a parasitic fish, makes decisions in "The Hungry Brain". Let me just quote the whole section from Scott Alexander's review of the book:

How does the lamprey decide what to do? Within the lamprey basal ganglia lies a key structure called the striatum, which is the portion of the basal ganglia that receives most of the incoming signals from other parts of the brain. The striatum receives “bids” from other brain regions, each of which represents a specific action. A little piece of the lamprey’s brain is whispering “mate” to the striatum, while another piece is shouting “flee the predator” and so on. It would be a very bad idea for these movements to occur simultaneously – because a lamprey can’t do all of them at the same time – so to prevent simultaneous activation of many different movements, all these regions are held in check by powerful inhibitory connections from the basal ganglia. This means that the basal ganglia keep all behaviors in “off” mode by default. Only once a specific action’s bid has been selected do the basal ganglia turn off this inhibitory control, allowing the behavior to occur. You can think of the basal ganglia as a bouncer that chooses which behavior gets access to the muscles and turns away the rest. This fulfills the first key property of a selector: it must be able to pick one option and allow it access to the muscles.

Spoiler: the pallium is the region that evolved into the cerebral cortex in higher animals.

Each little region of the pallium is responsible for a particular behavior, such as tracking prey, suctioning onto a rock, or fleeing predators. These regions are thought to have two basic functions. The first is to execute the behavior in which it specializes, once it has received permission from the basal ganglia. For example, the “track prey” region activates downstream pathways that contract the lamprey’s muscles in a pattern that causes the animal to track its prey. The second basic function of these regions is to collect relevant information about the lamprey’s surroundings and internal state, which determines how strong a bid it will put in to the striatum. For example, if there’s a predator nearby, the “flee predator” region will put in a very strong bid to the striatum, while the “build a nest” bid will be weak…

Each little region of the pallium is attempting to execute its specific behavior and competing against all other regions that are incompatible with it. The strength of each bid represents how valuable that specific behavior appears to the organism at that particular moment, and the striatum’s job is simple: select the strongest bid. This fulfills the second key property of a selector – that it must be able to choose the best option for a given situation…

With all this in mind, it’s helpful to think of each individual region of the lamprey pallium as an option generator that’s responsible for a specific behavior. Each option generator is constantly competing with all other incompatible option generators for access to the muscles, and the option generator with the strongest bid at any particular moment wins the competition.
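The selection scheme in the quoted passage can be sketched as a toy winner-take-all selector. This is only an illustration of the description above, not a model from the book or from shard theory: the class name, the state keys, and the bid weights are all hypothetical. Every behavior is "off" by default, each option generator turns the current state into a bid, and the selector releases only the strongest one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, float]


@dataclass
class OptionGenerator:
    """One pallium-like region: knows one behavior and how to bid for it."""
    name: str
    bid: Callable[[State], float]  # maps current state to bid strength


def select_action(generators: List[OptionGenerator], state: State) -> Optional[str]:
    """Striatum-like selector: release only the strongest positive bid.

    Returns None when no generator bids above zero, mirroring the
    default inhibition that keeps all behaviors switched off.
    """
    best_name, best_bid = None, 0.0
    for gen in generators:
        strength = gen.bid(state)
        if strength > best_bid:
            best_name, best_bid = gen.name, strength
    return best_name


# Hypothetical generators with made-up weights.
generators = [
    OptionGenerator("flee_predator", lambda s: 10.0 * s.get("predator_nearby", 0.0)),
    OptionGenerator("track_prey", lambda s: 3.0 * s.get("hunger", 0.0)),
    OptionGenerator("build_nest", lambda s: 1.0 * s.get("mating_season", 0.0)),
]

print(select_action(generators, {"predator_nearby": 1.0, "hunger": 1.0}))
print(select_action(generators, {"hunger": 0.5}))
```

The shard-theory version differs in at least one way worth noting: shards bid for and against plans evaluated through a world model, whereas this sketch only arbitrates between immediate behaviors.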

You can read the whole review here or the book here. It sounds like you may have independently rederived a theory of how the brain works that neuroscientists have known about for a while.

I think this independent corroboration of the basic outline of the theory makes it even more likely shard theory is broadly correct.

I hope someone can work on the mathematics of shard theory. It seems fairly obvious to me that shard theory or something similar to it is broadly correct, but for it to impact alignment, you're probably going to need a more precise definition that can be operationalized and give specific predictions about the behavior we're likely to see.

I assume that shards are composed of some group of neurons within a neural network, correct? If so, it would be useful if someone could actually map them out. Exactly how many neurons are in a shard? Does the number change over time? How often do neurons in a shard fire together? Do neurons ever get reassigned to another shard during training? In self-supervised learning environments, do we ever observe shards guiding behavior away from contexts in which other shards with opposing values would be activated?

Answers to all the above questions seem likely to be downstream of a mathematical description of shards.