Alignment is fundamentally about human imitation. We want machines that resemble us in certain key aspects, but evidently not all; those aspects are mostly tied to inner motivations. Since current models are very complicated, alignment will only be feasible if we can target where, and at what level of abstraction, it must take place.

Take the thoughts "Maya is cold. That cat is dear to me. I have a blanket. I will use it to keep her warm." and "The family's cat is cold. I don't want to be in trouble for its death. I will use the blanket to keep it warm." Even though these two thoughts produce the same action, there is a stark difference in motivation. The reverse also happens: different reasoning and different actions, but similar intentions, as in "The cat is freezing. I don't want the cat to die, for I like him. Therefore, I will raise the thermostat to keep him warm."

Intentions are much easier to read when you have access to the thoughts. Therefore, we should align our models by looking at the thoughts behind their actions. If you cannot hide your thoughts well, you cannot hide your motivations well either, because the former are seeded in the latter.

Our neural networks are irreducibly complicated; we will never fully understand the computation they perform by staring at their weights. But consider the thought "I am very cold and I have a sweater. The family's cat, Maya, is shivering. Therefore, she is cold. Encapsulating a being can reduce the outward flow of heat. This blanket can adapt its form to objects I put it onto. Therefore, I will place it on the cat to keep her warm." Going down a level of abstraction did not provide more information about the cat owner's intentions. We can therefore save ourselves work by targeting only a higher level of thought: there is no need to understand every parameter in the network.

Most thoughts are not formed all at once; they are formed step by step. You work to form conclusions, or intermediate thoughts, and once these are formed you take them as stepping stones, vantage points from which the chain continues without redoing the computation every time. (This is why mathematics is seen as a platonic ideal of thinking: a proof is a label telling you that a stepping stone will stay solid forever.)

If we train our models to generate these intermediate thoughts in a readable format, we can apply reinforcement learning from human feedback (RLHF) to align them. Our agents must then be trained so that their strongest capabilities are only reached by forming chains of thought that we can verify. To keep training fast, RLHF should be applied only at intermediate iterations of the training process, but with large learning rates; the rest of the time, training should be applied only to the outputs. A minimal sketch of such a schedule is given below.
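To make the last proposal concrete, here is a minimal sketch in PyTorch. It assumes a toy policy that separately exposes an intermediate "thought" and a final answer, a placeholder human-feedback reward, and an assumed schedule constant FEEDBACK_EVERY; none of these come from an existing implementation. It only illustrates the interleaving (frequent output-only updates at a small learning rate, occasional thought-level feedback updates at a larger one), not a full RLHF pipeline, which would use a learned reward model over text and a policy-gradient method.

```python
# Schematic sketch of the proposed schedule: most steps train on outputs only,
# and every FEEDBACK_EVERY steps a human-feedback signal on the readable
# intermediate thoughts is applied with a larger learning rate.
# All components here are toy placeholders, not a real language-model setup.

import torch
import torch.nn as nn

class ToyPolicy(nn.Module):
    """Stand-in for a model that emits an intermediate thought and a final answer."""
    def __init__(self, dim=32):
        super().__init__()
        self.thought_head = nn.Linear(dim, dim)  # produces intermediate-thought features
        self.answer_head = nn.Linear(dim, dim)   # produces final-output features

    def forward(self, x):
        thought = torch.tanh(self.thought_head(x))
        answer = self.answer_head(thought)
        return thought, answer

def output_loss(answer, target):
    """Ordinary capability objective on the final output only."""
    return nn.functional.mse_loss(answer, target)

def human_feedback_reward(thought):
    """Placeholder for a reward model trained on human ratings of thoughts.

    In practice this would score whether the readable chain of thought reflects
    acceptable motivations; here it is just a dummy scalar."""
    return -thought.pow(2).mean()

policy = ToyPolicy()
base_opt = torch.optim.SGD(policy.parameters(), lr=1e-3)      # small LR, used every step
feedback_opt = torch.optim.SGD(policy.parameters(), lr=1e-1)  # large LR, used rarely

FEEDBACK_EVERY = 100  # assumed schedule: occasional alignment phases

for step in range(1000):
    x = torch.randn(8, 32)
    target = torch.randn(8, 32)
    thought, answer = policy(x)

    if step % FEEDBACK_EVERY == 0:
        # Intermittent alignment phase: push the readable thoughts toward
        # what human raters approve of, with a big learning rate.
        loss = -human_feedback_reward(thought)
        feedback_opt.zero_grad()
        loss.backward()
        feedback_opt.step()
    else:
        # Ordinary capability training on outputs only.
        loss = output_loss(answer, target)
        base_opt.zero_grad()
        loss.backward()
        base_opt.step()
```

The point of the sketch is only the structure: capability training touches the outputs at every step, while the thought-level feedback is cheap because it fires rarely, and its larger learning rate is what lets those rare updates still steer the model's motivations.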