We’ll be able to fine-tune in the test environment so won’t experience OOD at deployment, and while changes will happen, continual fine-tuning will be good enough to stop the model from ever being truly OOD. We think this may apply in settings where we’re using the model for prediction, but it’s unclear whether continual fine-tuning will be able to help models learn and adapt to the rapid OOD shifts that could occur when the models are transferred from offline learning to online interaction at deployment.
Couldn't the model just fail at the start of fine-tuning (because it's causally confused), then learn in a decision setting to avoid causal confusion, and then no longer be causally confused? If no - I'm guessing you expect that the model only unlearns some of its causal confusion. And there's always enough left so that after the next distribution shift the model again performs poorly. If so, I'd be curious why you believe that the model won't unlearn all or most of its causal confusion.
This distillation was useful for me, thanks for making it! As feedback, I got stuck at the bullet-point explanation of imitative generalization. There was not enough detail to understand it so I had to read Beth's post first and try connect it to your explanation. For example kind of changes are we considering? To what model? How do you evaluate if an change lets the human make better predictions?
A large amount of math describes the relations between agents at the same level of analysis: this is almost all of game theory. [...] our focus is on "vertical" relations, between composite agents and their parts.
This seems to be what is studied in the fields of organizational economics and to some extent in industrial organization / vertical integration. These fields have a great deal of game theory on vertical relationships, particularly relationships between the firm and its employees, managers, and contractors. Some of this can probably be ported to your interfaces. These fields are unsolved though, which means there's work left to do, but also that it's been difficult to find simple solutions, perhaps because you're modeling complex phenomena.
I like your section on self-unaligned agents btw. Curious what comes out of your centre.
Some minor feedback points: Just from reading the abstract and intro, this could be read as a non-sequitur: "It limits our ability to mitigate short-term harms from NLP deployments". Also, calling something a "short-term" problem doesn't seem necessary and it may sound like you think the problem is not very important.
OpenAI's work speeds up progress, but in a way that's likely smooth progress later on. If you spend as much compute as possible now, you reduce potential surprises in the future.
On 2): Being overparameterized doesn't mean you fit all your training data. It just means that you could fit it with enough optimization. Perhaps the existence of some Savant people shows that the brain could memorize way more than it does.
On 3): The number of our synaptic weights is stupendous too - about 30000 for every second in our life.
On 4): You can underfit at the evolution level and still overparameterize at the individual level.
Overall you convinced me that underparameterization is less likely though. Especially on your definition of overparameterization, which is relevant for double descent.
Why do you think that humans are, and powerful AI systems will be, severely underparameterized?
Also interesting to see that all of these groups were able to coordinate to the disadvantage of less coordinates groups, but not able to reach peace among themselves.
One explanation might be that the more coordinated groups also have harder coordination problems to solve because their world is bigger and more complicated. Might be the same with AI?
If X is "number of paperclips" and Y is something arbitrary that nobody optimizes, such as the ratio of number of bicycles on the moon to flying horses, optimizing X should be equally likely to increase or decrease Y in expectation. Otherwise "1-Y" would go in the opposite direction which can't be true by symmetry. But if Y is something like "number of happy people", Y will probably decrease because the world is already set up to keep Y up and a misaligned agent could disturb that state.
I should've specified that the strong version is "Y decreases relative to a world where neither of X nor Y are being optimized". Am I right that this version is not true?