LW1.0 username Manfred. Day job is condensed matter physics, hobby is thinking I know how to assign anthropic probabilities.
I initially thought you were going to debate Beth Barnes.
Also, thanks for the episode :) It was definitely interesting, although I still don't have a good handle on why some people are optimistic that there aren't classes of arguments humans will "fall for" irrespective of their truth value.
One generalization I am also interested in is learning not merely abstract objects within a big model, but entire self-contained abstract levels of description, together with the actions and state transitions that move you between abstract states. E.g. not merely detecting that "the grocery store" is a sealed box ripe for abstraction, but that "go to the grocery store" is a valid action within a simplified world-model with nice properties.
This might be significantly more challenging to say something interesting about, because it depends not just on the world but on how the agent interacts with the world.
Very nice overview! Of course, I think most of the trick is crammed into that last bit :) How do you get a program to find the "common-sense" implied model of the world to use for counterfactuals?
Even when talking about how humans shouldn't always be thought of as having some "true goal" that we just need to communicate, it's so difficult to avoid talking in that way :) We naturally phrase alignment as alignment to something - and if it's not humans, well, it must be "alignment with something bigger than humans." We don't have the words to be more specific than "good" or "good for humans," without jumping straight back to aligning outcomes to something specific like "the goals endorsed by humans under reflective equilibrium" or whatever.
We need a good linguistic science-fiction story about a language with no such issues.
I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on training parameters. But hey, maybe I'm wrong and there's some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.
Sure, but if you're training on less data it's because fewer parameters is worse :P
I'm not sure how your reply relates to my guess, so I'm a little worried.
If you're intending the compute comment to be in opposition to my first paragraph, then no - when finetuning a subset of the parameters, compute is not simply proportional to the size of the subset you're finetuning, because you still have to do all the matrix multiplications of the original model, both for inference and for gradient propagation. I think the paper finetuned only a subset to make a scientific point, not to save compute.
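To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of about 2 FLOPs per parameter per token for the forward pass, about 2 more to propagate gradients through the activations, and about 2 more to compute weight gradients (only needed for the parameters being trained). It also assumes some trainable parameters sit near the input, so gradients have to flow through the whole network. The function name and the parameter counts are made up for illustration, not taken from the paper.

```python
def train_flops_per_token(n_params: int, n_trainable: int) -> int:
    """Rough training cost per token under the ~2/2/2 FLOPs-per-parameter rule of thumb.

    Assumption: gradients must be propagated through every layer, because
    some trainable parameters are near the input of the network.
    """
    forward = 2 * n_params               # full forward pass, frozen or not
    backward_activations = 2 * n_params  # gradient propagation through all layers
    backward_weights = 2 * n_trainable   # weight gradients only for trained parameters
    return forward + backward_activations + backward_weights


# Hypothetical sizes: a 1B-parameter model, finetuning everything vs. 0.1% of it.
full = train_flops_per_token(1_000_000_000, 1_000_000_000)
subset = train_flops_per_token(1_000_000_000, 1_000_000)
ratio = subset / full

print(f"full finetune:   {full:.3e} FLOPs/token")
print(f"subset finetune: {subset:.3e} FLOPs/token")
print(f"ratio:           {ratio:.3f}")
```

Under these assumptions, training 1000x fewer parameters only brings the per-step cost down to about two thirds of full finetuning, nowhere near 1000x cheaper.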
My edit question was just because you said something about expecting the # of steps to be 3 OOM fewer for a 3 OOM smaller model. But iirc it's really more like the compute will be smaller, but the # of steps won't change much (they're just cheaper).
Do you have a reference for this picture of "need lots more data to get performance improvements"? I've also heard some things about a transition, but as a transition from compute-limited to data-limited, which means "need lots more compute to get performance improvements."
I think it's plausible that the data dependence will act like it's 3 OOM smaller. Compute dependence will be different, though, right? Even if you're just finetuning part of the model you have to run the whole thing to do evaluation. In a sense this actually seems like the worst of both worlds (but you get the benefit from pretraining).
Edit: Actually, I'm confused why you say a smaller model needs that factor fewer steps. I thought the slope on that one was actually quite gentle. It's just that smaller models are cheap - or am I getting it wrong?
But is that true? Human behavior has a lot of information. We normally say that this extra information is irrelevant to the human's beliefs and preferences (i.e. the agential model of humans is a simplification), but it's still there.
Sure, but the interesting thing to me isn't fixed points in the input/output map, it's properties (i.e. attractors that are allowed to be large sets) that propagate from the answers seen by a human in response to their queries, into their output.
Even if there's a fixed point, you have to further prove that this fixed point is consistent - that it's actually the answer to some askable question. I feel like this is sort of analogous to Hofstadter's Q-sequence.
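For anyone who hasn't run into it, the Q-sequence is defined by Q(1) = Q(2) = 1 and Q(n) = Q(n - Q(n-1)) + Q(n - Q(n-2)). The analogy I have in mind is that each term asks the sequence about itself, and as far as I know nobody has proven that the indices always stay in range, i.e. that the sequence is defined for all n. A minimal sketch (the function name is my own, and the list is 0-indexed, so `q[i]` holds Q(i+1)):

```python
def hofstadter_q(n_terms: int) -> list[int]:
    """Return the first n_terms of Hofstadter's Q-sequence.

    Raises ValueError if the recursion ever refers to a term before Q(1),
    which is the way the sequence could fail to be well-defined.
    """
    q = [1, 1]  # Q(1) = Q(2) = 1
    for i in range(2, n_terms):
        a = i - q[i - 1]  # 0-based index of Q(n - Q(n-1))
        b = i - q[i - 2]  # 0-based index of Q(n - Q(n-2))
        if a < 0 or b < 0:
            raise ValueError(f"Q({i + 1}) refers to an undefined term")
        q.append(q[a] + q[b])
    return q[:n_terms]


print(hofstadter_q(10))  # [1, 1, 2, 3, 3, 4, 5, 5, 6, 6]
```

Computing any finite prefix only shows that the recursion hasn't broken yet, which is the same gap as between "a fixed point exists" and "the fixed point is the answer to an askable question."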