Definition. As I use the word, values are decision-influences (also known as shards). “I value doing well at school” is shorthand for “in a range of contexts, there exists an influence on my decision-making which upweights actions and plans that lead to e.g. learning and good grades and honor among my classmates.”
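One very rough way to gesture at this in code (my own sketch, not the author's formalism; `Shard`, `choose_plan`, and the school example are all made up for illustration):

```python
# A rough sketch (mine, not the author's formalism) of "values as
# decision-influences": each shard activates in some contexts and
# upweights plans it favors; the agent picks the plan with the most
# total support.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Shard:
    """A contextually activated decision-influence."""
    name: str
    active_in: Callable[[str], bool]   # does this context activate the shard?
    bid: Callable[[str], float]        # how strongly it upweights a given plan


def choose_plan(context: str, plans: List[str], shards: List[Shard]) -> str:
    """Sum the bids of all shards active in this context and pick the best plan."""
    scores: Dict[str, float] = {
        plan: sum(s.bid(plan) for s in shards if s.active_in(context))
        for plan in plans
    }
    return max(scores, key=scores.get)


# Hypothetical example: a "doing well at school" shard.
school_shard = Shard(
    name="do well at school",
    active_in=lambda ctx: "school" in ctx,
    bid=lambda plan: 1.0 if "study" in plan else 0.0,
)

print(choose_plan("evening before a school exam", ["study", "play games"], [school_shard]))
# -> "study"
```

The point of the sketch is only that a "value" here is not a goal the agent optimizes directly, but one of several context-dependent influences that jointly shape which plan gets picked.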
Summaries of key points:
Overall disagreement:
I've remained somewhat confused about the exact grader/non-grader-optimizer distinction I want to draw, at least intensionally (which is why I've focused on giving examples, in the hope of getting the vibe across).
Yeah, I think I have at least some sense of how this works in the kinds of examples you usually discuss (though my sense is that it's well captured by the "grader is complicit" point in my previous comment, which you presumably disagree with).
But I don't see how to extend the extensional definition far enough to get to the ...
Alternate title: "Somewhat Contra Scott On Simulators".
Scott Alexander has a recent post up on large language models as simulators.
I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant.
(But see caveats about the simulator framing from Beth Barnes here.)
These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun.
In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner...
Interesting. I think this clarifies things, but the framing also isn't quite as neat as I'd like.
I'd be tempted to redefine/reframe this as follows:
• Outer alignment for a simulator - Perfectly defining what it means to simulate a character. For example, how can we create a specification language so that we can pick out the character that we want? And what do we do with counterfactuals, given that they aren't actually literal?
• Inner alignment for a simulator - Training a simulator to perfectly simulate the assigned character
• Outer alignment for characters - fi...
This will also be posted on the EA Forum, and included in a sequence containing some previous posts and other posts I'll publish this year.
Humans think critically about values and, to a certain extent, they also act according to their values. To the average human, the difference between increasing world happiness and increasing world suffering is huge and evident, while goals such as collecting coins and collecting stamps are roughly on the same level.
It would be nice to make these differences as obvious to an AI as they are to us. Even though exactly copying what happens in the human mind is probably not the best strategy for designing an AI that understands ethics, having an idea of how value works in humans is a good starting point.
So, how...
Originally posted on the EA Forum for the Criticism and Red Teaming Contest. It will be included in a sequence containing some previous posts and other posts I'll publish this year.
AI alignment research centred around the control problem works well for futures shaped by out-of-control misaligned AI, but not so well for futures shaped by bad actors using AI. Section 1 contains a step-by-step argument for that claim. In section 2 I propose an alternative which aims at moral progress instead of direct risk reduction, and I reply to some objections. Section 3, which I will publish at some point in the future, will give technical details about the alternative.
The appendix clarifies some minor ambiguities with terminology and links to other stuff.
Produced as part of the SERI ML Alignment Theory Scholars Program Winter 2022 Cohort.
Not all global minima of the (training) loss landscape are created equal.
Even if they achieve equal performance on the training set, different solutions can perform very differently on the test set or out-of-distribution. So why is it that we typically find "simple" solutions that generalize well?
In a previous post, I argued that the answer is "singularities" — minimum loss points with ill-defined tangents. It's the "nastiest" singularities that have the most outsized effect on learning and generalization in the limit of large data. These act as implicit regularizers that lower the effective dimensionality of the model.
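To make "ill-defined tangents" concrete, here is a toy loss of my own (not an example from the post): its set of global minima is the union of the two coordinate axes, which is smooth everywhere except where the axes cross.

```latex
% Toy example (mine, not from the post): a two-parameter loss whose
% minimum set is singular at the origin.
\[
  L(w_1, w_2) = (w_1 w_2)^2, \qquad
  \{L = 0\} \;=\; \{w_1 = 0\} \cup \{w_2 = 0\}.
\]
% Away from the origin the minimum set is a smooth line, and the Hessian has
% rank 1 (one flat direction). At the crossing point (0,0) the tangent is
% ill-defined and the Hessian vanishes entirely:
\[
  \nabla^2 L(0, a) = \begin{pmatrix} 2a^2 & 0 \\ 0 & 0 \end{pmatrix}
  \quad (a \neq 0),
  \qquad
  \nabla^2 L(0, 0) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}.
\]
% More directions are flat at the singular point, i.e. the effective
% dimensionality is lower there -- which is the sense in which, in the post's
% framing, singularities act as implicit regularizers.
```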
Even after writing...
Epistemic status: Highly speculative hypothesis generation.
I had a similar, but slightly different picture. My picture was the tentacle-covered blob.
When doing gradient descent from a random point, you usually get to a tentacle (i.e. the red trajectory).
When the system has been running for a while, it is randomly sampling from the black region. Most of the volume is in the blob. (The randomly wandering particle can easily find its way from the tentacle to the blob, but the chance of it randomly finding the end of a tentacle from within the blob is smal...
Thanks to Ian McKenzie and Nicholas Dupuis, collaborators on a related project, for contributing to the ideas and experiments discussed in this post. Ian performed some of the random number experiments.
Also thanks to Connor Leahy for feedback on a draft, and thanks to Evan Hubinger, Connor Leahy, Beren Millidge, Ethan Perez, Tomek Korbak, Garrett Baker, Leo Gao and various others at Conjecture, Anthropic, and OpenAI for useful discussions.
This work was carried out while at Conjecture.
I have received evidence from multiple credible sources that text-davinci-002 was not trained with RLHF.
The rest of this post has not been corrected to reflect this update. Not much besides the title (formerly "Mysteries of mode collapse due to RLHF") is affected: just mentally substitute "mystery method" every time "RLHF" is invoked...
OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.
I didn't get that impression when I read it - the NYer author and his friends prompted most of it, even if their friend Dan Selsam happens to work at OpenAI. (He seems to work on math LMs, nothing fiction- or RL-related.) They were set up with the public Playground interface, so the OA insider role here was limited to showing them a few completions and trying to explain it; presumably they did the rest more remote and parti...