Thanks to Steve Byrnes, Adam Shimi, John Wentworth, and Peter Barnett for feedback.

In a nutshell, my plan A is to understand what we want from superintelligent AI really, really well. So well that we can write down a way of modeling humans that illuminates human preferences including higher-order preferences about how we want to be modeled, and do this in a principled rather than ad-hoc way. Achieving this understanding is highly ambitious, in a way that is mostly but not entirely parallel to "ambitious value learning."

If we understand value learning before we build superintelligent AI, there's a straightforward path to achieving a good future without paying a costly alignment tax - by alignment tax I mean everything that slows down aligned AI being developed and "ready for liftoff," and that collectively creates selection pressure against safety. This problem is more tractable than many people think.

I think of this plan as an instance of the more general plan "solve value learning first." Some other tenable plans are "try to put humans in control," "get the AI to do good things prosaically," and "tinker with the AI until it value learns" - plus intermediate points between these.


What do I think the future is like?

I expect superintelligent AI in the short to medium term, centralized around a small number of points of development.

By short to medium term, I mean I'd put my 50% confidence interval between 2031 and 2049. I don't think we need compute to be many orders of magnitude cheaper, and I don't think we need two or more paradigm shifts on the order of neural nets overtaking support vector machines. The timeline is urgent, but not to the point that we should start ditching things like blue-sky research or gradual coalition-building.

By centralized, I mean it's possible to make big changes by having good solutions implemented in a small number of systems. Coordination may be important, but it isn't an inherent part of solving the problem.

All that said, that's just what I think is going to happen, not what's required for "solve value learning first" to be a good idea. Value learning research is still valuable in decentralized scenarios, unless we go so far as to avoid powerful agential AI long-term. Because it's more on the blue-sky end of the spectrum, longer timelines actually favor solving value learning over more atheoretic approaches, while if timelines are very short I'd advocate for "try to put humans in control" and hope for the best.

If we fail to understand value learning before we build superintelligent AI, I'm worried about some combination of groups committed to building aligned AI being less competitive because we can't learn human values efficiently, and practical-minded alignment schemes having bad behavior in edge cases because of simplifying assumptions about humans.

A basic example: if humans are assumed not to be manipulable, then an AI that thoroughly maximizes what humans (are modeled to) want will be incredibly manipulative. From the AI's perspective, the humans love being deceived, because why else would they rate it so highly? And in fact it's a tricky technical problem to avoid manipulation without sophisticated value learning, because the notion of "manipulation" is so intertwined with human meta-preferences - labeling things as manipulation means un-endorsing some of our revealed preferences. Similar hidden gotchas can pop up in other attempts to cut corners on human modeling, and at some point it just becomes faster to solve value learning than to deal with each gotcha individually.
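To make the failure mode concrete, here's a toy sketch (all numbers, names, and the "deception" label are invented for illustration; deciding what counts as deception is exactly the hard part the paragraph describes):

```python
# Toy illustration: a value learner that assumes human ratings can't be
# corrupted will conclude that humans love being deceived.

# Each candidate action: (name, human rating of the outcome, whether the
# rating was produced while the human was being deceived)
actions = [
    ("honest report", 0.6, False),
    ("flattering deception", 0.9, True),
]

def naive_value(rating, deceived):
    # Simplifying assumption: humans are not manipulable, so the rating
    # is taken at face value as the action's true value.
    return rating

def deception_aware_value(rating, deceived):
    # Discard ratings produced under deception as corrupted evidence.
    return 0.0 if deceived else rating

best_naive = max(actions, key=lambda a: naive_value(a[1], a[2]))[0]
best_aware = max(actions, key=lambda a: deception_aware_value(a[1], a[2]))[0]
print(best_naive)  # "flattering deception": the naive learner endorses manipulation
print(best_aware)  # "honest report"
```

Of course, the `deceived` flag does all the work here, and in reality nothing hands it to us - labeling an interaction as manipulation is entangled with our meta-preferences, which is the point.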


What's the broad plan?

The basic strategy can be summed up as "be Geoff Hinton" (as in godfather of deep learning Geoffrey Hinton). Know an important truth before it's practical and keep trying to make it work, so that you're probably in the process of trying it when it finally becomes practical. Right now we don't know how to do value learning right, even impractically (I'll back up this claim in a later section), so step one is to get cracking on that.

Bold claim: The alignment tax is tiny if we know what we're doing with human modeling. This is one of the big pluses of solving value learning first. A lot of alignment proposals involve steps like "get hugely expensive human feedback to train a reward model." If we actually know what we're doing, our AI can primarily learn about humans using the same data it uses to learn about the world in general. When it comes to speed, we don't want safe AI to have to wait to be perfect while the unsafe competition doesn't. So if we actually know what we're doing, our AI will be able to reason under uncertainty and take sufficiently obvious actions quickly, while continuing to learn about humans.

These optimistic claims neglect to mention the possibility that cutting-edge AI designs will be really bad at implementing our hoped-for alignment schemes. If the most advanced AIs are built by simulating evolution, we're probably just hosed. Value learning requires components including powerful unsupervised learning and the proto-consequentialist ability to choose actions or policies based on future states or actions. Fortunately, I expect superintelligent AI that looks like near-future ML to have the building blocks handy (albeit with a probability more like 80% than 99%, so maybe I should temper that bold claim) - I think that Mu-N will take over the world long before GPT-N does. And given a paradigm amenable to value learning, learning about human values is only a small fraction as hard as learning about the entire world. If only we knew how to spell out what it is we want our AI to learn, it would learn it - but we don't know how, is the problem.

I don't expect the alignment tax to be negative - the plan isn't that the people thinking about value learning should immediately transition to building superintelligent AI by themselves. By "keep trying to make it work," I mean that if we understand well ahead of time what it is we want our AI to learn, we can keep reinterpreting the same idea to keep up with progress in AI. We can keep trying to build proofs of concept, and we get well-acquainted with what the process of "take progress in AI and use it for value learning" looks like. What we learn about human values will probably have to make it into the design of superintelligent AI via science communication and persuasion, using proofs of concept and our explicit reasoning about them as tools of communication. This is pretty scary, especially for a topic more than a little related to philosophy, but I'm actually somewhat optimistic about convincing important people of good ideas (outside of politics).


Is solving value learning really necessary?

What about pragmatically / prosaically learning object-level human preferences by training a large language model, or other similar plans to have low-tax alignment without sweating about philosophy? Am I saying that they can't possibly work?

I think prosaic alignment could work. I have some reservations about it that I've mentioned above, but above and beyond those I also have to admit to idealism that wants to solve the alignment problem "the right way," i.e. actually understanding what we want from superintelligent AI. We don't actually want AI that just conforms to our present-day average object-level judgments. To quote CEV, what we want "is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted."

All implementable alignment schemes are going to make simplifying assumptions; prosaic alignment just makes assumptions that are convenient for design or implementation but have a less careful story about how they still let us get what we want from superintelligent AI. This typically leads to weird edge cases where the assumptions become important. If we've skimped on human modeling, this will show in places where it's important what humans want to want, or how humans want the world to be modeled. There are probably some atheoretic ways to avoid these edge cases, but I'd rather handle them by solving value learning.

Why do we have to try to solve the whole problem at once? What about setting up a "long reflection" or other intermediate state from which we can then try to realize humanity's long-term potential?

I have nothing fundamentally against this either. But I think it involves some quite tricky technical and social problems. If we're imagining using AI to take over the world and prevent anyone else from taking it back, then in my expectation either we have a chicken and egg problem where the AI should already be aligned, or we've adopted a strategy that looks like "put some humans in control of something really powerful and hope for the best." Plausibly we could handle the problems this raises, but my comparative advantage lies in trying to find solutions where we don't have to.


Do we really not know how to do value learning?

We really don't know how to do value learning. At best, we currently understand how to learn the values of human-like creatures that satisfy some powerful extra conditions on their behavior. 

An example of what I mean is delegative reinforcement learning (Kosoy 2019). By delegating actions to the AI's supervisors when necessary, we can use even a somewhat dubious reward signal to learn the values of humans... so long as these humans have the superpower of avoiding "traps" in state space that might permanently harm, alter, or disenfranchise them. Because we can trust these superpowered humans' demonstrations, we don't need to try to model them as imperfect, mistake-prone beings in order to infer their preferences, we can just do RL.
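The core delegation move can be caricatured in a few lines (the threshold and probabilities are invented for illustration, not taken from the paper):

```python
# Toy sketch of delegative RL's key move: the agent acts on its learned
# reward only when it's confident no trap is at stake, and otherwise
# defers to the human, who is assumed never to walk into a trap.

TRAP_RISK_THRESHOLD = 0.05  # assumed risk tolerance, a free parameter here

def choose_controller(estimated_trap_probability):
    """Decide who acts in the current state."""
    if estimated_trap_probability > TRAP_RISK_THRESHOLD:
        # Delegate: the human's demonstrations are trusted to avoid traps,
        # which is exactly the "superpower" assumption doing the work.
        return "human"
    return "agent"

print(choose_controller(0.50))  # "human"
print(choose_controller(0.01))  # "agent"
```

Everything safety-relevant is smuggled into the assumption that the human branch really does avoid traps; drop that superpower and the guarantee goes with it.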

This conveniently leads us to another example: using CIRL (Hadfield-Menell et al. 2016), we can learn the values of humans even if they're irrational - but only under the condition that we can write down a specific model of these humans and their irrationality, plus a model of the world in which we've specified which dynamics count as the human's actions. Because we don't know how to do this in the real world, we end up inferring the values of whatever processes fill the agent-shaped hole in the simplified model of the world we write down for CIRL.
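A toy Bayesian calculation in the CIRL spirit shows how much we have to write down by hand (the Boltzmann-rationality model, the temperature, and the hypothesis space below are all invented modeling choices, which is the point):

```python
import math

# Toy inference of a reward function from one observed human action,
# given a hand-specified irrationality model (Boltzmann-rational choice).

beta = 2.0  # assumed rationality parameter: we must pick this ourselves
actions = ["salad", "cake"]
reward_hypotheses = {
    "values health": {"salad": 1.0, "cake": 0.0},
    "values taste": {"salad": 0.0, "cake": 1.0},
}

def likelihood(action, rewards):
    """P(action | rewards) under the assumed Boltzmann-rational human."""
    z = sum(math.exp(beta * rewards[a]) for a in actions)
    return math.exp(beta * rewards[action]) / z

# Posterior after observing the human pick "salad", from a uniform prior.
observed = "salad"
posterior = {h: likelihood(observed, r) for h, r in reward_hypotheses.items()}
total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}
print(posterior)  # "values health" comes out more probable
```

The inference only works because we declared what the human's actions are, what the candidate rewards are, and how irrational the human is; real humans come with none of those declarations attached.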

The most "prosaic" example is that we can use language models to mimic human-generated text about values (e.g. Jiang et al. 2021). Eventually, this might faithfully learn the process that generates the text, which is kind of like learning human values, and then we could ask this system to evaluate possible actions, policies, or states. In fact, any system that learns to imitate human behavior would be doing "value learning" in this sense. However, the problems with this are precisely the things that the additional features of DRL and CIRL were trying to prevent. It will happily make irreversible changes on shaky grounds, which tanks its safety guarantees. It acknowledges no separation between human values and the outputs of human cognition, so it might not endorse the concrete actions that lead to a cure for cancer (since humans wouldn't know to endorse them), and it certainly wouldn't take any notion of human irrationality into account when figuring out what we want. We might be able to extract evaluations we're happy with if we're clever about what to ask for and have a model of what it means, or if we take advantage of our access to the model's internal states, though I'd rather pursue this kind of cleverness as part of a principled value learning paradigm than as an alternative to one.

The dream, if we had the "textbook from the future," is that we should be able to write down a value learning scheme more efficient, more outer aligned, and more trustworthy than any of the three examples above. Efficient because it should make use of all the unsupervised data it has access to to do most of its learning about humans, rather than needing lots of supervision or a curated dataset. Outer aligned meaning the highest-ranked future its search algorithm can reach should be very good for us, rather than something that exploits our simplifying assumptions about humans or is an adversarial example. And trustworthy because we should have satisfying arguments for why this will work for real humans, and those arguments should be amenable to doing proof-of-concept empirical work. 


What do I want us to make progress on?

We do not, unfortunately, have the textbook from the future. But I want to move in that general direction, which means solving a lot of conceptual problems. What do I mean by "conceptual problems"? Well, right now we couldn't design a satisfactory value learner even if we allowed ourselves infinite compute - those non-compute-related problems that are stopping us are conceptual problems. A common symptom is having to stop and ask "but what would we even want part X of the algorithm to do?"

My central tool is to keep asking "What do we really want from AI?" and "What does that mean in the real world?" (And sometimes "How do humans do it?") This is how you get moves like going from "We want our values satisfied" to "We are physical systems without any platonic essence of Value floating around, so this is going to involve a choice of how to model us," and so on.

It might help if I summarize the sketch of value learning (for infinite compute) I'm most interested in right now, and then point out places where a miracle occurs.

Begin sketch:

I think a good way to think about modeling human preferences is to imagine producing a huge variety of predictive models of the world, ranging from simple and limited to complicated and all-encompassing, that all include some claims about a value function.

For example, one model of the world might treat "Charlie is dancing" as an ontologically basic property of the world, in tension with "Charlie is eating" or "Charlie is taking the bus" (these are of course merely English-language labels for key inferences that the model makes from sense data). It might claim that I like dancing, and use that to predict that today I will eat, then take the bus, then dance.

Given a zillion such predictive models, we have to somehow assign a score to each. This has to be done using a combination of labeling and some rules (which we might have higher-order preferences about) that are both about how to interpret those labels and about other ways to rank models. To give a toy example, I might start with some rules that rate the models based on how good they are at predicting versus how complicated they are. Then I might label some sense data of me dancing as good, and follow some more rules to increase the score of all the models that infer a valued state from that sense data.
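The toy example above can be sketched as code (every model, weight, and score below is invented purely to show the shape of the rules; nothing here is a proposal for the actual starting rules):

```python
# Toy scoring of value-laden predictive models: reward predictive fit,
# penalize complexity, and boost models that infer a valued state from
# the sense data labeled "good" (the dancing footage).

models = [
    # predictive log-likelihood, description length, and whether the model
    # infers a valued state from the labeled sense data
    {"name": "simple", "log_lik": -10.0, "complexity": 5.0, "matches_label": False},
    {"name": "rich", "log_lik": -4.0, "complexity": 20.0, "matches_label": True},
    {"name": "bloated", "log_lik": -3.5, "complexity": 80.0, "matches_label": True},
]

COMPLEXITY_WEIGHT = 0.1  # starting rules: exactly the kind of parameters
LABEL_BONUS = 2.0        # we might have higher-order preferences about

def score(m):
    s = m["log_lik"] - COMPLEXITY_WEIGHT * m["complexity"]
    if m["matches_label"]:
        s += LABEL_BONUS  # increase the score of models that read the
    return s              # labeled data as a valued state

ranked = [m["name"] for m in sorted(models, key=score, reverse=True)]
print(ranked)  # ['rich', 'bloated', 'simple']
```

Even this caricature makes the degrees of freedom visible: change the weights or the label-interpretation rule and a different model family wins, which is why we want the process itself to be revisable.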

Human models can be consulted to rate future states, plans, and actions, after aggregation according to yet more rules. I can tell stories of how this might go, like by translating models' value functions into the AI's native ontology before aggregating in a way that's conservative about the value of states or actions where the models disagree, but I want to emphasize that there are many options here that I don't understand yet, so stories should be held extra lightly.

Finally, we want to update the rules a little by looking at the portion of inferred human preferences that are higher-order preferences about the modeling process itself. For example, suppose it's the case that I'm a smoker and I can be well-predicted as wanting cigarettes, but if someone (including myself) is modeling me I want them to model the cigarettes as an addiction, not a "true" preference. We can update the rules of the value learning process to follow the inferred higher-order preferences a little better, and then run the process again for another step.
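This last step can be caricatured as a tiny update loop (the rule parameter, signal encoding, and step size are all invented; a real scheme would re-run the whole model-scoring process between updates):

```python
# Toy sketch of updating the value learning rules from inferred
# higher-order preferences, using the smoking example.

# Rule parameter: how strongly to favor models that treat my cigarette
# craving as a "true" preference rather than an addiction.
rules = {"smoking_as_value_bonus": 1.0}

LEARNING_RATE = 0.5  # assumed step size: update "a little," not all at once

def update_rules(rules, higher_order_signal):
    # higher_order_signal = -1.0 encodes the inferred meta-preference
    # "model my cigarette craving as addiction, not value."
    new_rules = dict(rules)
    new_rules["smoking_as_value_bonus"] += LEARNING_RATE * higher_order_signal
    return new_rules

for _ in range(3):
    # In the full scheme we'd re-infer preferences between steps; here we
    # just show the rule drifting toward the higher-order preference.
    rules = update_rules(rules, higher_order_signal=-1.0)

print(rules["smoking_as_value_bonus"])  # 1.0 -> 0.5 -> 0.0 -> -0.5
```

The small step size stands in for the conservatism the sketch calls for: follow the inferred higher-order preferences a little better, then run the process again.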

End sketch.

If you were following along at home, you've probably identified quite a few miracles required for this story. A lot of these miracles correspond to specific problems, both conceptual and practical, that we can try to make progress on. Here's a selection:

  • Does specifying human values in terms of value-laden predictive models really handle uncertainty appropriately, as the AI learns more about the world?
  • How can we (and should we) choose our starting rules so that highly-rated models are structured somewhat like a world that contains agents, rather than just an arbitrary computer program? Once we do that we lose some nice properties of universal Turing machines - do we need to make sure we're generating diverse enough models?
  • How can we choose rules so that highly-rated models have values that live in a "human-scale" ontology of the world rather than fundamental physics or cosmology?
  • What are the key things we want to achieve with our feedback and supervision given to the value learning AI (in combination with starting rules that are informed by our preferences)? Can we achieve this given large amounts of data from real humans? How? Are there more practical approaches that would let us use labeled data efficiently to get what we want?
  • How should this AI use the inferred values to make plans? There are lots of sub-questions here. For example, should it try to make plans using the value-laden models, or only using its native ontology?
  • How should we do translation between ontologies?
  • What judgments about which of our revealed preferences are biases and which are "true" preferences should we try to put into the initial conditions of a value learning process?
  • How should we elicit higher-order preferences from human models, and how should we translate them into changes in the parameters of the value learning procedure?
  • How do we reconcile the AI's learned higher-order preferences with the fact that it may have access to actions that affect its own hardware?
  • Which parts of the above list are generally useful for figuring out value learning, and which are specific to the sketch of a value learner I made up?

Some of these questions are part of groups that will all get answered at once by understanding general rules or arguments about what we want from superintelligent AI. Some of them are snowflakes that have to be considered individually. And some of them don't need to be answered at all, because they'll get replaced by different questions when we look at real-world value learning designs.

Is making progress on these questions doable? Certainly they're difficult, and they're a lot like questions in philosophy that people have a long history of beating their heads against to no avail. But I'm actually optimistic. First, empirically, I feel like our field, and me personally, have much better ideas about these problems than we did a decade ago (maybe this has something to do with naturalism being a decent framework for making philosophical progress). And second, methodologically, we get to do something in AI safety that you don't get to do in philosophy (at least not unilaterally), which is change the question if it's not helping. We're asking these questions not out of academic interest, but because we think they can help us build AI that does good things, and this real-world goal means everything impossible is irrelevant - if we ask for the impossible that doesn't mean we're doomed, it means we need to go back and ask for something less impossible.


Can it be practical?

Learning human values given infinite computing power is a nice party trick, but it's not ultimately what we're here for. We need to be able to translate progress on the conceptual problems of "how would we even build a good value learner with infinite resources" back to the real world of machine learning research.

I think that as conceptual progress is made, it's actually pretty easy to translate it back and get some benefits for practical AI safety. But also, it's not lost on me that I'm excited about how knowing what we're doing decreases the alignment tax, and yet as my primary "laboratory" I chose a situation with infinite alignment tax. It's predictably too slow to just take proposed solutions in the infinite-compute case and implement them approximately. Instead we have to take what we've learned about what we want from superintelligent AI in the "laboratory," and design machine learning systems that fulfill those wants.

Even before we solve the infinite-compute case, we can and should use knowledge about what we want from value learning to evaluate practical approaches. This might involve looking at current proposed alignment schemes and using idealized value learning as a lens to ask what they do well and what they do poorly. But perhaps more interestingly, it can also look like proposing experiments. Using ML models for value learning might raise questions about them that nobody's asked before, or suggest unusual things to try with them (e.g.).

The more progress we make, the more the line blurs between "experiments" and "preliminary attempts." If they go promisingly, this reinforces trust in this agenda; if they uncover practical difficulties, this helps us learn more; and if they appear to have been doomed from conception, this decreases trust in this agenda.

All this said, I get the feeling that my model of how this translation works isn't good enough. I suspect there are better ways to conceptualize this section, or even the whole agenda, if I better understood the feedback connecting conceptual and practical progress.


What does this research process look like?

From the outside, this research doesn't look like much - it looks like reading widely, thinking, writing in plain language, and talking with people. Only gradually, as we better know what we want to test, will it incorporate more theorem-proving, coding, and interaction with experiment. And then eventually it wants to transition to science communication and philosophy-free engineering problems.

It's absolutely no surprise that I'm proposing this on the Alignment Forum - the resemblance to Eliezer-style "unbounded analysis" is not coincidental. I can't tell you in this post what theorems I want to prove, because the current phase of the research is figuring out what theorems to prove.

Am I proposing that only I do this research, nobody else need apply? Gosh, I hope not. And yet if I had minions (e.g. grad students), I'd have a hard time figuring out what to tell them to do, indicating that scaling is hard - this is different from a straightforward engineering problem where we can map out how we'll build something in advance and easily split up the work. Maybe it would help if I tried to boil my process down into a "game" anyone can play (taking inspiration from Eliciting Latent Knowledge).

Here's a first try, which would probably benefit from playtesting:

  1. Take a promising value learning scheme, and describe it in enough detail for the game to work.
  2. Generate a list of miracles that need to happen in this scheme.
  3. Go through the list and try to find miracles that you can do one of the following to (and explain why):
    1. Write down a logical solution.
    2. Come up with a machine learning solution.
    3. Clarify or make easier by reasoning about higher level reasons why we wanted to ask for this miracle in the first place.
    4. Show how this miracle is inconsistent with what we can get in our universe and start thinking about what we should ask for instead.
    5. Break down this miracle into smaller pieces.
  4. After making progress, end up with a slightly different description of the value learning scheme and list of miracles.

One problem with this game (though not unique to it) is that the steps are still big; "Step 1: think of a promising solution to the whole problem" is a real dilly. Following the form of the game might help people work together, but it's not going to prevent new researchers from e.g. getting stuck in the mire of philosophy.

There are two lodestars I'd like to re-emphasize: First, read a lot, and try to engage with the community. No matter whether you're a beginner or an old hand, staying abreast of current thought can help make your promising value learning schemes more promising, the list of miracles you spot more clear-sighted, the transformations you make more plausible, and the explanations you give more relevant to other people's interests.

Second, keep asking yourself "What do we really want from AI?" and "What does that mean in the real world?" The first use is to keep the intended application in mind - even while you're off in the clouds thinking about the problem of radical translation, you should be thinking about the problem of radical translation to solve AI problems. The second use is to be scrupulously naturalistic. When you start to talk about a Platonic essence, or whatever else that can't be run on silicon, this is a golden opportunity because it means you've found a mistake in how you're formulating the problem, and when you figure out how to fix it you'll be closer to a solution. You still have to not fool yourself, but this makes it easier.


How is this approach dangerous?

There's a sort of danger that actually solving value learning gives off, that you don't feel from the strategy of merely tinkering with AI until it learns values pretty well. The danger comes from the fact that it's easier for us to fool ourselves, that it requires the success of longer chains of reasoning, and that it means giving up control earlier in the takeoff. This is not to say that other plans don't have their own serious problems, but we should also pay attention to the problems with value learning.

A lot of the unsolved problems here are conceptual (e.g. do we try to avoid self-referential stopping criteria, and if so, how?), and anything even close to the philosophy genre is in danger of not getting solved or, worse, getting solved wrongly but confidently. And then when trying to communicate the results, you might be plagued with the same issues that prevent philosophy from reaching consensus easily. We can only hope for patience, competent elites, and close ties to empirical work.

I'm unsure whether fooling ourselves is extra problematic for value learning because it's less robust to mistakes than approaches that require a ton of labeled data. If we think of robustness in terms of the number of accidental sign errors needed to turn a good outcome into an awful outcome, approaches that explicitly model human values are less robust to sign flips. But if we consider mistakes earlier and more complicated than sign flips, it's not so clear that value learning is more vulnerable.

An advantage of solving value learning is that it lets you go fast, because you don't need as much slow human oversight. But going fast is dangerous because it gives you less time to react, and less oversight, if you merely think you've solved value learning. In this case, though, I think our intuition may overestimate the danger. First because letting go of direct control can be dangerous-feeling even if it's justified. And second because if we go dangerously fast, it's probably because we think going slow would be even worse.


Appendix: What's this similar to?

This research plan is similar to the Stuart Russell school of alignment (Hadfield-Menell et al. 2016, Shah et al. 2020). Don't assume you know what humans want, and in fact don't even start out knowing how you want to model humans (in some technical sense that looks like "also learn higher-order preferences about how humans model themselves and want to be modeled"). We want an aligned AI to use some looser set of starting assumptions that lets it learn how humans want humans to be modeled.

The other obvious resemblance is Stuart Armstrong's work (2019), which has a lot of the same philosophical underpinnings, but where I'm more interested in the learning part of the plan.

Although I put a lot of weight on the importance of higher-order preferences, in terms of learning them (though not in terms of expressing them without looping through the environment) they're not all that different from the rest of human preferences. Therefore there's a lot of similarity between what experiments I want to see and what's being done in terms of preference learning at Allen Institute for AI, Anthropic, or Redwood Research.

This post was also influenced by John Wentworth's recent post in the same research-plan genre.

1 comment
  • What judgments about which of our revealed preferences are biases and which are "true" preferences should we try to put into the initial conditions of a value learning process?
  • How should we elicit higher-order preferences from human models, and how should we translate them into changes in the parameters of the value learning procedure?

I realize being able to infer preferences from human actions could be a really useful ability. I'm curious how important you think those are vs. direct conversations with AI about values? (Or do you consider conversations simply a subset of the data that an AI could infer preferences from?)

It occurred to me that since language models are getting pretty good now, there is interesting opportunity now for starting to experiment with such human-and-AI conversations about values. This thought led me to have the following dialogue with the OpenAI Codex just now:


Here are some things I value:

- I like exercise
- I like getting to work / do something productive during the day
- I like going for a walk at the end of the day
- I like having dinner with my wife in the evening
- I like getting enough sleep

What are my values?


- I like staying active and being productive
- I like spending time with my wife and family
- I like getting a good night's sleep


Ok that's pretty good. But I remembered I also like watching a show before bed. So what are my values now?


- I like staying active, being productive, and spending time with my wife and family
- I like getting a good night's sleep


But you forgot the part about watching a show before bed...


- I like staying active, being productive, spending time with my wife and family, and watching a show before bed


Ok that sounds good

Personally I feel more comfortable with the idea of an AGI that I could stop at any time to have a conversation to clarify what my values are like this. I imagine a lot could be clarified about higher-order preferences this way too, although sometimes the AI might need to read between the lines (e.g. "I know before I said I want to quit smoking but now I really want a cigarette!").

Do you agree that this is a mode of operation we should be aiming for? Or is the issue perhaps that this seems reasonable for toy examples like the above, but we need to worry about a future where the AGI is dealing with issues so sophisticated or removed from our everyday experience that we won't know how to give it guidance?

Overall I found your research plan clear and easy to follow. It was a bit long (as it probably should be), so it took me a while to find the time to read it thoroughly. I've always been attracted to the value learning / CIRL / etc. general direction of research and find it intuitively compelling, so I'm excited to see what comes of your work here.