AI ALIGNMENT FORUM

Nandi — AI Alignment Forum

232

Ω

11

3

3

6y

Machine Unlearning Evaluations as Interpretability Benchmarks

by NickyP and Nandi

[Image: Interpreting Models by Ablation. Image generated by DALL-E 3.]

Introduction: Interpretability in machine learning, especially in language models, is an area with a large number of contributions. While this can be quite useful for improving our understanding of models, one issue is that there is a lack of robust benchmarks...

Oct 23, 2023 • 33

Acknowledging Human Preference Types to Support Value Learning

We analyze the usefulness of the framework of preference types [Berridge et al. 2009] for value learning by an artificial intelligence. In the context of AI, the purpose of value learning is to give an AI goals aligned with humanity. We will lay the groundwork for establishing how human preferences of...

Nov 13, 2018 • 34