AI ALIGNMENT FORUM
AF

Wikitags

AI Evaluations

Edited by Raymond Arnold, duck_master last updated 1st Aug 2023

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)

Behavioral evaluations assess a model's abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer's ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model's behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

  • developing a method-agnostic standard to demonstrate sufficient understanding of a model
  • ensuring that the level of understanding is adequate to catch dangerous failure modes
  • finding the right balance between behavioral and understanding-based evaluations.

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

See also:

AI Risk
Subscribe
1
Subscribe
1
Interpretability (ML & AI)
Discussion0
Discussion0
Posts tagged AI Evaluations
75When can we trust model evaluations?
Evan Hubinger
2y
8
42The case for more ambitious language model evals
Arun Jose
1y
9
89Announcing Apollo Research
Marius Hobbhahn, Beren Millidge, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer
2y
4
66Thoughts on sharing information about language model capabilities
Paul Christiano
2y
20
73Towards understanding-based safety evaluations
Evan Hubinger
2y
7
31Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses
Alex Turner
6mo
0
122Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Evan Hubinger, Nicholas Schiefer, Carson Denison, Ethan Perez
2y
14
11Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Soroush Pour, rusheb, Quentin Feuillade--Montixi, Arush Tagade, Stephen Casper
2y
1
129What’s the short timeline plan?
Marius Hobbhahn
6mo
14
84More information about the dangerous capability evaluations we did with GPT-4 and Claude.
Beth Barnes
2y
13
97Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack, Alex Turner
1y
20
89Frontier Models are Capable of In-context Scheming
Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer, Mikita Balesni
7mo
9
80AI companies' eval reports mostly don't support their claims
Zach Stein-Perlman
1mo
2
71Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn
4mo
1
70ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks
Beth Barnes
2y
4
Load More (15/77)
Add Posts