Daniel Herrmann

Still no Lie Detector for LLMs

Background

This post is a short version of a paper we wrote that you can find here. You can read this post to get the core ideas. You can read the paper to go a little deeper. The paper is about probing decoder-only LLMs for their beliefs, using either unsupervised methods (like CCS from Burns) or supervised methods. We give both philosophical/conceptual reasons we are pessimistic and demonstrate some empirical failings using LLaMA 30b. By way of background, we’re both philosophers, not ML people, but the paper is aimed at both audiences.

Introduction

One child says to the other “Wow! After reading some text, the AI understands what water is!”… The second child says “All it understands is relationships between words. None of the words connect to reality. It doesn’t have any internal concept of what water looks like or how it feels to be wet. …” …

Two angels are watching [some] chemists argue with each other. The first angel says “Wow! After seeing the relationship between the sensory and atomic-scale worlds, these chemists have realized that there are levels of understanding humans are incapable of accessing.” The second angel says “They haven’t truly realized it. They’re just abstracting over levels of relationship between the physical world and their internal thought-forms in a mechanical way. They have no concept of [$!&&!@] or [#@&#**]. You can’t even express it in their language!”

--- Scott Alexander, Meaningful

Do large language models (LLMs) have beliefs? And, if they do, how might we measure them? These questions are relevant as one important problem that plagues current LLMs is their tendency to generate falsehoods with great conviction. This is sometimes called lying and sometimes called hallucinating. One strategy for addressing this problem is to find a way to read the beliefs of an LLM directly off its internal state. Such a strategy falls under the broad umbrella of model interpretability, but we can t

Jul 18, 2023 · 50

Daniel Herrmann is a decision theorist, formal epistemologist, and philosopher of AI. He is currently an assistant professor at UNC Chapel Hill.

He was a PIBBSS fellow. He runs the not-quite-active philosophy blog The Crow's Nest, where he writes for both a philosophical and a broad...

Subjective Naturalism in Decision Theory: Savage vs. Jeffrey–Bolker

Summary: This post outlines how a view we call subjective naturalism[1] poses challenges to classical Savage-style decision theory. Subjective naturalism requires (i) richness (the ability to represent all propositions the agent can entertain, including self-referential ones) and (ii) austerity (excluding events the agent deems impossible). It is one way of...

Feb 4, 2025 · 45

Still no Lie Detector for LLMs

Jul 18, 2023 · 50

Bridging Expected Utility Maximization and Optimization

Background This is the second of our (Ramana, Abram, Josiah, Daniel) posts on our PIBBSS research. Our previous post outlined five potential projects that we were considering pursuing this summer. Our task since then has been to make initial attempts at each project. These initial attempts help us to clarify...

Aug 5, 2022 · 25

Formal Philosophy and Alignment Possible Projects

Context We (Ramana, Abram, Josiah, Daniel) are working together as part of PIBBSS this summer. The goal of the PIBBSS fellowship program is to bring researchers in alignment (in our case, Ramana and Abram) together with researchers from other relevant fields (in our case, Josiah and Daniel, who are both...

Jun 30, 2022 · 34

Choosing to Choose?

This is related to wireheading, utility functions, taste, and rationality. It is a series of puzzles meant to draw attention to certain tensions in notions of rationality and utility functions for embedded agents. I often skip the particulars of embedded situations. I often show multiple sides of an argument without...

Jul 10, 2018 · 10