abramdemski — AI Alignment Forum

Coherent Care

I've been trying to gather my thoughts for my next tiling theorem (agenda write-up here; first paper; second paper; recent project update). I have a lot of ideas for how to improve upon my work so far, and trying to narrow them down to an achievable next step has been...

Feb 27 · 41

Condensation

Condensation: a theory of concepts is a model of concept-formation by Sam Eisenstat. Its goals and methods resemble John Wentworth's natural abstractions/natural latents research.[1] Both theories seek to provide a clear picture of how to posit latent variables, such that once someone has understood the theory, they'll say "yep, I...

Nov 9, 2025 · 156

Myopia Mythology

It's been a while since I wrote about myopia! My previous posts about myopia were "a little crazy", because it's not this solid well-defined thing; it's a cluster of things which we're trying to form into a research program. This post will be "more crazy".

The Good/Evil/Good Spectrum

"Good" means...

Nov 8, 2025 · 38

Comparing Payor & Löb

Löb's Theorem:
* If ⊢□x→x, then ⊢x.
* Or, as one formula: □(□x→x)→□x.

Payor's Lemma:
* If ⊢□(□x→x)→x, then ⊢x.
* Or, as one formula: □(□(□x→x)→x)→□x.

In the following discussion, I'll say "reality" to mean x, "belief" to mean □x, "reliability" to mean □x→x (ie, belief is reliable when belief...

Nov 8, 2025 · 54

Geometric UDT

This post owes credit to discussions with Caspar Oesterheld, Scott Garrabrant, Sahil, Daniel Kokotajlo, Martín Soto, Chi Nguyen, Lukas Finnveden, Vivek Hebbar, Mikhail Samin, and Diffractor. Inspired in particular by a discussion with cousin_it. Short version: The idea here is to combine the Nash bargaining solution with Wei Dai's UDT...

Nov 6, 2025 · 29

Weak-To-Strong Generalization

I will be discussing weak-to-strong generalization with Sahil on Monday, November 3rd, 2025, 11am Pacific Daylight Time. You can join the discussion with this link. Weak-to-strong generalization is an approach to alignment (and capabilities) which seeks to address the scarcity of human feedback by using a weak model to teach...

Nov 2, 2025 · 32

What, if not agency?

Sahil has been up to things. Unfortunately, I've seen people put effort into trying to understand and still bounce off. I recently talked to someone who tried to understand Sahil's project(s) several times and still failed. They asked me for my take, and they thought my explanation was far easier...

Sep 15, 2025 · 163

Abram Demski

The Parable of Predict-O-Matic

Alignment Research Field Guide

Embedded Agents

Modern Transformers are AGI, and Human-Level