Adam Shimi

Full-time independent deconfusion researcher in AI Alignment. (Also PhD in the theory of distributed computing.)

If you're interested in some of the research ideas you see in my posts, know that I keep private docs with the most compressed versions of my deconfusion ideas while they're getting feedback. I can give you access if you PM me!

A list of topics I'm currently doing deconfusion on:

  • Goal-directedness for discussing AI Risk
  • Myopic Decision Theories for dealing with deception (with Evan Hubinger)
  • Universality for many alignment ideas of Paul Christiano
  • Deconfusion itself to get better at it
  • Models of Language Models to clarify the alignment issues surrounding them


Epistemic Cookbook for Alignment
Reviews for the Alignment Forum
AI Alignment Unwrapped
Deconfusing Goal-Directedness
Toying With Goal-Directedness

Wiki Contributions


P₂B: Plan to P₂B Better

Exciting! Waiting for the next posts even more, then.

P₂B: Plan to P₂B Better

Your proposed reformulation of convergent subgoals sounds interesting, but I see a big flaw in your post: you don't even state the applications you're doing the deconfusion for. And in my book, the applications are THE way of judging whether deconfusion is creating valuable knowledge. So I don't know yet if your framing will help with the sort of problems related to agency and goal-directedness that I think matter.

Reserving judgment until the follow-up posts, then.

P₂B: Plan to P₂B Better

In theory, optimal policies could be tabularly implemented. In this case, it is impossible for them to further improve their "planning."

That sounds wrong. Planning as defined in this post is sufficiently broad that acting like a planner makes you a planner. So if you unwrap a structural planner into a tabular policy, the latter would still improve its planning (for example by taking actions that instrumentally help it accomplish the goal we can best ascribe to it using the intentional stance).

Another way of framing the point, IMO, is that the OPs define planning in terms of the computation rather than the algorithm, so planning better means making the rest of the computation easier or more efficient.
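To make the "unwrapping" point concrete, here's a minimal sketch (the grid world, names, and one-step-lookahead planner are all my own illustrative assumptions, not from the posts): a structural planner and the tabular policy obtained by tabulating it have identical input-output behavior, so a behavioral definition of planning counts both as planners.

```python
# Hypothetical toy example: a "structural planner" and its unrolled
# tabular policy behave identically. The environment (positions on a
# line, actions +1/-1) is an assumption for illustration only.

GOAL = 3
STATES = range(5)  # positions 0..4 on a line

def planner_policy(state):
    """One-step lookahead: pick the action that gets closest to GOAL."""
    return min((+1, -1), key=lambda a: abs((state + a) - GOAL))

# Unroll the planner into a lookup table: same input-output behavior,
# but no planning computation left inside the policy.
tabular_policy = {s: planner_policy(s) for s in STATES}

assert all(tabular_policy[s] == planner_policy(s) for s in STATES)
```

Under the intentional stance, the tabular policy is just as goal-directed toward `GOAL` as the planner it was unrolled from, even though it is "just" a table.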

Teaching ML to answer questions honestly instead of predicting human answers

Reading this thread, I wonder if the apparent disagreement doesn't come from the use of the word "honestly". The way I understand Paul's statement of the problem, "answer questions honestly" could be replaced by "answer questions appropriately to the best of your knowledge". And his point is that "answer what a human would have answered" is not a good proxy for that (yet still an incentivized one, due to how we train neural nets).

From my reading of it, this post's proposal does provide some plausible ways to incentivize the model to actually search for appropriate answers instead of the ones a human would have given, and I don't think it assumes the existence of true categories and/or essences.

General alignment plus human values, or alignment via human values?

I'm with Steve on the idea that there's a difference between broad human preferences (something like common sense?) and particular and exact human preferences (what would be needed for ambitious value learning).

Still, you (Stuart) made me realize that I hadn't thought explicitly about this need for broad human preferences in my splitting of the problem (first be able to align, then point at what we want). It's implicit, though, because I don't care about being able to align to "anything", just to the sort of things humans might want.

Safety-capabilities tradeoff dials are inevitable in AGI

Just posted an analysis of the epistemic strategies used in this post, which helps make the reasoning more explicit IMO.

Redwood Research’s current project

Hum, I see. And is your point that it should not create a problem because you're only doing comparison X vs Y and Z vs Y (where Y is the standard policy and X and Z are two of your conservative policies) but you don't really care about the comparison between X and Z?

Redwood Research’s current project

(This “better almost 50% of the time” property is one way of trying to operationalize “we don’t want the filtered policy to be worse”. It so happens that this property is actually kind of badly behaved, but in our case it seems fine, given that we’re always going to be comparing against a fixed unfiltered distribution.)

I've read the intransitive dice page, but I'm confused on how it might apply here? Like concretely, what are the dice in the analogy?
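For my own benefit while thinking about this: here's a minimal sketch of why "better more than 50% of the time" is badly behaved in general. The die values below are the standard textbook example of intransitive dice (my choice for illustration, not from the post): each die beats the next with probability 5/9, so the "wins a majority of pairwise comparisons" relation cycles instead of giving a transitive ordering.

```python
from itertools import product

# Classic intransitive dice: A beats B, B beats C, C beats A,
# each with probability 5/9. Values chosen for illustration.
A = [2, 2, 4, 4, 9, 9]
B = [1, 1, 6, 6, 8, 8]
C = [3, 3, 5, 5, 7, 7]

def win_prob(x, y):
    """Probability that a roll of die x strictly beats a roll of die y."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return wins / (len(x) * len(y))

print(win_prob(A, B), win_prob(B, C), win_prob(C, A))  # each is 5/9
```

This is why always comparing against one fixed distribution matters: with a fixed reference die Y, "beats Y more than 50% of the time" is a well-defined property of each policy, and the cycle above can't arise.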

Selection Theorems: A Program For Understanding Agents

Just posted an analysis of the epistemic strategies underlying selection theorems and their applications. Might be interesting for people who want to go further with selection theorems, either by proving one or by critiquing one.
