Vivek Hebbar

Agreed. To give a concrete toy example: Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}. If the prior is {50% Luigi, 50% Waluigi}, each "A" output is a 2:1 update towards Luigi. The probability of "B" keeps dropping, and the probability of *ever* seeing a "B" asymptotes to 50% (as it must).
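
A minimal sketch of that arithmetic in Python, assuming exactly the toy setup above (Luigi always emits "A", Waluigi emits "A" or "B" with equal probability, 50/50 prior). The function names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def posterior_luigi(n_as, prior=Fraction(1, 2)):
    """Posterior probability of Luigi after observing n_as consecutive "A"s."""
    # Likelihood of n "A"s in a row: 1 under Luigi, (1/2)^n under Waluigi.
    like_luigi = Fraction(1)
    like_waluigi = Fraction(1, 2) ** n_as
    numerator = prior * like_luigi
    return numerator / (numerator + (1 - prior) * like_waluigi)

def prob_next_b(n_as, prior=Fraction(1, 2)):
    """Predictive probability that the next token is "B" after n_as "A"s."""
    # Only Waluigi emits "B", and does so half the time.
    return (1 - posterior_luigi(n_as, prior)) * Fraction(1, 2)

def prob_b_within(n_tokens, prior=Fraction(1, 2)):
    """Probability that at least one "B" appears in the first n_tokens."""
    # Waluigi avoids "B" for n tokens with probability (1/2)^n.
    return (1 - prior) * (1 - Fraction(1, 2) ** n_tokens)
```

Under these assumptions the odds for Luigi after n "A"s are 2^n : 1, the next-token probability of "B" shrinks toward zero, and `prob_b_within(n)` equals (1/2)(1 - 2^-n), which approaches but never exceeds the 50% prior mass on Waluigi.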

This is the case for *perfect* predictors, but there could be some argument about *particular kinds of imperfect predictors* which supports the claim in the post.

In section 3.7 of the paper, it seems like the descriptions ("6 in 5", etc) are inconsistent across the image, the caption, and the paragraph before them. What are the correct labels? (And maybe fix the paper if these are typos?)

In ML terms, nearly-all the informational work of learning what “apple” means must be performed by unsupervised learning, not supervised learning. Otherwise the number of examples required would be far too large to match toddlers’ actual performance.

I'd guess the vast majority of the work (relative to the max-entropy baseline) is done by the *inductive bias.*

As I understand Vivek's framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. "If `food nearby` and `hunger>15`, then be more likely to `go to food`") and then *also* a sophisticated utility function (e.g. something like CEV). It's shards all the way down, and all the way up.^{[10]}

This read to me like you were saying "In Vivek's framework, value shards explain away ..." and I was confused. I now think you mean "My take on Vivek's framework is that value shards explain away ...". Maybe reword for clarity?

(Might have a substantive reply later)

Makes perfect sense, thanks!

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable? Should it be "part" instead?

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

**ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation**

Update: I started reading your paper "Corrigibility with Utility Preservation".^{[1]} My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,^{[2]} this is a mathematical solution to corrigibility in a toy problem, and *not* a solution to corrigibility in real systems. Nonetheless, it's a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.^{[3]} Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists). In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).

So looking at your claim that "the technical problem [is] mostly solved", this may or may not be true for the narrow sense (like "corrigibility as a theoretical outer-objective problem in formally-specified environments"), but seems false and misleading for the broader practical sense ("knowing how to make an AGI corrigible in real life").^{[4]}

Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt^{[5]}:

"In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent's decision procedure] to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model."

^{[1]} Btw, your writing is admirably concrete and clear.

Errata: Subscripts seem to be broken on page 9, which significantly hurts readability of the equations. Also there is a double-typo "I this paper, we the running example of a toy universe" on page 4.

^{[2]} Assuming the idea is correct

^{[3]} Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?

^{[4]} I'm not necessarily accusing you of any error (if the contest is fixated on the utility function version), but it was misleading to me as someone who read your comment but not the contest details.

^{[5]} Portions in [brackets] are insertions/replacements by me

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here? No need to write anything, just links.

- Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

- The agent's allegiance is to some idealized utility function U (like CEV). The agent's internal evaluator Eval is "trying" to approximate U by reasoning heuristically. So now we ask Eval to evaluate the plan "do argmax w.r.t. Eval over a bunch of plans". Eval reasons that, due to the way that Eval works, there should exist "adversarial examples" that score very highly on Eval but low on U. Hence, Eval concludes that U(plan) is low, where plan = "do argmax w.r.t. Eval". So the agent doesn't execute the plan "search widely and argmax".
- "Improving Eval" makes sense because Eval will gladly replace itself with Eval' if it believes that Eval' is a better approximation for U (and hence replacing itself will cause the outcome to score better on U)

Are there other distinct frameworks which make sense here? I look forward to seeing what design Alex proposes for "value child".
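
A toy numerical sketch of the first bullet, under strong assumptions that are mine and not part of the framework: each plan's true value (standing in for the idealized utility function) is drawn from a standard normal, and Eval sees that value plus independent Gaussian noise. This uses random evaluation error (the optimizer's curse) as a stand-in for adversarial examples; all names are hypothetical.

```python
import random

def argmax_gap(n_plans, n_trials=2000, noise=1.0, seed=0):
    """Average (Eval score - true value) of the plan picked by argmax over Eval."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_trials):
        best_score, best_true = float("-inf"), 0.0
        for _ in range(n_plans):
            true_value = rng.gauss(0.0, 1.0)            # assumed idealized utility
            score = true_value + rng.gauss(0.0, noise)  # assumed noisy Eval
            if score > best_score:
                best_score, best_true = score, true_value
        total_gap += best_score - best_true
    return total_gap / n_trials

narrow = argmax_gap(n_plans=2)    # argmax over a small search
wide = argmax_gap(n_plans=200)    # argmax over a wide search
```

In this sketch the wide search overestimates its chosen plan by more than the narrow search does, which is the pattern Eval would be anticipating when it rates "search widely and argmax" poorly. Whether real evaluators fail in this way is not something the simulation can show.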

Any idea why "cheese Euclidean distance to top-right corner" is so important? It's surprising to me because the convolutional layers should apply the same filter everywhere.
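
A small sketch of the premise behind this question, assuming a purely convolutional layer with circular padding (a simplification, not the network under discussion): applying the same filter everywhere makes the output shift along with the input. The helper names are hypothetical.

```python
def circular_conv(signal, kernel):
    """1-D circular cross-correlation: the same kernel is applied at every position."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]

def shift(xs, k):
    """Circularly shift a list right by k positions."""
    k %= len(xs)
    return xs[-k:] + xs[:-k]

signal = [0, 0, 1, 3, 0, 0, 0, 0]
kernel = [1, -2, 1]

# Shifting the input and then filtering matches filtering and then shifting.
shift_then_conv = circular_conv(shift(signal, 3), kernel)
conv_then_shift = shift(circular_conv(signal, kernel), 3)
```

Here the two results are identical, so such a layer on its own carries no information about absolute position. Components like zero padding or fully connected layers do not share this property, which is one place position-dependent features could come from.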