Makes perfect sense, thanks!

"Well, what if I take the variables that I'm given in a Pearlian problem and I just forget that structure? I can just take the product of all of these variables that I'm given, and consider the space of all partitions on that product of variables that I'm given; and each one of those partitions will be its own variable.

How can a partition be a variable? Should it be "part" instead?

29d

Partitions (of some underlying set) can be thought of as variables like this:
* The number of values the variable can take on is the number of parts in the
partition.
* Every element of the underlying set has some value for the variable, namely,
the part that that element is in.
Another way of looking at it: say we're thinking of a variable v: S → D as a function
from the underlying set S to v's domain D. Then we can equivalently think of v as the
partition {{s ∈ S ∣ v(s) = d} ∣ d ∈ D} ∖ {∅} of S, with (up to) |D| parts.
In what you quoted, we construct the underlying set by taking all possible
combinations of values for the "original" variables. Then we take all partitions
of that to produce all "possible" variables on that set, which will include the
original ones and many more.
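A minimal sketch of this correspondence in code (the function names here are my own, purely illustrative):

```python
# A variable v: S -> D corresponds to the partition of S whose parts are the
# preimages of each value d in D; empty parts never arise by construction.
def variable_to_partition(S, v):
    parts = {}
    for s in S:
        parts.setdefault(v(s), set()).add(s)
    return {frozenset(p) for p in parts.values()}

def partition_to_variable(partition):
    # going back the other way: the "value" of s is simply the part containing it
    lookup = {s: part for part in partition for s in part}
    return lambda s: lookup[s]

S = {0, 1, 2, 3, 4, 5}
parity = variable_to_partition(S, lambda s: s % 2)
# parity == {frozenset({0, 2, 4}), frozenset({1, 3, 5})}
v = partition_to_variable(parity)
# v(3) is the part containing 3, i.e. frozenset({1, 3, 5})
```

Enumerating all partitions of the product of the original variables' value-combinations then yields every "possible" variable on that set, as the quoted passage describes.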

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

**ETA: Koen recommends reading *Counterfactual Planning in AGI Systems* first (or instead of) *Corrigibility with Utility Preservation*.**

Update: I started reading your paper "Corrigibility with Utility Preservation".^{[1]} My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".

Quick thoughts after reading less than half:

AFAICT,^{[2]} this is a mathematica...

29d

Corrigibility with Utility Preservation is not the paper I would recommend you
read first, see my comments included in the list I just posted.
To comment on your quick thoughts:
* My later papers spell out the ML analog of the solution in 'Corrigibility
with Utility Preservation' more clearly.
* On your question of Do you have an account of why MIRI's supposed
impossibility results (I think these exist?) are false?: Given how
re-tellings in the blogosphere work to distort information into more extreme
viewpoints, I am not surprised you believe these impossibility results of
MIRI exist, but MIRI does not have any actual mathematically proven
impossibility results about corrigibility. The corrigibility paper proves
that one approach did not work, but does not prove anything for other
approaches. What they have is that 2022 Yudkowsky is on record expressing
strongly held beliefs that corrigibility is very very hard, and (if I recall
correctly) even saying that nobody has made any progress on it in the last
ten years. Not everybody on this site shares these beliefs. If you formalise
corrigibility in a certain way, by formalising it as producing a full 100%
safety, no 99.999% allowed, it is trivial to prove that a corrigible AI
formalised that way can never provably exist, because the humans who will
have to build, train, and prove it are fallible. Roman Yampolskiy has done
some writing about this, but I do not believe that this kind of reasoning is
at the core of Yudkowsky's arguments for pessimism.
* On being misleadingly optimistic in my statement that the technical problems
are mostly solved: as long as we do not have an actual AGI in real life, we
can only ever speculate about how difficult it will be to make it corrigible
in real life. This speculation can then lead to optimistic or pessimistic
conclusions. Late-stage Yudkowsky is of course well-known for speculating
that everybody who shows some opt

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

To be more specific about the technical problem being mostly solved: there are a bunch of papers outlining corrigibility methods that are backed up by actual mathematical correctness proofs

Can you link these papers here? No need to write anything, just links.

49d

OK, below I will provide links to a few mathematically precise papers about AGI
corrigibility solutions, with some comments. I do not have enough time to write
short comments, so I wrote longer ones.
This list of links below is not a complete literature overview. I did a
comprehensive literature search on corrigibility back in 2019 trying to find all
mathematical papers of interest, but have not done so since.
I wrote some of the papers below, and have read all the rest of them. I am not
linking to any papers I heard about but did not read (yet).
Math-based work on corrigibility solutions typically starts with formalizing
corrigibility, or a sub-component of corrigibility, as a mathematical property
we want an agent to have. It then constructs such an agent with enough detail to
show that this property is indeed correctly there, or at least there during some
part of the agent lifetime, or there under some boundary assumptions.
Not all of the papers below have actual mathematical proofs in them, some of
them show correctness by construction. Correctness by construction is superior
to needing separate proofs: if you have correctness by construction, your
notation will usually be much more revealing about what is really going on than
if you need proofs.
Here is the list, with the bold headings describing different approaches to
corrigibility.
Indifference to being switched off, or to reward function updates
Motivated Value Selection for Artificial Agents
[https://www.fhi.ox.ac.uk/wp-content/uploads/2015/03/Armstrong_AAAI_2015_Motivated_Value_Selection.pdf]
introduces Armstrong's indifference methods for creating corrigibility. It has
some proofs, but does not completely work out the math of the solution to a
this-is-how-to-implement-it level.
Corrigibility [https://intelligence.org/files/Corrigibility.pdf] tried to work
out the how-to-implement-it details of the paper above but famously failed to do
so, and has proofs showing that it failed to do so. This paper som

110d

ETA: Koen recommends reading Counterfactual Planning in AGI Systems
[https://arxiv.org/abs/2102.00834] first (or instead of) Corrigibility with
Utility Preservation
[https://www.alignmentforum.org/posts/3uHgw2uW6BtR74yhQ/new-paper-corrigibility-with-utility-preservation].
Update: I started reading your paper "Corrigibility with Utility Preservation
[https://www.alignmentforum.org/posts/3uHgw2uW6BtR74yhQ/new-paper-corrigibility-with-utility-preservation]".[1]
My guess is that readers strapped for time should read {abstract, section 2,
section 4} then skip to section 6. AFAICT, section 5 is just setting up the
standard utility-maximization framework and defining "superintelligent" as
"optimal utility maximizer".
Quick thoughts after reading less than half:
AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem,
and not a solution to corrigibility in real systems. Nonetheless, it's a big
deal if you have in fact solved the utility-function-land version which MIRI
failed to solve.[3] Looking to applicability, it may be helpful for you to
spell out the ML analog to your solution (or point us to the relevant section
in the paper if it exists). In my view, the hard part of the alignment problem
is deeply tied up with the complexities of the {training procedure --> model}
map, and a nice theoretical utility function is neither sufficient nor
strictly necessary for alignment (though it could still be useful).
So looking at your claim that "the technical problem [is] mostly solved", this
may or may not be true for the narrow sense (like "corrigibility as a
theoretical outer-objective problem in formally-specified environments"), but
seems false and misleading for the broader practical sense ("knowing how to
make an AGI corrigible in real life").[4]
Less important, but I wonder if the authors of Soares et al. agree with your
remark in this excerpt[5]:
"In particular, [

- Try to improve my evaluation process so that I can afford to do wider searches without taking excessive risk.

Improve it with respect to what?

My attempt at a framework where "improving one's own evaluator" and "believing in adversarial examples to one's own evaluator" make sense:

- The agent's allegiance is to some idealized utility function (like CEV). The agent's internal evaluator is "trying" to approximate it by reasoning heuristically. So now we ask Eval to evaluate the plan "do argmax w.r.t

38d

Vivek -- I replied to your comment in appendix C of today's follow-up post,
Alignment allows imperfect decision-influences and doesn't require robust
grading
[https://www.lesswrong.com/posts/rauMEna2ddf26BqiE/alignment-allows-imperfect-decision-influences-and-doesn-t]
.

48d

The way you write this (especially the last sentence) makes me think that you
see this attempt as being close to the only one that makes sense to you atm.
Which makes me curious:
* Do you think that you are internally trying to approximate your own U_ideal?
* Do you think that you have ever made the decision (either implicitly or
explicitly) to not eval all or most plans because you don't trust your
ability to do so for adversarial examples (as opposed to tractability issues
for example)?
* Can you think of concrete instances where you improved your own Eval?
* Can you think of concrete instances where you thought you improved your own
Eval but then regretted it later?
* Do you think that your own changes to your Eval have been moving in the
direction of your U_ideal?

89d

This is tempting, but the problem is that I don't know what my idealized utility
function is (e.g., I don't have a specification for CEV that I think would be
safe or ideal to optimize for), so what does it mean to try to approximate it?
Or consider that I only read about CEV one day in a blog, so what was I doing
prior to that? Or if I was supposedly trying to approximate CEV, I can change my
mind about it if I realized that it's a bad idea, but how does that fit into the
framework?
My own framework is something like this:
* The evaluation process is some combination of gut, intuition, explicit
reasoning (e.g. cost-benefit analysis), doing philosophy, and cached answers.
* I think there are "adversarial inputs" because I've previously done things
that I later regretted, due to evaluating them highly in ways that I no
longer endorse. I can also see other people sometimes doing obviously crazy
things (which they may or may not later regret). I can see people (including
myself) being persuaded by propaganda / crazy memes, so there must be a risk
of persuading myself with my own bad ideas.
* I can try to improve my evaluation process by doing things like:
1. look for patterns in my and other people's mistakes
2. think about ethical dilemmas / try to resolve conflicts between my
evaluative subprocesses
3. do more philosophy (think/learn about ethical theories, metaethics,
decision theory, philosophy of mind, etc.)
4. talk (selectively) to other people
5. try to improve how I do explicit reasoning or philosophy

Yeah, the right column should obviously be all 20s. There must be a bug in my code^{[1]} :/

I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.

Take the following hypothesis :

If I add this into with weight , then the middle column is still near...

Now, let's consider the following modification: Each hypothesis is no longer a distribution on W, but instead a distribution on some coarser partition of W. Now is still well defined

Playing around with this a bit, I notice a curious effect (ETA: the numbers here were previously wrong, fixed now):

The reason the middle column goes to zero is that hypothesis A puts 60% on the rightmost column, and hypothesis B puts 40% on the leftmost, and neither cares about the middle column specifically.

But philosophically, what d...

211d

I think your numbers are wrong, and the right column on the output should say
20% 20% 20%.
The output actually agrees with each of the components on every event in that
component's sigma algebra. The input distributions don't actually have any
conflicting beliefs, and so of course the output chooses a distribution that
doesn't disagree with either.
I agree that the 0s are a bit unfortunate.
I think the best way to think of the type of the object you get out is not a
probability distribution on W, but what I am calling a partial probability
distribution on W. A partial probability distribution is a partial function
from 2^W → [0,1] that can be completed to a full probability distribution on W
(with some sigma algebra that is a superset of the domain of the partial
probability distribution).
I like to think of the argmax function as something that takes in a distribution
on probability distributions on W with different sigma algebras, and outputs a
partial probability distribution that is defined on the set of all events that
are in the sigma algebra of (and given positive probability by) one of the
components.
One nice thing about this definition is that it makes it so the argmax always
takes on a unique value. (proof omitted.)
This doesn't really make it that much better, but the point here is that this
framework admits that it doesn't really make much sense to ask about the
probability of the middle column. You can ask about any of the events in the
original pair of sigma algebras, and indeed, the two inputs don't disagree with
the output at all on any of these sets.
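A toy sketch of this construction (the 3×3 grid is taken from the discussion, but the two hypotheses here, one over columns and one over rows, and their numbers are my own illustrative choices):

```python
from itertools import combinations, product

W = sorted(product(range(3), range(3)))          # 3x3 underlying set

def partition_by(key):
    # partition W by the value of key(w), with parts in a deterministic order
    parts = {}
    for w in W:
        parts.setdefault(key(w), set()).add(w)
    return [frozenset(parts[k]) for k in sorted(parts)]

def events_of(dist):
    # sigma algebra generated by the partition: all unions of its blocks,
    # each event's probability being the sum of its blocks' probabilities
    blocks = list(dist)
    out = {}
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            ev = frozenset().union(*combo) if combo else frozenset()
            out[ev] = sum(dist[b] for b in combo)
    return out

# Hypothesis A is a distribution over columns, hypothesis B over rows.
A = dict(zip(partition_by(lambda w: w[1]), [0.2, 0.2, 0.6]))
B = dict(zip(partition_by(lambda w: w[0]), [0.4, 0.3, 0.3]))

partial = {}
for component in (events_of(A), events_of(B)):
    partial.update(component)   # components agree on the shared events: only ∅ and W

# partial is defined exactly on the union of the two sigma algebras; events
# outside both (e.g. a single cell) simply get no probability assigned.
```

This matches the point above: the output agrees with each component on every event in that component's sigma algebra, and questions outside the domain are not answered rather than answered with 0.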

A framing I wrote up for a debate about "alignment tax":

**"Alignment isn't solved" regimes:**
- Nobody knows how to make an AI which is {safe, general, and broadly superhuman}, with any non-astronomical amount of compute
- We know how to make an aligned AGI with 2 to 25 OOMs more compute than making an unaligned one

**"Alignment tax" regimes:**
- We can make an aligned AGI, but it requires a compute overhead in the range 1% - 100x. Furthermore, the situation remains multipolar and competitive for a while.
- The alignment tax is <0.001%, so it's not a concern.
- The leadi

If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.

What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?

`Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]`

is the probability that, for a random policy `π∈ξ`, that policy has worse utility than the policy `G*` its program dictates; in essence, how good `G`'s policies are compared to random policy selection

What prior over policies?

given `g(G|U)`, we can infer the probability that an agent `G` has a given utility function `U`, as

`Pr[U] ∝ 2^-K(U) / Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]`

where `∝` means "is proportional to" and `K(U)` is the Kolmogorov complexity of utility function `U`.

Suppose the prior over policies is max-entropy (uniform over all action seq...
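A quick Monte Carlo sketch of that supposition (the toy utility function, horizon, and agent policy are all my own stand-ins, chosen only to make the quantity `Pr π∈ξ [U(π) ≥ U(G*)]` concrete):

```python
import random

random.seed(0)
ACTIONS = [0, 1]
T = 10                        # policy = a length-T action sequence

def U(policy):                # toy utility: number of "good" (1) actions taken
    return sum(policy)

g_star = [1] * 8 + [0] * 2    # the agent's actual action sequence; U(g_star) = 8

# Max-entropy prior over policies: uniform over all length-T action sequences.
samples = 100_000
hits = sum(
    U([random.choice(ACTIONS) for _ in range(T)]) >= U(g_star)
    for _ in range(samples)
)
p_beat = hits / samples
# Exact value under this toy prior: P[Binomial(10, 1/2) >= 8] = 56/1024 ≈ 0.055
```

The smaller this probability, the more the formula above boosts `Pr[U]`: utility functions under which the agent's actual behavior is hard to match by chance are favored, traded off against their complexity `K(U)`.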

I have seen one person be surprised (I think twice in the same convo) about what progress had been made.

ETA: Our observations are compatible. It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.

All in all, I don't think my original post held up well. I guess I was excited to pump out the concept quickly, before the dust settled. Maybe this was a mistake? Usually I make the ~opposite error of never getting around to posting things.

The perspective and the computations that are presented here (which in my opinion are representative of the mathematical parts of the linked posts and of various other unnamed posts) do not use any significant facts about neural networks or their architecture.

You're correct that the written portion of the Information Loss --> Basin flatness post doesn't use any non-trivial facts about NNs. The purpose of the written portion was to explain some mathematical groundwork, which is then used for the non-trivial claim. (I did not know at the time ...

21mo

All in all, I don't think my original post held up well. I guess I was excited
to pump out the concept quickly, before the dust settled. Maybe this was a
mistake? Usually I make the ~opposite error of never getting around to posting
things.

Note that, for rational *altruists* (with nothing vastly better to do like alignment), voting can be huge on CDT grounds -- if you actually do the math for a swing state, the leverage per voter is really high. In fact, I think the logically counterfactual impact-per-voter tends to be *lower* than the impact calculated by CDT, if the election is very close.
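A back-of-the-envelope version of that CDT calculation (every number below is a hypothetical placeholder, chosen only to show the structure of the estimate, not an actual election statistic):

```python
# Expected altruistic value of one swing-state vote, CDT-style:
#   P(decisive) x number of people affected x welfare difference per person
p_decisive = 1e-7          # chance one vote flips a close swing-state election
people_affected = 3e8      # population whose welfare the outcome affects
value_per_person = 100.0   # welfare difference per person, arbitrary units

expected_value = p_decisive * people_affected * value_per_person  # ~3000 units
cost_of_voting = 50.0      # hypothetical opportunity cost of casting the vote
# even at a tiny p_decisive, the altruistic expected value dwarfs the cost
```

The point is structural: the tiny probability of decisiveness is multiplied by a stake that scales with the whole affected population, so for an altruist the product can stay large.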

Exciting work! A minor question about the paper:

Does this mean that it writes a projection of S1's positional embedding to S2's residual stream? Or is it meant to say "writing *to* the position [residual stream] of [S2]"? Or something else?

falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix

What *is* your list of problems by urgency, btw? Would be curious to know.

From this paper, "Theoretical work limited to ReLU-type activation functions, showed that in overparameterized networks, all global minima lie in a connected manifold (Freeman & Bruna, 2016; Nguyen, 2019)"

So for overparameterized nets, the answer is probably:

- There is only one solution manifold, so there *are no separate basins*. Every solution is connected.
- We can salvage the idea of "basin volume" as follows:
  - In the dimensions perpendicular to the manifold, calculate the basin cross-section using the Hessian.
  - In the dimensions parallel to the manifol

The loss is defined over all dimensions of parameter space, so is still a function of all 3 x's. You should think of it as . Its thickness in the direction is **infinite**, not zero.

Here's what a zero-determinant Hessian corresponds to:

The basin here is not lower dimensional; it is just infinite in some dimension. The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:

- Regularization / weight decay prov
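A numerical sketch of that fix (the Hessian and the weight-decay strength below are my own toy choices): under a quadratic approximation L(θ) ≈ ½ θᵀHθ, the volume of a sublevel set scales like ∏ᵢ 1/√λᵢ over the Hessian eigenvalues, which diverges when some λᵢ = 0; L2 regularization with strength α shifts every eigenvalue to λᵢ + α, replacing the infinity with a large finite value.

```python
import numpy as np

H = np.diag([4.0, 1.0, 0.0])   # zero-determinant Hessian: one flat direction
alpha = 0.01                    # hypothetical weight-decay strength

eigs = np.linalg.eigvalsh(H)    # [0.0, 1.0, 4.0]
# basin "volume" factor with regularized eigenvalues: finite, but large,
# dominated by the nearly-flat direction's 1/sqrt(alpha) contribution
volume_factor = float(np.prod(1.0 / np.sqrt(eigs + alpha)))
# 1 / sqrt(0.01 * 1.01 * 4.01) ≈ 4.97
```

This is the "fairly principled" move gestured at above: the regularizer itself, not an arbitrary cutoff, supplies the large-but-finite thickness of the flat direction.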

I will split this into a math reply, and a reply about the big picture / info loss interpretation.

Math reply:

Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.

I agree that assuming MSE, and looking at a local minimum, you have

This is still false! *Edit: I am now confused, I don't know if it is false or not.*

You are conflating and . Adding disa...

16mo

Thanks again for the reply.
In my notation, something like ∇l or Jf are functions in and of themselves. The
function ∇l evaluates to zero at local minima of l.
In my notation, there isn't any such thing as ∇fl.
But look, I think that this is perhaps getting a little too bogged down for me
to want to try to neatly resolve in the comment section, and I expect to be away
from work for the next few days so may not check back for a while. Personally, I
would just recommend going back and slowly going through the mathematical
details again, checking every step at the lowest level of detail that you can
and using the notation that makes most sense to you.

Thanks for this reply, its quite helpful.

I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

Ah nice, didn't know what it was called / what field it's from. I should clarify that "key result" here just meant "key result of the math so far -- pay attention", not "key result of the whole post" or "profound/original".

...The Jacobian matrix is what you call

26mo

Thanks for the substantive reply.
First some more specific/detailed comments: Regarding the relationship with the
loss and with the Hessian of the loss, my concern sort of stems from the fact
that the domains/codomains are different and so I think it deserves to be
spelled out. The loss of a model with parameters θ ∈ Θ can be described by
introducing the actual function that maps the behavior to the real numbers,
right? i.e. given some actual function l: O^k → R we have:

L: Θ --f--> O^k --l--> R

i.e. it's l that might be something like MSE, but the function L is of course
more mysterious because it includes the way that parameters are actually mapped
to a working model. Anyway, to perform some computations with this, we are
looking at an expression like

L(θ) = l(f(θ))

We want to differentiate this twice with respect to θ essentially. Firstly, we
have

∇L(θ) = ∇l(f(θ)) J_f(θ)

where - just to keep track of this - we've got:

(1×N) vector = [(1×k) vector][(k×N) matrix]

Or, using 'coordinates' to make it explicit:

∂L(θ)/∂θ_i = ∇l(f(θ)) · ∂f/∂θ_i = Σ_{p=1}^{k} ∇_p l(f(θ)) · ∂f_p/∂θ_i

for i = 1,…,N. Then for j = 1,…,N we differentiate again:

∂²L(θ)/∂θ_j∂θ_i = Σ_{p=1}^{k} Σ_{q=1}^{k} ∇_q ∇_p l(f(θ)) (∂f_q/∂θ_j)(∂f_p/∂θ_i) + Σ_{p=1}^{k} ∇_p l(f(θ)) ∂²f_p/∂θ_j∂θ_i

Or,

Hess(L)(θ) = J_f(θ)^T [Hess(l)(f(θ))] J_f(θ) + ∇l(f(θ)) D²f(θ)

This is now at the level of (N×N) matrices. Avoiding getting into any depth
about tensors and indices, the D²f term is basically a (N×N×k) tensor-type
object and it's paired with ∇l which is a (1×k) vector to give something that
is (N×N).
So what I think you are saying now is that if we are at a local minimum for l,
then the second term on the right-hand side vanishes (because the term includes
the first derivatives of l, which are zero at a minimum). You can see however
that if the Hessian of l is not a multiple of the identity (like it would be
for MSE), then the claimed relationship does not hold, i.e. it is not the case
that in general, at a minima of l, the Hessian of the loss is equal to a
constant times (J_f)^T J_f. So maybe you really do want to explicitly assume
something like MSE.
I ag
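A finite-difference check of the chain-rule identity Hess(L) = J_f^T Hess(l) J_f + Σ_p (∇l)_p Hess(f_p) (the toy behavior map f, the MSE-type l, and the evaluation point below are all my own choices, not from the post under discussion):

```python
import numpy as np

def f(th):                      # toy behavior map f: R^2 -> R^3
    x, y = th
    return np.array([x * y, x + y ** 2, np.sin(x)])

target = np.array([0.3, 1.1, 0.2])

def l(o):                       # MSE-type loss on behavior space; Hess(l) = I
    return 0.5 * np.sum((o - target) ** 2)

def L(th):
    return l(f(th))

def hess(fn, th, eps=1e-4):     # central-difference Hessian of a scalar fn
    n = len(th)
    H, I = np.zeros((n, n)), np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (fn(th + eps * (I[i] + I[j])) - fn(th + eps * (I[i] - I[j]))
                       - fn(th - eps * (I[i] - I[j])) + fn(th - eps * (I[i] + I[j]))) / (4 * eps ** 2)
    return H

def jac(th, eps=1e-6):          # central-difference Jacobian of f, shape (k, N)
    n, k = len(th), len(f(th))
    J, I = np.zeros((k, n)), np.eye(n)
    for i in range(n):
        J[:, i] = (f(th + eps * I[i]) - f(th - eps * I[i])) / (2 * eps)
    return J

th = np.array([0.7, -0.3])      # a generic point (not a minimum of l)
J = jac(th)
grad_l = f(th) - target                        # ∇l at f(θ) for this l
term1 = J.T @ np.eye(3) @ J                    # J_f^T Hess(l) J_f
term2 = sum(grad_l[p] * hess(lambda t, p=p: f(t)[p], th) for p in range(3))
assert np.allclose(hess(L, th), term1 + term2, atol=1e-4)
```

At a point where ∇l = 0 the second term drops out, recovering the Hess(L) ≈ J_f^T J_f relationship (up to the Hess(l) = I assumption noted above).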

I'm pretty sure my framework doesn't apply to grokking. I usually think about training as ending once we hit zero training loss, whereas grokking happens much later.

I'll reply to the rest of your comment later today when I have some time

About the contours: While the graphic shows a finite number of contours with some spacing, in reality there are infinite contour planes and they completely fill space (as densely as the reals, if we ignore float precision). So at literally every point in space there is a blue contour, and a red one which exactly coincides with it.

16mo

I'll reply to the rest of your comment later today when I have some time

Yup, seems correct.

Yeah, this seems roughly correct, and similar to what I was thinking. There is probably even a direct connection to the "asymptotic equipartitioning" math, via manifold counts containing terms like "A choose B" from permutations of neurons.

Yep, I am assuming MSE loss generally, but as you point out, any smooth and convex loss function will be locally approximately quadratic. "Saddle points all the way down" isn't possible if a global min exists, since a saddle point implies the existence of an adjacent lower point. As for asymptotes, this is indeed possible, especially in classification tasks. I have basically ignored this and stuck to regression here.

I might return to the issue of classification / solutions at infinity in a later post, but for now I will say this: &...

Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?

Oh, this is definitely not what I meant.

"Betting odds" == Your actual belief after factoring in other people's opinions

"Inside view" == What your models predict, before factoring in other opinions or the possibility of being completely wrong

Though I understood what you meant, perhaps a clearer terminology is all-things-considered beliefs vs. independent impressions.

Checked; the answer is no: https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models?commentId=ET24eiKK6FSJNef7G

Nice! Do you know if the author of that post was involved in RASP?

19mo

Checked; the answer is no:
https://www.lesswrong.com/posts/Lq6jo5j9ty4sezT7r/teaser-hard-coding-transformer-models?commentId=ET24eiKK6FSJNef7G

29mo

They apparently reinvented RASP independently.

What probability do you assign to the proposition "Prosaic alignment will fail"?

- Purely based on your inside view model
- After updating on everyone else's views

Same question for:

"More than 50% of the prosaic alignment work done by the top 7 researchers is nearly useless"

Is a metal bar an optimizer? Looking at the temperature distribution, there is a clear set of target states (states of uniform temperature) with a much larger basin of attraction (all temperature distributions that don't vaporize the bar).

I suppose we could consider the second law of thermodynamics to be the true optimizer in this case. The consequence is that any* closed physical system is trivially an optimizing system towards higher entropy.

In general, it seems like this optimization criterion is very easy to satisfy if we don't specify what...
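A toy simulation of that reading (my own construction; periodic boundary conditions for simplicity): discrete heat diffusion drives any initial temperature profile toward the uniform profile at the conserved mean, illustrating the large basin of attraction around the "target set" of uniform states.

```python
import numpy as np

bar = np.array([100.0, 0.0, 0.0, 0.0, 50.0])   # arbitrary initial temperatures
for _ in range(2000):
    # explicit diffusion step on a ring; total heat is conserved
    bar = bar + 0.2 * (np.roll(bar, 1) + np.roll(bar, -1) - 2 * bar)
# bar is now (numerically) uniform at the conserved mean temperature, 30.0
```

Any non-vaporizing initial profile with the same total heat lands on the same uniform state, which is what makes the bar look like an "optimizing system" under the definition being probed.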

This read to me like you were saying "In Vivek's framework, value shards explain away .." and I was confused. I now think you mean "My take on Vivek's is that value s...