Vivek Hebbar

[Link] A minimal viable product for alignment

Is the claim here that the 2^200 "persuasive ideas" would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)?  Or do you just mean that they would look promising in a shorter evaluation done for training purposes?

When people ask for your P(doom), do you give them your inside view or your betting odds?

Oh, this is definitely not what I meant.

"Betting odds" == Your actual belief after factoring in other people's opinions

"Inside view" == What your models predict, before factoring in other opinions or the possibility of being completely wrong

Transformer inductive biases & RASP

Checked; the answer is no:

Transformer inductive biases & RASP

Nice!  Do you know if the author of that post was involved in RASP?

Discussion with Eliezer Yudkowsky on AGI interventions

What probability do you assign to the proposition "Prosaic alignment will fail"?

  1. Purely based on your inside view model
  2. After updating on everyone else's views

Same question for:

 "More than 50% of the prosaic alignment work done by the top 7 researchers is nearly useless"

The ground of optimization

Is a metal bar an optimizer?  Looking at the temperature distribution, there is a clear set of target states (states of uniform temperature) with a much larger basin of attraction (all temperature distributions that don't vaporize the bar).
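To make the bar example concrete, here is a minimal sketch of the convergence, under assumptions of my own: a 1D bar cut into 20 cells, insulated ends, and an explicit finite-difference update with a made-up diffusion coefficient `alpha`. The names `diffuse` and `spread` are purely illustrative. Very different starting temperature profiles all end up (numerically) uniform, while the mean temperature is conserved.

```python
import random

def diffuse(temps, alpha=0.25, steps=5000):
    """Explicit finite-difference heat diffusion on an insulated 1D bar.

    alpha is a hypothetical dimensionless diffusion coefficient; the explicit
    scheme is stable for alpha <= 0.5.
    """
    t = list(temps)
    n = len(t)
    for _ in range(steps):
        new = t[:]
        for i in range(n):
            # Insulated ends: a missing neighbour is treated as equal to the cell itself.
            left = t[i - 1] if i > 0 else t[i]
            right = t[i + 1] if i < n - 1 else t[i]
            new[i] = t[i] + alpha * (left - 2 * t[i] + right)
        t = new
    return t

def spread(temps):
    """Distance from the target set: hottest cell minus coldest cell."""
    return max(temps) - min(temps)

random.seed(0)
# Three arbitrary starting profiles drawn from the (large) basin of attraction.
for _ in range(3):
    start = [random.uniform(0, 500) for _ in range(20)]
    final = diffuse(start)
    print("initial spread:", spread(start), "final spread:", spread(final))
```

In the post's terms, `spread` measures how far a configuration is from the target set (uniform temperature), and every sampled starting profile is driven toward it.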

I suppose we could consider the second law of thermodynamics to be the true optimizer in this case.  The consequence is that any* closed physical system is trivially an optimizing system towards higher entropy.

In general, it seems like this optimization criterion is very easy to satisfy if we don't specify what exactly we care about as a meaningful aspect of the system.  Even the bottle cap 'optimizes' for trivial things like maintaining its shape (against the perturbation of elastic deformation).  

Do you think this will become a problem when using this definition for AI?  For example, we might find that a particular program incidentally tends to 'optimize' certain simple measures such as the average magnitude of network weights, or some other functions of weights, loss, policy, etc. to a set point/range.  We may then find slightly more complex things being optimized that look like sub-goals (which could in a certain context be unwanted or dangerous).  How would we know where to draw the line?  It seems like the definition would classify lots of things as optimization, and it would be up to us to decide which kinds are interesting or concerning and which ones are as trivial as the bottle cap maintaining its shape.

That being said, I really like this definition.  I just think it should be extended to classify the interestingness of a given optimization.  An AI agent which competently pursues complex goals is a much more interesting optimizer than a metal bar, even though the bar seems more robust (deleting a tiny piece of metal won't stop it from conducting; deleting a tiny piece of the AI's computer could totally disable it).

Also a nitpick on the section about whether the universe is an optimizing system:

I don't think it is correct to say that the target space is almost as big as the basin of attraction.  Either:

  • We use area to represent the number of macroscopic states -- in this case, the target space is extremely small (only one state(?) -- an ultra-low-density bath of particles with uniform temperature).  The universe is an extremely powerful optimizer from this perspective, with the caveat that it takes almost forever to achieve its target.
  • We use area to represent the number of microscopic states (as I think you intended).  In this case, I think the target space is exactly identical to the basin of attraction.  Any individual low-entropy microstate is no less likely than any individual high-entropy microstate -- there just happen to be astronomically fewer low-entropy ones.  There is no 'optimizing force' pushing the universe out of these states.  From the microstate perspective, there is no reason to exclude them from the target zone, since any small and unremarkable subset of the target space will display the property that the system tends to stumble out of it at random.

I would say that the first lens is almost always better than the second, since macro-states are what we actually care about and how we naturally divide the configuration space of a system.
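A toy count illustrates the difference between the two lenses. This is only a sketch under assumptions of my own: 100 distinguishable particles, each sitting in either the left or right half of a box, with every microstate equally likely. The macrostate is the number of particles on the left; the function name `microstate_fraction` is hypothetical.

```python
from math import comb

def microstate_fraction(n, k_lo, k_hi):
    """Fraction of all 2**n microstates whose left-half count k lies in [k_lo, k_hi]."""
    return sum(comb(n, k) for k in range(k_lo, k_hi + 1)) / 2 ** n

n = 100
# The single most ordered macrostate contains exactly one microstate.
print("all particles on the left:", microstate_fraction(n, n, n))
# Macrostates near an even split contain the overwhelming majority of microstates.
print("within 10 of an even split:", microstate_fraction(n, 40, 60))
```

Each microstate has the same probability, 1 / 2**n, so nothing singles out the ordered ones at the microscopic level; they are rare only because so few microstates map to an ordered macrostate.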

Finally, just want to say this is an amazing post!  I love the style as well as the content.  The diagrams make it really easy to get an intuitive picture.


*Unsure about the existence of exceptions (can an isolated system be contrived that fails to reach the global max for entropy?)