David Krueger's Comments

[AN #96]: Buck and I discuss/argue about AI Alignment

Without having read the transcript either, this sounds like it's focused on near-term issues with autonomous weapons, and not meant to be a statement about the longer-term role autonomous weapons systems might play in increasing X-risk.

[AN #96]: Buck and I discuss/argue about AI Alignment
autonomous weapons are unlikely to directly contribute to existential risk

I disagree, and would've liked to see this argued for.

Perhaps the disagreement is at least somewhat about what we mean by "directly contribute".

Autonomous weapons seem like one of the areas where competition is most likely to drive actors to sacrifice existential safety for performance. This is because the stakes are extremely high, quick response time seems very valuable (meaning having a human in the loop becomes costly), and international agreements around safety seem hard to imagine without massive geopolitical changes.

Two Alternatives to Logical Counterfactuals

OK, so no "backwards causation"? (not sure if that's a technical term and/or if I'm using it right...)


Is there a word we could use instead of "linear", which to an ML person sounds like "as in linear algebra"?

Towards a mechanistic understanding of corrigibility

OK, thanks.

The TL;DR seems to be: "We only need a lower bound on the catastrophe/reasonable impact ratio, and an idea about how much utility is available for reasonable plans."

This seems good... can you confirm my understanding below is correct?

1) RE: "A lower bound": this seems good because we don't need to know how extreme catastrophes could be; we can just say: "If (e.g.) the earth or the human species ceased to exist as we know it within the year, that would be catastrophic".

2) RE: "How much utility is available": I guess we can just set a targeted level of utility gain, and it won't matter if there are plans we'd consider reasonable that would exceed that level? (e.g. "I'd be happy if we can make 50% more paperclips at the same cost in the next year.") I've tried to sketch both points more concretely below.
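
If it helps, here's a rough sketch of how I'm reading both points ($I_{\text{reasonable}}$, $I_{\text{catastrophe}}$, $B$, and $U^*$ are my notation, not yours). Suppose any catastrophe has at least $R$ times the impact of any reasonable plan:

$$ \frac{I_{\text{catastrophe}}}{I_{\text{reasonable}}} \;\ge\; R \;>\; 1. $$

Then there is an impact budget $B$ with

$$ I_{\text{reasonable}} \;\le\; B \;<\; R \cdot I_{\text{reasonable}} \;\le\; I_{\text{catastrophe}}, $$

and we can ask the agent to

$$ \text{find } \pi \text{ such that } U(\pi) \ge U^* \text{ and } I(\pi) \le B, $$

where $U^*$ is the targeted level of utility gain from point 2. Reasonable plans stay feasible, catastrophic plans are excluded by the budget, and we never needed to know how bad a catastrophe could actually get, or whether reasonable plans exist that would exceed $U^*$.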

Towards a mechanistic understanding of corrigibility

I generally don't read links when there's no context provided, and think it's almost always worth it (from a cooperative perspective) to provide a bit of context.

Can you give me a TL;DR of why this is relevant or what your point is in posting this link?

Towards a mechanistic understanding of corrigibility

I'm not sure it's the same thing as alignment... it seems there are at least 3 concepts here, and Hjalmar is talking about the 2nd, which is importantly different from the 1st (sketched below the list):

  • "classic notion of alignment": The AI has the correct goal (represented internally, e.g. as a reward function)
  • "CIRL notion of alignment": AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner's mind)
  • "corrigibility": something else
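To spell out the difference between the first two (my notation, roughly): under the classic notion the agent directly optimizes a goal represented internally,

$$ \pi^* = \arg\max_{\pi} \; \mathbb{E}\!\left[ R^*(\tau) \mid \pi \right], $$

while under the CIRL notion the goal parameters $\theta$ live in the human's head, and the agent only has a pointer to them via a belief updated from human behavior $h$:

$$ \pi^* = \arg\max_{\pi} \; \mathbb{E}_{\theta \sim P(\theta \mid h)} \, \mathbb{E}\!\left[ R_\theta(\tau) \mid \pi \right]. $$

The first has the correct goal; the second has (at best) a correct pointer to it.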

Towards a mechanistic understanding of corrigibility

What do you mean "these things"?

Also, to clarify, when you say "not going to be useful for alignment", do you mean something like "...for alignment of arbitrarily capable systems"? i.e. do you think they could be useful for aligning systems that aren't too much smarter than humans?

Towards a mechanistic understanding of corrigibility

So IIUC, you're advocating trying to operate on beliefs rather than utility functions? But I don't understand why.

Towards a mechanistic understanding of corrigibility
We could instead verify that the model optimizes its objective while penalizing itself for becoming more able to optimize its objective.

As phrased, this sounds like it would require correctly (or at least conservatively) tuning the trade-off between these two goals, which might be difficult.
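
Concretely (my notation, not from the comment above), if the proposal amounts to something like

$$ \max_{\pi} \; U(\pi) \;-\; \lambda \cdot \Delta\text{Ability}(\pi), $$

where $\Delta\text{Ability}(\pi)$ measures how much a plan increases the model's ability to optimize its objective, then $\lambda$ has to land in a fairly narrow band: too small, and gaining ability is still net-positive for the model; too large, and ordinary useful plans, which will usually increase its ability at least a little, get penalized away. Hitting that band, or erring conservatively without destroying usefulness, seems like the hard part.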
