A story of how that happens:

In future (unsafe but) sophisticated systems, models will have access to external storage, and be given the ability to read/write. For example, AutoGPT-like systems will have shared scratchpads across instances of the model, to coordinate with themselves. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example, so that they can be referenced by local instances. And if we do some sort of oversight of that data storage, key facts that the model wants to preserve might be obfuscated.

Of course, this is only one story, and I don't expect it to be the only way such things could happen, but it seems to be a reasonable candidate as a failure mode for systems designed with normal (read: insufficient, but far more than otherwise expected,) levels of caution.

Reply

For alignment, we should simultaneously use multiple theories of cognition and value

David Manheim9mo10

For a defense of people pursuing a mathematical approach of a type you think isn't valuable, see my recent post.
(That does not address the correct issue you raised about requisite variety, but some work on HRAD does do so explicitly - such as embedded agency.)

Reply

On how various plans miss the hard bits of the alignment challenge

David Manheim10mo33

Just noting that given more recent developments than this post, we should be majorly updating on recent progress towards Andrew Critch's strategy. (Still not more likely than not to succeed, but we still need to assign some Bayes points to Critch, and take some away from Nate.)

Reply

Systems that cannot be unsafe cannot be safe

David Manheim1y41

I think it would be really good to come up with a framing of these intuitions that wouldn't be controversial.

That seems great, I'd be very happy for someone to write this up more clearly. My key point was about people's claims and confidence about safety, and yes, clearly that was communicated less well than I hoped.

Reply

Systems that cannot be unsafe cannot be safe

David Manheim1y22

That's true - and from what I can see, this emerges from the culture in academia. There, people are doing research, and the goal is to see if something can be done, or to see what happens if you try something new. That's fine for discovery, but it's insufficient for safety. And that's why certain types of research, ones that pose dangers to researchers or the public, have at least some degree of oversight which imposes safety requirements. ML does not, yet.

Reply

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

David Manheim1y33

Thanks, reading closely I see how you said that, but it wasn't clear initially. (There's an illusion of disagreement, which I'll christen the "twitter fight fallacy," where unless the opposite is said clearly, people automatically assume replies are disagreements.)

Reply

Your posts should be on arXiv

David Manheim1y20

I probably put in an extra 20-60 hours, so the total is probably closer to 150 - which surprises me. I will add that a lot of the conversion time was dealing with writing more, LaTeX figures and citations, which were all, I think, substantive valuable additions. (Changing to a more scholarly style was not substantively valuable, nor was struggling with latex margins and TikZ for the diagrams, and both took some part of the time.)

Reply

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

David Manheim1y10

Thanks, agreed. And as an aside, I don't think it's entirely coincidental that neither of the people who agree with you are in the Bay.

Reply

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

David Manheim1y30

I think that the costs usually are worth it far more often than it occurs, from an outside view - which was David's point, and what I was trying to respond to. I think that it's more valuable than one expects to actually just jump through the hoops. And especially for people who haven't yet ever had any outputs actually published, they really should do that at least once.

(Also, sorry for the zombie reply.)

Reply

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

David Manheim1y45

You're very unusually proactive, and I think the median member of the community would be far better served if they were more engaged the way you are. Doing that without traditional peer reviewed work is fine, but unusual, and in many ways is more difficult than peer-reviewed publication. And for early career researchers, I think it's hard to be taken seriously without some more legible record - you have a PhD, but many others don't.

Reply