I've got a slightly terrifying Hail Mary "solve alignment with this one weird trick"-style paradigm that I've been mulling over for the past few years. It seems to have the potential to solve corrigibility and a few other major problems (notably value loading without Goodharting, using an alternative to CEV that seems drastically easier to specify). A handful of challenging things are needed to make it work, but they look to me maybe more achievable than the requirements of the other proposals I've read that seem like they could scale to superintelligence.
Realistically, I'm not going to publish it anytime soon given my track record, but I'd be happy to have a call with anyone who'd like to poke at my models and try to turn it into something. I've had mildly positive responses from explaining it to Stuart Armstrong and Rob Miles, and everyone else I've talked to about it at least thought it was creative and interesting.