plex — AI Alignment Forum

I've got a slightly terrifying Hail Mary "solve alignment with this one weird trick"-style paradigm that I've been mulling over for the past few years. It seems to have the potential to solve corrigibility and a few other major problems (notably value loading without Goodharting, using an alternative to CEV which seems drastically easier to specify). There are a handful of challenging things needed to make it work, but to me they look maybe more achievable than what's required by the other proposals I've read that seem like they could scale to superintelligence.

Realistically, I am not going to publish it anytime soon given my track record, but I'd be happy to have a call with anyone who'd like to poke my models and try to turn it into something. I've had mildly positive responses from explaining it to Stuart Armstrong and Rob Miles, and everyone else I've talked to about it at least thought it was creative and interesting.

LLMs for Alignment Research: a safety priority?

plex2y20

We're likely to switch to Claude 3 soon, but we're currently using GPT-3.5. Initially, we mostly expect it to be useful as a way to interface with existing knowledge, but we could make an alternate prompt that is more optimized for acting as a research assistant brainstorming new ideas, if that was wanted.

Would it be useful to be able to set your own system prompt for this? Or have a default one?

Solve Corrigibility Week

plex4y00
