LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.
Unless you're positing a non-smooth model where we're keeping them at bay now but they'll increase later on?
This is what the "alignment is hard" people have been saying for a long time. (Some search terms here include "treacherous turn" and "sharp left turn")
https://www.lesswrong.com/w/treacherous-turn
A central AI alignment problem: capabilities generalization, and the sharp left turn
(my bad, I hadn't read the post at the time I commented, so this presumably came across as cluelessly patronizing)
To preempt a possible misunderstanding, I don't mean "don't try to think up new metaethical ideas", but instead "don't be so confident in your ideas that you'd be willing to deploy them in a highly consequential way, or build highly consequential systems that depend on them in a crucial way".
I think I had missed this, but it doesn't resolve the confusion in my #2 note. (Like, it still seems like something is weird about saying "solve metaphilosophy such that everyone can agree it is correct" is more worth considering than "solve metaethics such that everyone can agree it is correct". I can totally buy that they're qualitatively different, and I maybe have some guesses for why you think that. But I don't think the post spells out why, and it doesn't seem that obvious to me.)
Hmm, I like #1.
#2 feels like it's injecting some frame that's a bit weird to inject here (don't roll your own metaethics... but rolling your own metaphilosophy is okay?)
But also, I'm suddenly confused about who this post is trying to warn. Is it more like labs, or more like EA-ish people doing a wider variety of meta-work?
What are you supposed to do other than roll your own metaethics?
Mostly this has only been a sidequest I periodically mull over in the background. (I expect to someday focus more explicitly on it, although it might be more in the form of making sure someone else is tackling the problem intelligently).
But I did previously pose this as a kind of open question in What are important UI-shaped problems that Lightcone could tackle? and in JargonBot Beta Test (the latter notably didn't really work; I have hopes of trying again with a different tack). Thane Ruthenis replied with some ideas in this space (about making it easier to move between representations-of-a-problem).
https://www.lesswrong.com/posts/t46PYSvHHtJLxmrxn/what-are-important-ui-shaped-problems-that-lightcone-could
I think of many Wentworth posts as relevant background:
My personal work so far has been building a mix of exobrain tools that are more for rapid prototyping of complex prompts in general. (This has mostly been a side project I'm not primarily focused on atm.)
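For concreteness, here's a minimal sketch of the kind of thing I mean by "rapid prototyping of complex prompts": a tiny harness that expands a prompt template over several slot options so you can eyeball the variants side by side. The names and structure here are illustrative placeholders, not a description of the actual tools.

```python
# Hypothetical sketch of a minimal prompt-prototyping harness (illustrative only).
from dataclasses import dataclass
from itertools import product
from string import Template


@dataclass
class PromptVariant:
    """One rendered prompt plus the parameter choices that produced it."""
    params: dict
    text: str


def render_variants(template: str, options: dict[str, list[str]]) -> list[PromptVariant]:
    """Expand a template over every combination of the given slot options."""
    keys = list(options)
    variants = []
    for combo in product(*(options[k] for k in keys)):
        params = dict(zip(keys, combo))
        variants.append(PromptVariant(params, Template(template).substitute(params)))
    return variants


if __name__ == "__main__":
    template = "You are a $role. $instruction\n\nQuestion: $question"
    options = {
        "role": ["careful research assistant", "skeptical reviewer"],
        "instruction": ["Think step by step.", "List your key uncertainties first."],
        "question": ["What are the main open problems in this draft?"],
    }
    for v in render_variants(template, options):
        print("---", v.params)
        print(v.text)
```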
FYI, normally when I'm thinking about this, it's through the lens of "how do we help the researchers working on illegible problems?", more so than "how do we communicate illegibility?".
This post happened to ask the question "can AI advisors help with the latter?", so I was replying about that. But, for completeness: normally when I think about this problem, I resolve it as "what narrow capabilities can we build that are helpful 'to the workflow' of people solving illegible problems, and that aren't particularly bad from a capabilities standpoint?"
Re "can AI advisors help?"
A major thread of my thoughts these days is "can we make AIs more philosophically competent relative to their overall capability growth?". I'm not sure if it's doable, because the things you'd need in order to be good at philosophy are pretty central capabilities-ish things (i.e. the ability to reason precisely, notice confusion, convert confusion into useful questions, etc).
Curious if you have any thoughts on that.
(This sounds like a good blogpost title-concept btw, maybe for a slightly different post, i.e. "Decisionmakers need to understand the illegible problems of AI".)
Makes sense, although I want to flag one more argument: the takeaways people tend to remember from posts are the ones encapsulated in their titles. "Musings on X" style posts tend not to be remembered as much, and I think this is a fairly important post for people to remember.
Apologies, I hadn't actually read the post at the time I commented here.
In an earlier draft of the comment I did include a line that said "but, also, we're not even really at the point where this was supposed to be happening, the AIs are too dumb", but I removed it in a pass that was trying to simplify the whole comment.
But as of the last time I checked (maybe not within the past month), models are just nowhere near the level of worldmodeling/planning competence where scheming behavior should be expected.
(Also, as models get smart enough that this starts to matter: the way this often works in humans is that a person's conscious verbal planning loop ISN'T aware of their impending treachery. They earnestly believe themselves when they tell the boss "I'll get it done", and then later they just find themselves goofing off instead, or changing their mind.)