Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Not sure what I'll do next yet. Views are my own & do not represent those of my current or former employer(s). I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
(by Rob Wiblin)
(xkcd)
My EA Journey, depicted on the whiteboard at CLR:
(h/t Scott Alexander)
That's my understanding too. I hope they get access to do better experiments with less hand-holdy prompts.
Are you describing something that would fit within my 'Another type of proposal...' category?
I know of no rigorous proposals. The general challenge such proposals face is that if you are relying on fooling your AGI about something to keep control over it, and it's constantly and rapidly getting smarter and wiser... that's a recipe for your scheme to fail suddenly and silently (when it stops being fooled), which is a recipe for disaster.
Another type of proposal relies on making it actually true that the AI might be in a simulation -- or, to put it more precisely, making it actually the case that future aligned superintelligences will run simulations so accurate that even a baby superintelligence can't tell the difference. However, two can play at that game; more generally, this just becomes a special case of acausal trade stuff, which will be wild and confusing and very important once AIs are smart enough to take it seriously.
I feel like we are misunderstanding each other, and I think it's at least in large part my fault.
I definitely agree that we don't want to be handing out grants or judging people on the basis of what shibboleths they spout or what community they are part of. In fact I agree with most of what you've said above, except for when you start attributing stuff to me.
I think that grantmakers should evaluate research proposals not on the basis of the intentions of the researcher, but on the basis of whether the proposed research seems useful for alignment. This is not your view though, right? You are proposing something else?
Answering your first question: If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?)... great! That's neither capabilities nor alignment research afaict, but basic science. Good for you. I still think it would be better if you switched to doing alignment research (e.g. you could switch to 'I want to understand how neural networks work... so that I can understand how a prosaic AGI system being RLHF'd might behave when presented with genuine credible opportunities for takeover + lots of time to think about what to do'), but I don't feel as strongly about it as I would if you were doing capabilities research.
Re: judging the research itself rather than the motivations: idk, I think it's actually easier, and less subjective, to judge the motivations, at least in this case. People usually just state what their motivations are. Also, I'm not primarily trying to judge people, I'm trying to exhort people -- I'm giving people advice about what they should do to make the world a better place, not e.g. talking about which research should be published and which should be restricted. (I'm generally in favor of publishing research, with maybe a few exceptions; if anything I think corporations on the margin should publish more.) Moreover, it's much easier for researchers to judge their own motivations than to judge the long-term impact of their work or to fit it into your diagram.
I like my own definition of alignment vs. capabilities research better:
"Alignment research is when your research goals are primarily about how to make AIs aligned; capabilities research is when your research goals are primarily about how to make AIs more capable."
I think it's very important that lots of people currently doing capabilities research switch to doing alignment research. That is, I think it's very important that lots of people who are currently waking up every day thinking 'how can I design a training run that will result in AGI?' switch to waking up every day thinking 'Suppose my colleagues do in fact get to AGI in something like the current paradigm, and they apply standard alignment techniques -- what would happen? Would it be aligned? How can I improve the odds that it would be aligned?'
Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)
Thanks for sharing this! Keep up the good work. I'm excited to hear about your new research directions/strategy!
This longform article contains a ton of tidbits of info justifying the conclusion that censorship in the USA has indeed been increasing since at least 2016 or so, and is generally more severe and intentional/coordinated than most people seem to believe. https://www.tabletmag.com/sections/news/articles/guide-understanding-hoax-century-thirteen-ways-looking-disinformation#democracy
I was sorta aware of things like this already when I wrote the OP in 2021, but only sorta; mostly I was reasoning from first principles about what LLM technology would enable. I think I was mostly wrong; censorship hasn't progressed as quickly as I forecast in this story. I think... I don't actually know what the social media companies' setups are. But e.g. according to the above article:
Then there is the work going on at the National Science Foundation, a government agency that funds research in universities and private institutions. The NSF has its own program called the Convergence Accelerator Track F, which is helping to incubate a dozen automated disinformation-detection technologies explicitly designed to monitor issues like “vaccine hesitancy and electoral skepticism.”
...
In March, the NSF’s chief information officer, Dorothy Aronson, announced that the agency was “building a set of use cases” to explore how it could employ ChatGPT, the AI language model capable of a reasonable simulation of human speech, to further automate the production and dissemination of state propaganda.
So if they are just talking about integrating ChatGPT into it now, they probably haven't integrated lesser LLMs either, and generally speaking the integration of new AI tech into censorship, recommendation algorithms, etc. is proceeding more slowly than I forecast.
It seems to me that both you and Joe are thinking about this very similarly -- you are modelling the AIs as akin to rational agents that consider the costs and benefits of their various possible actions and maximize-subject-to-constraints. Surely there must be a way to translate between your framework and his.
As for the examples... so do you agree then? Violent or unlawful takeovers are not the only kinds people can and should be worried about? (If you think bribery is illegal, which it probably is, modify my example so that they use a lobbying method which isn't illegal. The point is, they find some unethical but totally legal way to boil the oceans.) As for violence... we don't consider other kinds of pollution to be violent, e.g. that done by coal companies that are (slowly) melting the icecaps and causing floods etc., so I say we shouldn't consider this to be violent either.
I'm still curious to hear what your p(misaligned-AIs-take-over-the-world-within-my-lifetime) is, including gradual nonviolent peaceful takeovers. And what your p(misaligned-AIs-take-over-the-world-within-my-lifetime | corporations achieve AGI by 2027 and do only basically what they are currently doing to try to align them) is.
Unfortunately I think it's simply very difficult to reliably distinguish between genuine good-faith persuasion and propaganda over speculative future scenarios. Your example is on the extreme end of what's possible in my view, and most realistic scenarios will likely instead be somewhere in-between, with substantial moral ambiguity.
I'm not sure what this paragraph is doing -- I said myself they were extreme examples. What does your first sentence mean?
(TBC I expect said better experiments to find nothing super scary, because I think current models are probably pretty nice especially in obvious situations. I'm more worried about future models in scarier situations during takeoff.)