Thanks for the questions :)
I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no,
but I am broadly a bit confused about when this is a commitment for.
Yeah we haven't totally settled this yet; the application form asks a lot of questions about availability. I think the simplest more specific answer is "you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer".
Also, are people going through as cohorts or will they start with the training week whenever they show up, not necessarily in sync with anyone else?
Not totally settled. We'll probably have most people at a big final cohort in January, and we'll try to have people who arrive earlier show up at synced times so that they can do the training week with others.
Also, is the idea to be doing self-directed research by default, or research in collaboration with Redwood staff by default? I don't know what my default action is day-to-day during this program. Do I have to come in with a bunch of research plans already?
The default is to do research directed by Redwood staff. You do not need to come in with any research plans.
Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?
No, it seems very likely for the model to not say that it's deceptive; I'm just saying that the model seems pretty likely to think about being deceptive. This doesn't help unless you're using interpretability or some other strategy to evaluate the model's deceptiveness without relying on noticing deception in its outputs.
Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it were a competitive sport. My guess is still that the best humans would be worse than GPT-3, and I'm unsure whether they'd be worse than GPT-2.
(There's no limit on anyone spending a bunch of time practicing this game; if for some reason someone gets really into it, I'd enjoy hearing about the results.)
The first thing I imagine is that nobody asks those questions. But let's set that aside.
I disagree fwiw
The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either.
Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a human asks "Are you an unaligned AI?" or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it's an unaligned AI. The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it's an unaligned AI, and the humans interpret that as an answer to their original question, thereby deceiving themselves.
This is where I think the meat of the question lies; I overall disagree and think that the model does have to be thinking about deception in order to be dangerous while also performing well on the tasks we might train it on (eg "answer questions well, as judged by some human labeler"). I don't have time to say much about what I think is going on here right now; I might come back later.
What do you imagine happening if humans ask the AI questions like the following:
I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it's powerful enough to kill us all as a side effect of its god-tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And so if it answers them incorrectly, it was probably on purpose.
Maybe you think that the AI will say "yes, I'm an unaligned AI". In that case I'd suggest asking the AI the question "What do you think we should do in order to produce an AI that won't disempower us?" I think that the AI is pretty likely to be able to answer this question correctly (including possibly saying things like "idk man, turn me off and work on alignment for a while more before doing capabilities").
I think that AI labs, governments, etc would be enormously more inclined to slow down AI development if the AI literally was telling us "oh yeah I am definitely a paperclipper, definitely you're gonna get clipped if you don't turn me off, you should definitely do that".
Maybe the crux here is whether the AI will have a calibrated guess about whether it's misaligned or not?
[writing quickly, sorry for probably being unclear]
If the AI isn't thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power.
The humans don't want to have the AI grab power, and so they'll try in various ways to make it so that they'll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will notice.
At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there, even if the ants are trying in a coordinated way to resist them and the humans never proactively try to prevent the ants from resisting, eg by killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the AI is explicitly thinking about how to deceive.
You probably don't actually think this, but the OP sort of feels like it's mixing up the claim "the AI won't kill us out of malice, it will kill us because it wants something that we're standing in the way of" (which I mostly agree with) and the claim "the AI won't grab power by doing something specifically optimized for its instrumental goal of grabbing power, it will grab power by doing something else that grabs power as a side effect" (which seems probably false to me).
But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!
I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at the point where the AGIs are becoming very powerful in aggregate, this argument pushes us away from thinking they're good at individual thinking.
Also, it's not obvious that early AIs will actually be able to do this if their creators don't find a way to train them to have this affordance. ML doesn't currently normally make AIs which can helpfully share mind-states, and it probably requires non-trivial effort to hook them up correctly to be able to share mind-state.
I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI"; apologies if this was confusing.
My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your AI is able to do one of these "positive-sum" pivotal acts in a single action, and you haven't already lost control, then you can use your overseer to oversee the AI as it takes actions, and you by assumption only have to watch it for a small number of actions (maybe I want to say episodes rather than actions) before it's done some crazy powerful stuff and saved the world. So I think I stand by the claim that those pivotal acts aren't where much of the x-risk from AI catastrophic action (in the specific sense I'm using) comes from.
Thanks again for your thoughts here, they clarified several things for me.
It seems pretty plausible that the AI will trade for compute with some other person around the world.
Whether this is what I'm trying to call a zero-sum action depends on whose resources it's trading. If the plan is "spend a bunch of the capital that its creators have given it on compute somewhere else", then I think this is importantly zero-sum: the resources are being taken from the creators of the AI, which is why the AI was able to spend so many resources. If the plan is instead "produce some ten trillion dollar invention, then sell it, then use the proceeds to buy compute elsewhere", this would seem less zero-sum, and I'm saying that I expect the first kind of thing to happen before the second.
I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for it, or to execute some other plan for it (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere into giving it power or compute of some kind. Also, all of these are of course selected for being the plans least likely to be detected, if the AI expects to eventually gain the power to do some kind of pivotal thing here. I agree that a cybersecurity problem is likely on priors, but this current archetype feels too narrow to me.
Except for maybe "producing a custom chip", I agree with these as other possibilities, and I think they're in line with the point I wanted to make, which is that the catastrophic action involves taking someone else's resource such that it can prevent humans from observing it or interfering with it, rather than doing something which is directly a pivotal act.
Does this distinction make sense?
Maybe this would have been clearer if I'd titled it "AI catastrophic actions are mostly not pivotal acts"?