Issa Rice

I am Issa Rice.


Introduction to Cartesian Frames

So the existence of this interface implies that A is “weaker” in a sense than A’.

Should that say B instead of A', or have I misunderstood? (I haven't read most of the sequence.)

The Alignment Problem: Machine Learning and Human Values

Does anyone know how Brian Christian came to be interested in AI alignment and why he decided to write this book instead of a book about a different topic? (I haven't read the book; I looked at the Amazon preview but couldn't find the answer there.)

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

HCH is the result of a potentially infinite exponential process (see figure 1) and is thereby computationally intractable. In reality, we cannot break down any task into its smallest parts and solve these subtasks one after another because that would take too much computation. This is why we need to iterate distillation and amplification and cannot just amplify.

In general, your post talks about amplification (and HCH) as increasing the capability of the system and distillation as saving on computation/making things more efficient. But my understanding, based on this conversation with Rohin Shah, is that amplification is also intended to save on computation (otherwise we could just try to imitate humans). In other words, the distillation procedure is able to learn more quickly by training on data provided by the amplified system than by training on data from the unamplified system. So I don't like the phrasing that distillation is the part that's there to save on computation, because both parts seem to be aimed at that.

(I am making this comment because I want to check my understanding with you, or make sure you understand this point, since it doesn't seem to be stated in your post. It was one of the most confusing things about IDA to me, and I'm still not sure I fully understand it.)
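To check that I have the structure right, here is a minimal sketch of how I am currently picturing it. This is my own illustration, not anything from your post: hch is the intractable ideal, amplify is one step of approximating it with the current fast model standing in for the subtrees, and ida_step is the distillation that compresses the amplified behaviour back into a fast model. The function names and the simple question-decomposition interface are assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple

Question = str
Answer = str
Model = Callable[[Question], Answer]

def hch(answer_directly: Callable[[Question], Answer],
        decompose: Callable[[Question], List[Question]],
        combine: Callable[[Question, List[Answer]], Answer],
        question: Question,
        depth: int) -> Answer:
    """HCH: a human answers by consulting copies of the human on subquestions.
    The tree branches exponentially with depth, which is why it is treated as
    an ideal rather than something we could actually run."""
    if depth == 0:
        return answer_directly(question)
    subanswers = [hch(answer_directly, decompose, combine, q, depth - 1)
                  for q in decompose(question)]
    return combine(question, subanswers)

def amplify(decompose: Callable[[Question], List[Question]],
            combine: Callable[[Question, List[Answer]], Answer],
            model: Model,
            question: Question) -> Answer:
    """One amplification step: the human decomposes the question and delegates
    the subquestions to the current fast model, so the human's own work stays
    bounded even on hard questions (the sense in which amplification also
    saves computation)."""
    return combine(question, [model(q) for q in decompose(question)])

def ida_step(decompose: Callable[[Question], List[Question]],
             combine: Callable[[Question, List[Answer]], Answer],
             fit: Callable[[List[Tuple[Question, Answer]]], Model],
             model: Model,
             questions: List[Question]) -> Model:
    """Distillation: train a new fast model on the amplified system's answers,
    so that answering no longer requires the human in the loop at all."""
    data = [(q, amplify(decompose, combine, model, q)) for q in questions]
    return fit(data)
```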

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

I still don't understand how corrigibility and intent alignment are different. If neither implies the other (as Paul says in his comment starting with "I don't really think this is true"), then there must be examples of AI systems that have one property but not the other. What would a corrigible but not-intent-aligned AI system look like?

I also had the thought that the implicative structure (between corrigibility and intent alignment) seems to depend on how the AI is used, i.e. on the particulars of the user/overseer. For example, if you have an intent-aligned AI and the user is careful not to deploy the AI in scenarios that would leave them disempowered, then that seems like a corrigible AI. So for this particular user, it seems like intent alignment implies corrigibility. Is that right?

The implicative structure might also be different depending on the capability of the AI: for a dumb AI, corrigibility and intent alignment might be equivalent, but the two concepts might come apart for more capable AIs.

My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda

IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it’s running.

I don't see how the "we can still correct once it’s running" part can be true given this footnote:

However, I think at some point we will probably have the AI system autonomously execute the distillation and amplification steps or otherwise get outcompeted. And even before that point we might find some other way to train the AI in breaking down tasks that doesn’t involve human interaction.

After a certain point, it seems like the thing overseeing the AI system is another AI system, and saying that "we" can correct the first AI system seems like a confusing way to phrase this situation. Do you think I've understood this correctly / what do you think?

What are the high-level approaches to AI alignment?
How does iterated amplification exceed human abilities?

I'm confused about the tradeoff you're describing. Why is the first bullet point "Generating better ground truth data"? It would make more sense to me if it said instead something like "Generating large amounts of non-ground-truth data". In other words, the thing that amplification seems to be providing is access to more data (even if that data isn't the ground truth that is provided by the original human).

Also, in the second bullet point, by "increasing the amount of data that you train on" I think you mean increasing the amount of data from the original human (rather than data coming from the amplified system), but I want to confirm.

Aside from that, I think my main confusion now is pedagogical (rather than technical). I don't understand why the IDA post and paper don't emphasize the efficiency of training. The post even says "Resource and time cost during training is a more open question; I haven’t explored the assumptions that would have to hold for the IDA training process to be practically feasible or resource-competitive with other AI projects", which makes it sound like the efficiency of training isn't important.

How does iterated amplification exceed human abilities?

The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).

I think this is the crux of my confusion, so I would appreciate it if you could elaborate on this. (Everything else in your answer makes sense to me.) In Evans et al., during the distillation step, the model learns to solve the difficult tasks directly by using example solutions from the amplification step. But if it can do that, then why can't it also learn directly from examples provided by the human?

To use your analogy, I have no doubt that a team of Rohins, or a single Rohin thinking for days, can answer any question that I can (given a single day). But with distillation, you're saying there's a robot that can learn to answer any question I can (given a single day) by first observing the team of Rohins for long enough. If the robot can do that, why can't it also learn to do the same thing by observing me for long enough?
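To make the comparison concrete, here is a purely illustrative sketch of how I am picturing the two setups; fit, amplified_team, and single_human are hypothetical stand-ins rather than anything from your answer or from Evans et al.

```python
from typing import Callable, List, Tuple

Question = str
Answer = str
Model = Callable[[Question], Answer]

def imitate(fit: Callable[[List[Tuple[Question, Answer]]], Model],
            demonstrator: Callable[[Question], Answer],
            questions: List[Question]) -> Model:
    """Ordinary supervised imitation: collect (question, answer) pairs from
    some demonstrator and train a model on them."""
    data = [(q, demonstrator(q)) for q in questions]
    return fit(data)

# Option A: distill the amplified system (the team of Rohins).
#     robot_a = imitate(fit, amplified_team, questions)
# Option B: imitate the unaided human (me) directly.
#     robot_b = imitate(fit, single_human, questions)
#
# The training call looks the same either way, so my question is what makes
# option A's demonstrations better, or cheaper to learn from, than option B's.
```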

How special are human brains among animal brains?

It seems like "agricultural revolution" is used to mean both the beginning of agriculture ("First Agricultural Revolution") and the 18th-century agricultural revolution ("Second Agricultural Revolution").

What are some exercises for building/generating intuitions about key disagreements in AI alignment?

I have only a very vague idea of what you mean. Could you give an example of how one would do this?
